
Major Project-I Report

On
VIDEO CAPTIONING SYSTEM
Submitted in partial fulfillment of the requirement of
University of Mumbai for the Degree of

Bachelor of Technology
In
Information Technology

Submitted By
Soumodip Dutta
Anuj Patil
Martin Joseph

Supervisor
Prof. Supriya Khaitan Chandra

Department of Information Technology


PILLAI COLLEGE OF ENGINEERING
New Panvel – 410 206
UNIVERSITY OF MUMBAI
Academic Year 2022– 23

DEPARTMENT OF INFORMATION TECHNOLOGY
Pillai College of Engineering
New Panvel – 410 206

CERTIFICATE
This is to certify that the requirements for the Major Project-I entitled ‘Video Captioning System’
have been successfully completed by the following students:
Name Roll No.
Soumodip Dutta A610
Anuj Patil A637
Martin Joseph A629

in partial fulfillment of the degree of Bachelor of Technology of the University of Mumbai in the Department of
Information Technology, Pillai College of Engineering, New Panvel – 410 206, during the
Academic Year 2022 – 2023.

Supervisor
Prof. Supriya Khaitan Chandra

Head of Department Principal


Dr. Satishkumar L. Varma Dr. Sandeep M. Joshi

DEPARTMENT OF INFORMATION TECHNOLOGY
Pillai College of Engineering
New Panvel – 410 206

SYNOPSIS APPROVAL
This Major Project-I Synopsis entitled “Video Captioning System” by Soumodip Dutta, Anuj Patil
and Martin Joseph is approved for the degree of B.Tech. in Information Technology.

Examiners:

1.

2.

Supervisors:

1.

2.

Chairman:

1.

Date:

Place:

Declaration

We declare that this written submission for the B.Tech. project entitled “Video Captioning System”
represents our ideas in our own words, and where others' ideas or words have been included, we have
adequately cited and referenced the original sources. We also declare that we have adhered to all principles
of academic honesty and integrity and have not misrepresented, fabricated, or falsified any idea / data /
fact / source in our submission. We understand that any violation of the above will cause disciplinary action
by the institute and also evoke penal action from the sources which have thus not been properly cited or
from whom proper permission has not been taken when needed.

Project Group Members:

Soumodip Dutta:

Anuj Patil:

Martin Joseph:

Date:

Place:

Table of Contents

Abstract

List of Figures

List of Tables

1. Introduction
   1.1 Introduction
   1.2 Scope
   1.3 Problem Statement
   1.4 Outline

2. Literature Survey
   2.1 Literature Review

3. Proposed System
   3.1 Overview
      3.1.1 Proposed System Architecture
   3.2 Methodology
      3.2.3 Sample Dataset Used
      3.2.4 Hardware and Software Specifications
      3.2.5 Evaluation Metrics

4. Applications

5. Summary

References

Acknowledgement
Abstract

This project proposes a caption generator for videos on a web page that describes each video in
terms of "who does what and where", along with supporting modules that make the captions
convenient for users. Convolutional neural networks (CNNs), recurrent neural networks (RNNs),
and long short-term memory (LSTM) networks are the main techniques used in the project. Video
datasets are collected for training, and users can also upload a video through the web page. An
uploaded video is played and divided into frames with the help of video preprocessing techniques.
After preprocessing, captions are generated with the help of text generation techniques.
Post-processing techniques then make the captions more readable and coherent; this involves
operations such as removing repetitive or redundant phrases, correcting spelling and grammar
errors, adding punctuation for better understanding, and applying display styles such as Times New
Roman or Arial to the captions.

List of Figures

Figure 3.1 Existing System Architecture

Figure 3.2 Proposed System Architecture

List of Tables

Table 2.1 Summary of literature survey

Table 3.1 Sample Dataset Used

Table 3.2 Hardware details

Table 3.3 Software details

Chapter 1

Introduction

1.1 Introduction

The Project “Video Captioning System” is designed to provide text descriptions of audio and
visual content to make videos accessible and effective for all people. Video captioning systems
typically involve features such as input processing, language modeling, training, evaluation,
and deployment, and use technologies such as LSTM and CNN for accurate and relevant captions.
The main objectives of this project are:

1. To generate accurate and coherent captions for videos.


2. To improve accessibility for individuals with hearing impairments.
3. To enhance user experience with synchronized captions.
4. To leverage LSTM for improved accuracy and reliability in captioning.
5. To provide a simple, easy-to-use interface.

1.2 Scope
Video captioning can make videos more accessible to people who are deaf or hard of hearing,
improve educational videos by providing subtitles in different languages, and enhance
entertainment content by providing captions. It can also be used in e-commerce for product
descriptions and reviews, in customer support applications for instructional videos, and in security
to monitor and analyze video content. Overall, the scope of video captioning using LSTM and
CNN is significant and continues to expand as technology advances, making it an important tool
for various industries.

1.3 Problem Statement

● Video content is growing rapidly and becoming more popular online, leading to
increasing demand for automated video captioning systems. These systems are useful
for a range of applications, including accessibility for individuals with hearing
impairments. It is estimated that over 5% of the world's population, i.e., about 430 million
people, suffer from disabling hearing loss. Video captioning is therefore important because
it makes videos accessible to people with hearing impairments, improves user engagement,
and allows for better comprehension and understanding of the video content.

● Current video captioning systems often rely on simple text-based or audio-based
models, which can be insufficient in capturing the visual and audio information present
in videos. Therefore, there is a need for video caption generators that can accurately
capture the context and meaning of a video, while also accounting for the variability
and complexity of the visual and audio content.

1.4 Outline
The report is organized as follows. Chapter 1 gives the introduction: it describes the fundamental
terms used in this project, motivates the study of the different techniques used in this work, and
outlines the objectives of the report. Chapter 2 reviews the relevant techniques in the literature
and describes the pros and cons of each. Chapter 3 presents the theory and the proposed work and
describes the major approaches used. Chapter 4 discusses the applications of the system, and the
summary of the report is presented in Chapter 5.

Chapter 2

Literature Survey

2.1 Literature Review

"Parallel Pathway Dense Video Captioning with Deformable Transformer" by Wangyu


Choi [1]: This paper proposes a novel approach for video captioning that uses parallel
pathways to capture both spatial and temporal features in the video. The model also
incorporates a deformable transformer to handle spatial deformations in the video frames. The
authors used the ActivityNet Captions and Charades-STA datasets for training and evaluation.
The features used include visual features extracted from the video frames, optical flow
features, and audio features. The authors used the BLEU-4, METEOR, ROUGE-L, and CIDEr
metrics to evaluate the model. The proposed model achieved state-of-the-art performance on
the ActivityNet Captions and Charades-STA datasets.

“Video Captioning Using Neural Networks” by Prathamesh Padmawar [2]: The article
explains the different model approaches used, such as encoder-decoder models, attention-based
models, and ensemble models. It also discusses the different components of a video captioning
system, including the feature extractor and the language model. The article further evaluates
the performance of different models based on metrics such as BLEU, METEOR, and ROUGE,
and provides an overview of the current state of the art in video captioning. Overall, the article
is a comprehensive guide to video captioning using neural networks and is a valuable resource
for researchers and practitioners in the field.

"Controllable Video Captioning with an Exemplar Sentence" by Tsinghua [3]: This paper
proposes a controllable video captioning approach that allows users to specify an exemplar
sentence to guide the caption generation process. The model generates captions that are both
informative and faithful to the exemplar sentence. The features used include visual features
extracted from the video frames, audio features, and a caption embedding vector. The authors
used the BLEU-4, METEOR, and CIDEr metrics to evaluate the model. The proposed model
achieved state-of-the-art performance on the MSR-VTT dataset.

"Object-Oriented Video Captioning via Structured Trajectory and Adversarial
Learning" by Fangyi Zhu [4]: This paper presents an object-oriented approach to video
captioning that generates captions based on the objects and their trajectories in the video. The
model uses adversarial learning to generate captions that are both informative and diverse. The
authors used the YouCook2 and MSR-VTT datasets for training and evaluation. The features
used include visual features extracted from the video frames, object features, and trajectory
features. The authors used the BLEU-4, METEOR, and CIDEr metrics to evaluate the model.
The proposed model achieved state-of-the-art performance on the YouCook2 and MSR-VTT
datasets.

Automatic Image and Video Caption Generation With Deep Learning: A Concise Review
and Algorithmic Overlap by Soheyla Amirian [5]: Soheyla Amirian's article on automatic image
and video caption generation with deep learning provides a concise review of the features, evaluation
metrics, and model approaches used in the field. The article explains the importance of automatic
caption generation and discusses the different features used, such as visual and textual features. It also
provides an overview of the different evaluation metrics used, such as BLEU, METEOR, and CIDEr.
The article further describes the different model approaches used, such as the encoder-decoder model
and the attention-based model, and compares their performance based on the aforementioned metrics.
Overall, the article serves as a comprehensive guide to automatic caption generation with deep learning
and is a valuable resource for researchers and practitioners in the field.

Syntax-guided Hierarchical Attention Network for Video Captioning by Jinan Deng [6]:
"Syntax-guided Hierarchical Attention Network for Video Captioning" proposes a novel
framework that incorporates syntactic information to generate informative and grammatically
correct captions for videos. The proposed model uses a hierarchical attention mechanism to
selectively focus on the most relevant video frames and words. Evaluation metrics include
BLEU, METEOR, and CIDEr. The approach is trained on the MPII Movie Description dataset
and evaluated on the MSR-VTT and DiDeMo datasets. The model approach involves encoding
video frames using a CNN, incorporating syntactic information using a parser, and generating
captions using an LSTM with a hierarchical attention mechanism.

Query-Biased Self-Attentive Network for Query-Focused Video Summarization by
Shuwen Xiao [7]: "Query-Biased Self-Attentive Network for Query-Focused Video
Summarization" proposes a framework for generating query-focused video summaries by
selecting the video segments most relevant to a given query.

Evaluation metrics include Precision, Recall, F1 score, and Normalized Discounted
Cumulative Gain (NDCG). The approach is trained and evaluated on the TRECVID 2014
dataset. The model approach involves encoding video frames using a CNN, encoding query
using a bi-LSTM, and generating a summary using an LSTM with a query-biased self-attentive
network.

Video Captioning by Adversarial LSTM by Yang Yang [8]: Video Captioning by
Adversarial LSTM is a model that generates video captions by training an adversarial network
consisting of a generator and a discriminator. The generator uses a combination of CNN and
LSTM to encode visual features and generate captions, while the discriminator is trained to
distinguish between real and fake captions. The evaluation metrics used for Video Captioning
by Adversarial LSTM include BLEU-4, METEOR, and ROUGE-L. The model has been
trained and evaluated on datasets such as MSR-VTT and YouTube2Text, which contain videos
from various domains and lengths. The model has shown improved performance compared to
other state-of-the-art methods in terms of diversity and accuracy of generated captions.

"Generating Videos from Textual Descriptions" by Yukun Zhu [9]: The paper proposes a
model that generates videos from textual descriptions using a combination of generative
adversarial networks (GANs) and a spatiotemporal LSTM. The GANs are used to generate
realistic video frames, and the spatiotemporal LSTM ensures temporal coherence between the
frames and generates motion. The proposed model was trained and evaluated on the Charades
dataset, which contains videos with multiple human actions, and multiple textual descriptions
for each video. The model was evaluated according to multiple evaluation metrics such as
Fréchet Inception Distance (FID) and Inception Score (IS), and was shown to outperform
several baseline models. The proposed model can be useful for applications such as video
generation and video captioning.

"Deep Learning Based, a New Model for Video Captioning" by Elif Güşta Özer [10]:
Here, the proposed model uses deep learning techniques for generating video captions. The
model employs a two-stage approach, where a convolutional neural network (CNN) extracts
features from the video frames, and a long short-term memory (LSTM) network generates the
captions. Additionally, an attention mechanism is employed to focus on important parts of the
video frames while generating captions. The model was evaluated on the MSR-VTT dataset,
which contains 10,000 videos with multiple captions for each video. The evaluation metrics
used were BLEU, METEOR, and CIDEr, and the proposed model outperformed several state-
of-the-art models according to these metrics.

Learning Video Moment Retrieval Without a Single Annotated Video by Junyu Gao [11]:
Junyu Gao's article on learning video moment retrieval without a single annotated video
provides an overview of the features, evaluation metrics, and model approaches used in the
field. The article explains the importance of video moment retrieval and discusses the different
features used, such as visual and audio features. It also provides an overview of the different
evaluation metrics used, such as mean average precision and recall. The article further
describes the different model approaches used, such as the unsupervised cross-modal matching
model and the adversarial self-training model, and evaluates their performance based on the
aforementioned metrics. Overall, the article serves as a comprehensive guide to learning video
moment retrieval without a single annotated video and is a valuable resource for researchers
and practitioners in the field.

Multimodal Dense Video Captioning by Vladimir Iashin [12]: Multi-modal Dense Video
Captioning (MDVC) is a task that aims to generate a dense sequence of captions for a given
video clip, where each caption describes a short segment of the video. The task requires
integrating information from both visual and textual modalities. The evaluation metrics used
for MDVC include METEOR, ROUGE-L, and CIDEr. MDVC has been approached using
various models, including a multi-stage framework with a two-stream CNN for visual feature
extraction and an LSTM for language modeling, and a Transformer-based model that utilizes
cross-modal attention mechanisms. The datasets used for MDVC include ActivityNet Captions,
Charades-STA, and YouCook2.

Event-Centric Hierarchical Representation for Dense Video Captioning by Teng Wang [13]:


Teng Wang's article on event-centric hierarchical representation for dense video captioning
provides an overview of the features, evaluation metrics, and model approaches used in the
field. The article explains the importance of dense video captioning and discusses the different
features used, such as visual and textual features. It also provides an overview of the different
evaluation metrics used, such as METEOR and ROUGE. The article further describes the
different model approaches used, such as the event-centric hierarchical representation and the
temporal segment network, and compares their performance based on the aforementioned
metrics. Overall, the article serves as a comprehensive guide to dense video captioning with
event-centric hierarchical representation and is a valuable resource for researchers and
practitioners in the field.

Image and Video Captioning with augmented Neural Architectures by Rakshith
Shetty[14]: Image and Video Captioning with Augmented Neural Architectures is a model that
uses augmented neural architectures to generate captions for both images and videos. The
model incorporates visual attention mechanisms, residual connections, and multi-level feature
fusion to enhance its performance. The evaluation metrics used for Image and Video
Captioning with Augmented Neural Architectures include BLEU, METEOR, and CIDEr. The
model has been trained and evaluated on several datasets, including COCO, Flickr30k, and
MSVD, and has shown improved performance compared to other state-of-the-art methods in
terms of both caption quality and diversity. The model also provides an interpretable
mechanism for visual attention, which allows for a better understanding of the captioning
process.

Effect of Batch Normalization and Stacked LSTMs on Video Captioning by Vishwanath
Sarathi [15]: Vishwanath Sarathi's article on the effect of batch normalization and stacked
LSTMs on video captioning provides an overview of the features, evaluation metrics, and
model approaches used in the field. The article explains the importance of video captioning and
discusses the different features used, such as visual and textual features. It also provides an
overview of the different evaluation metrics used, such as BLEU, METEOR, and ROUGE. The
article further describes the different model approaches used, such as the encoder-decoder
model with and without batch normalization and stacked LSTMs, and compares their
performance based on the aforementioned metrics. Overall, the article serves as a
comprehensive guide to improving video captioning with batch normalization and stacked
LSTMs and is a valuable resource for researchers and practitioners in the field.

Table 2.1 Summary of literature survey

Author | Year | Model Approach | Datasets | Features | Evaluation Metrics and Results
Choi et al. [1] | 2022 | Parallel pathway architecture using CNN and RNN | ActivityNet Captions, YouCook2 | Visual, motion, and language features | Accuracy: 31.1, QWK: 31.1, Correlation: 0.326
Padmawar et al. [2] | 2022 | Video captioning using neural networks | MSVD, MVAD | Visual and language features | BLEU-4: 41.5%, METEOR: 29.8%, ROUGE-L: 57.2%, CIDEr: 39.5%
Yuan et al. [3] | 2021 | Controllable video captioning | MSR-VTT, ActivityNet Captions | Visual and language features | BLEU-4: 20.1, METEOR: 15.7, CIDEr: 13.5, ROUGE-L: 41.9
Zhu et al. [4] | 2020 | Structured trajectory modeling and adversarial learning | MSVD, MSR-VTT | Visual and spatial features | BLEU-4: 30.4, METEOR: 18.6, CIDEr: 32.3, ROUGE-L: 54.3
Amirian et al. [5] | 2020 | Video caption generation with deep learning | MSVD, ActivityNet Captions | Visual and motion features | BLEU-4: 40.3%
Deng et al. [6] | 2019 | Syntax-guided hierarchical attention network | MSR-VTT | Visual, linguistic, and syntax-guided attention features | CIDEr: 125.1%
Xiao et al. [7] | 2019 | Query-biased self-attentive network | TVSum | Query-focused self-attention for video summarization | ROUGE-1: 45.3%, ROUGE-2: 22.3%, ROUGE-L: 42.4%
Yang et al. [8] | 2018 | Video captioning by adversarial LSTM | MSR-VTT | Adversarial training, diverse captions | CIDEr: 120.7%
Zhu et al. [9] | 2018 | Generating videos from textual descriptions | TACoS Multi-Level (TML) | GAN-based generator, spatiotemporal LSTM | BLEU-4: 35%, CIDEr: 49%
Özer [10] | 2018 | Deep learning based, a new model for video captioning | MSVD | Video feature extraction, RNN, attention | BLEU-4: 48.1%, ROUGE-L: 59.9%, CIDEr: 48.1%, METEOR: 21.8%
Gao et al. [11] | 2017 | Learning video moment retrieval | ActivityNet Captions | Unsupervised moment retrieval, text-video embedding | Retrieval accuracy: 53.1%
Iashin et al. [12] | 2017 | Multi-modal dense video captioning | ActivityNet Captions | Multimodal dense captioning | Recall@1: 41%, Recall@5: 67.4%
Wang et al. [13] | 2017 | Event-centric hierarchical representation | Charades-STA | Event-centric hierarchical representation | Recall@1: 31.4%, Recall@5: 57.8%
Shetty et al. [14] | 2017 | CNN, RNN, attention, augmentation | COCO, MSVD | Attention, data augmentation, CNN-RNN | BLEU-1: 74.8%, BLEU-4: 29.2%, METEOR: 25.5%, ROUGE-L: 55.5%, CIDEr: 93.5%
Sarathi et al. [15] | 2017 | Effect of batch normalization and stacked LSTMs | MSR-VTT | Batch normalization, stacked LSTM | CIDEr: 120.2%

Chapter 3

Proposed System

3.1 Overview

Videos captured by cameras or obtained from other sources such as the Internet usually do not
have descriptions, although humans can largely understand them simply by watching. It is difficult
to write captions manually for a large set of images and videos, and the description may vary with
each individual's perception, mood, and interpretation at the time of observation, which sometimes
leads to inaccuracy.

Identifying and referring to a specific video frame manually is also a tedious job. Video captioning
using machine learning is therefore the process of automatically generating textual descriptions of
the content of a video using algorithms and models. This involves analyzing the visual and audio
information in the video and converting it into natural language text that describes the actions,
objects, and context of the video.

Machine learning techniques such as deep neural networks are commonly used to learn patterns
and relationships in large amounts of video data and generate accurate and coherent captions for
a variety of videos.

Existing System Architecture [7]

Fig. 3.1 Existing system architecture

An object-oriented structured trajectory video captioning system is a computer vision system that
generates natural language descriptions of videos based on the motion trajectories of objects within the
video. This system uses object detection and tracking algorithms to identify and track objects in the
video, and then extracts motion trajectories to represent the movement of each object. These trajectories
are then used to create a structured representation of the video, which is used to generate captions using
natural language processing techniques. The system is designed to improve the accuracy and coherence
of video captions by focusing on the object trajectories rather than just the video frames themselves.
Overall, the system is a promising approach to automated video captioning with potential applications in
fields such as surveillance, robotics, and entertainment.

3.1.1 Proposed System Architecture

Fig 3.2 Proposed System Architecture

The user will browse the site and upload a video, which is then divided into frames. A CNN identifies the
objects present in each frame using a dataset that consists of a large set of frames along with corresponding
labels indicating the class to which each frame belongs. An LSTM then prepares captions from the objects
present in the frames using the training dataset, which comprises a video dataset and a text dataset. After
training, a suitable caption is generated. If the caption contains any grammatical errors, the DRPN module
removes them, and the corrected caption is then displayed.
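
The flow above can be expressed as a short sketch. This is only an illustration of how the stages chain together; each stage is passed in as a callable, and none of the names below come from the report or from a specific library.

# Minimal sketch of the proposed pipeline. Each stage (frame extraction, CNN
# object/feature encoder, LSTM caption generator, DRPN-style grammar correction)
# is supplied as a callable, so the sketch only fixes the order of the stages.

def caption_video(video_path, extract_frames, cnn_encoder, lstm_captioner, grammar_corrector):
    frames = extract_frames(video_path)                   # video -> list of frames
    features = [cnn_encoder(frame) for frame in frames]   # frames -> visual features
    raw_caption = lstm_captioner(features)                # features -> draft caption
    return grammar_corrector(raw_caption)                 # remove grammatical errors, then display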

3.2 Methodology

1) Keyframe Extraction for Video: Keyframe extraction is the process of identifying
representative frames from a video to generate captions. These frames contain important
visual information that can aid in describing the video's content. Keyframe extraction
algorithms use techniques such as object detection, motion analysis, and saliency detection
to identify the frames most relevant to the video's content. These keyframes are then
used to generate captions that accurately describe the video's content.
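
As a concrete illustration (not taken from the report, which does not fix a particular algorithm), the following minimal sketch samples keyframes with simple frame differencing using OpenCV; the threshold value is an arbitrary assumption.

# Minimal keyframe-extraction sketch using OpenCV frame differencing.
import cv2
import numpy as np

def extract_keyframes(video_path, diff_threshold=30.0):
    """Return frames that differ noticeably from the previously kept frame."""
    cap = cv2.VideoCapture(video_path)
    keyframes, prev_gray = [], None
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        # Keep the first frame, then any frame whose mean absolute difference
        # from the last keyframe exceeds the threshold.
        if prev_gray is None or np.mean(cv2.absdiff(gray, prev_gray)) > diff_threshold:
            keyframes.append(frame)
            prev_gray = gray
    cap.release()
    return keyframes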

2) Feature Extraction Using CNN: Convolutional Neural Networks (CNNs) are
commonly used for feature extraction in image recognition tasks. They are able to extract
features by analyzing the image at different levels of abstraction. CNNs typically consist
of multiple convolutional layers that apply filters to the image and extract features such
as edges,
corners, and shapes. These features are then passed through a pooling layer to reduce
dimensionality and increase computational efficiency. The resulting feature maps are
then fed into fully connected layers for classification or regression tasks.
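
A minimal sketch of this step is shown below. It assumes a pretrained ResNet-50 backbone from torchvision purely for illustration (the report specifies only that a CNN is used, not which architecture) and expects RGB frames as NumPy arrays.

# Sketch of CNN feature extraction for video frames using a pretrained backbone.
import torch
import torchvision.models as models
import torchvision.transforms as T

# Drop the final classification layer so the network outputs a 2048-d feature vector.
resnet = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
feature_extractor = torch.nn.Sequential(*list(resnet.children())[:-1]).eval()

preprocess = T.Compose([
    T.ToPILImage(),
    T.Resize((224, 224)),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def frame_features(frames):
    """Map a list of HxWx3 RGB frames to a (num_frames, 2048) feature tensor."""
    with torch.no_grad():
        batch = torch.stack([preprocess(f) for f in frames])
        return feature_extractor(batch).flatten(1)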

3) Sentence Formation Using LSTM: Long Short-Term Memory (LSTM) is a type of


Recurrent Neural Network (RNN) that can be used for sentence formation. In this
approach, each word in the sentence is encoded into a vector and passed as input to the
LSTM. The LSTM then processes the sequence of input vectors and generates a sequence
of output vectors that represent the words in the sentence. The output vectors can be
decoded to form the final sentence. LSTMs are particularly useful for tasks such as
language translation, text generation, and speech recognition.
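
The sketch below outlines such an LSTM decoder in PyTorch. The layer sizes and vocabulary size are illustrative assumptions, and tokenization, training, and beam-search decoding are omitted. At inference time, decoding would start from a start-of-sentence token and feed each predicted word back into the LSTM until an end-of-sentence token is produced.

# Sketch of an LSTM caption decoder conditioned on pooled video features.
import torch
import torch.nn as nn

class CaptionDecoder(nn.Module):
    def __init__(self, feature_dim=2048, embed_dim=256, hidden_dim=512, vocab_size=10000):
        super().__init__()
        self.init_h = nn.Linear(feature_dim, hidden_dim)  # video feature -> initial hidden state
        self.embed = nn.Embedding(vocab_size, embed_dim)  # word ids -> word vectors
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, vocab_size)       # hidden state -> word scores

    def forward(self, video_features, captions):
        # video_features: (batch, feature_dim) pooled over frames
        # captions: (batch, seq_len) ground-truth word ids (teacher forcing)
        h0 = torch.tanh(self.init_h(video_features)).unsqueeze(0)
        c0 = torch.zeros_like(h0)
        embedded = self.embed(captions)
        out, _ = self.lstm(embedded, (h0, c0))
        return self.fc(out)  # (batch, seq_len, vocab_size) logits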

3.2.3 Sample Dataset Used

An experiment is conducted in order to identify the input/output behavior of the system. The
sample datasets used as inputs in the experiments are identified and given in Table 3.1.

Table 3.1 Sample Dataset Used

Dataset | Users | Items | Interactions | Type

Video Caption | 10,000 | 5,000 | 100,000 | Mixed

YouTube Caption | 50,000 | 20,000 | 500,000 | Active

TED Talks Caption | 2,000 | 1,000 | 20,000 | Passive

3.2.4 Hardware and Software Specifications
The experiment setup is carried out on a computer system which has different hardware and software
specifications as given in Table 3.2 and Table 3.3 respectively.

Table 3.2 Hardware Specifications

Processor 2 GHz Intel

HDD 180 GB

RAM 2 GB

Table 3.3 Software Specification

Operating System Windows XP Professional With Service pack 2

Programming Language JDK 1.8

Database Oracle 9

3.2.5 Evaluation Metrics


The quality of the captioning system can be evaluated by comparing its generated captions against a
test set of reference captions. Such systems are typically measured using precision and recall [6].

Precision: a measure of exactness, it determines the fraction of relevant items retrieved out of all
items retrieved. Precision (P) is given in Equation 3.1; here it is the proportion of generated
captions that are actually correct.

Recall: a measure of completeness, it determines the fraction of relevant items retrieved out of all
relevant items. Recall (R) is given in Equation 3.2; here it is the proportion of all correct captions
that are actually generated.
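
Equations 3.1 and 3.2 are not reproduced in the extracted text; the standard definitions consistent with the descriptions above are:

P = |{relevant items} ∩ {retrieved items}| / |{retrieved items}|        (3.1)

R = |{relevant items} ∩ {retrieved items}| / |{relevant items}|         (3.2)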

Chapter 4

Applications

There are various applications of this system in both social and technical contexts; they are listed below.

4.1 Social

1. Education: Video captioning can be used to provide additional support for students who
need help with language comprehension, including those learning a new language or
students with reading difficulties.

2. Marketing: Video captioning can help businesses reach a wider audience by making their
content accessible to people who prefer to watch videos with captions.

3. Search engine optimization: Captioning can also be used for search engine optimization
(SEO), as captions can improve the discoverability of videos and improve search engine
rankings.

4. Entertainment: Video captioning can also be used to create subtitles for movies, TV
shows, and other forms of entertainment, making them accessible to people who are deaf or
hard of hearing.

4.3 Technical
The technical applications of video caption generators involve the use of various technologies and
tools to create accurate and efficient captions for video content. Here are some technical
applications of video caption generators:

1. Natural Language Processing (NLP): Video caption generators use NLP techniques to
analyze the audio of the video and convert it into text. This involves recognizing speech
patterns and identifying words and phrases to create a written transcript.

2. Machine Learning: Machine learning algorithms can be used to train the video caption
generator to recognize specific accents, dialects, and languages. This allows for more
accurate captioning, even in complex situations such as multiple speakers or background noise.

3. Speech Recognition: Video caption generators use speech recognition software to


transcribe spoken words into text. This technology has advanced significantly in recent
years, allowing for more accurate and reliable captioning.

4. Text-to-Speech: Text-to-speech technology can be used to convert the written transcript


into an audio file that is synchronized with the video. This can be useful for people who are
deaf or hard of hearing.

5. Caption Embedding: Some video caption generators use caption embedding techniques to
create captions that are overlaid onto the video itself. This creates a more seamless and
integrated viewing experience, as the captions appear directly on the video rather than in a
separate window.
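
As a concrete illustration of the caption-embedding idea in point 5, the sketch below overlays a caption string onto every frame of a video with OpenCV and writes the result to a new file; the font, position, and codec choices are assumptions made only for illustration.

# Sketch of caption embedding: burning caption text onto video frames with OpenCV.
import cv2

def embed_caption(input_path, output_path, caption):
    cap = cv2.VideoCapture(input_path)
    fps = cap.get(cv2.CAP_PROP_FPS)
    width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
    height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
    writer = cv2.VideoWriter(output_path, cv2.VideoWriter_fourcc(*"mp4v"),
                             fps, (width, height))
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        # Draw the caption near the bottom of the frame in white text.
        cv2.putText(frame, caption, (20, height - 30),
                    cv2.FONT_HERSHEY_SIMPLEX, 1.0, (255, 255, 255), 2, cv2.LINE_AA)
        writer.write(frame)
    cap.release()
    writer.release()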

Overall, video caption generators use a variety of advanced technologies to create accurate,
efficient, and accessible captions for video content. These technical applications are essential for
ensuring that video content is accessible to everyone, regardless of their hearing ability or
language proficiency.

Chapter 5

Summary

The report provides a comprehensive overview of video captioning systems, which are software or
applications designed to automatically generate captions or subtitles for videos. The purpose of
these systems is to make video content accessible to a wider audience, including people with
hearing impairments or those who prefer to watch videos with captions. The report outlines the
process of creating captions, which involves several steps such as audio transcription, text
normalization, and time synchronization. Audio transcription is the process of converting the
spoken words in the video into text format, while text normalization involves correcting any errors
or inconsistencies in the text. Time synchronization is the process of aligning the captions with the
corresponding audio or video segments.

Different types of video captioning systems are discussed in the report, including deep learning
systems. Deep learning is a subfield of machine learning that uses neural networks to analyze large
amounts of data and make predictions or decisions based on that data. Deep learning systems can
be highly accurate in generating captions, but they also require a large amount of training data and
computational resources. The report also describes the evaluation of video captioning systems,
which typically involves assessing the accuracy and readability of the captions. Factors such as the
system's ability to handle different accents and dialects, as well as its ability to generate culturally
sensitive captions, are also important considerations.

Overall, the report concludes that video captioning systems have the potential to greatly improve
the accessibility of video content. However, there are still challenges that need to be addressed,
such as handling different accents and dialects and generating culturally sensitive captions. Despite
these challenges, video captioning systems represent an important step towards making video
content more inclusive and accessible to all.

References

1) V. Iashin and E. Rahtu, "Multi-modal Dense Video Captioning," 2020 IEEE/CVF


Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Seattle,
WA, USA, 2020, pp. 4117-4126, doi: 10.1109/CVPRW50498.20

2) F. Zhu et al., "Object-Oriented Video Captioning via Structured Trajectory and
Adversarial Learning," in IEEE Access, vol. 8, pp. 169146-169159, 2020, doi:
10.1109/ACCESS.2020.3021857.

3) W. Choi, J. Chen and J. Yoon, "Parallel Pathway Dense Video Captioning With
Deformable Transformer," in IEEE Access, vol. 10, pp. 129899-129910, 2022, doi:
10.1109/ACCESS.2022.3228821.

4) J. Deng, L. Li, B. Zhang, S. Wang, Z. Zha and Q. Huang, "Syntax-Guided Hierarchical


Attention Network for Video Captioning," in IEEE Transactions on Circuits and Systems
for Video Technology, vol. 32, no. 2, pp. 880-892, Feb. 2022, doi:
10.1109/TCSVT.2021.3063423.

5) S. Xiao, Z. Zhao, Z. Zhang, Z. Guan and D. Cai, "Query-Biased Self-Attentive Network


for Query-Focused Video Summarization," in IEEE Transactions on Image Processing, vol.
29, pp. 5889-5899, 2020, doi: 10.1109/TIP.2020.2985868.

6) J. Gao and C. Xu, "Learning Video Moment Retrieval Without a Single Annotated Video,"
in IEEE Transactions on Circuits and Systems for Video Technology, vol. 32, no. 3, pp.
1646-1657, March 2022, doi: 10.1109/TCSVT.2021.3075470

7) V. Sarathi, A. Mujumdar and D. Naik, "Effect of Batch Normalization and Stacked LSTMs
on Video Captioning," 2021 5th International Conference on Computing Methodologies
and Communication (ICCMC), Erode, India, 2021, pp. 820-825, doi:
10.1109/ICCMC51019.2021.9418036.

8) Y. Yang et al., "Video Captioning by Adversarial LSTM," in IEEE Transactions on Image


Processing, vol. 27, no. 11, pp. 5600-5611, Nov. 2018, doi: 10.1109/TIP.2018.2855422.

9) S. Amirian, K. Rasheed, T. R. Taha and H. R. Arabnia, "Automatic Image and Video


Caption Generation With Deep Learning: A Concise Review and Algorithmic Overlap," in
IEEE Access, vol. 8, pp. 218386-218400, 2020, doi: 10.1109/ACCESS.2020.3042484.

10) R. Shetty, H. R. Tavakoli and J. Laaksonen, "Image and Video Captioning with Augmented
Neural Architectures," in IEEE MultiMedia, vol. 25, no. 2, pp. 34-46, Apr.-Jun. 2018, doi:
10.1109/MMUL.2018.112135923.

Acknowledgement

We would like to express our special thanks to Prof. Supriya Khaitan Chandra, our
major-project guide who guided us through the project and who helped us in applying the
knowledge that we acquired during the semester and learning new concepts.

We would like to express our special thanks to Dr. Satishkumar Varma, Head of the
Department of Information Technology, who gave us the opportunity to do this major project
because of which we learned new concepts and their application.

We are also thankful to our major-project coordinator Prof. Krishnendu Nair along with
other faculties for their encouragement and support.

Finally, we would like to express our special thanks and gratitude to Principal Dr. Sandeep
Joshi, who gave us the opportunity and facilities to conduct this major project.

Soumodip Dutta
Anuj Patil
Martin Joseph

