
Received 8 May 2023, accepted 20 May 2023, date of publication 24 May 2023, date of current version 1 June 2023.

Digital Object Identifier 10.1109/ACCESS.2023.3279816

Step by Step: A Gradual Approach for Dense Video Captioning
WANGYU CHOI 1, JIASI CHEN 2, (Member, IEEE), AND JONGWON YOON 1, (Member, IEEE)
1 Department of Computer Science and Engineering (Major in Bio Artificial Intelligence), Hanyang University, Ansan 15588, South Korea
2 Department of Computer Science and Engineering, University of California at Riverside, Riverside, CA 92521, USA

Corresponding author: Jongwon Yoon (jongwon@hanyang.ac.kr)


This research was supported in part by the Basic Science Research Program through the National Research Foundation of South Korea
(NRF) funded by the Ministry of Education under Grant NRF-2022R1A2C1008743; and in part by the MSIT (Ministry of Science and
ICT), Korea, under the Grand Information Technology Research Center support program (IITP-2023-2020-0-01741) supervised by the
IITP (Institute for Information & communications Technology Planning & Evaluation).

ABSTRACT Dense video captioning aims to localize and describe events for storytelling in untrimmed videos. It is a conceptually very challenging task that requires concise, relevant, and coherent captioning based on high-quality event localization. Unlike simple temporal action localization without overlapping events, dense video captioning requires detecting multiple, possibly overlapping regions in order to branch out the video story. Most existing methods either generate numerous candidate event proposals and then eliminate duplicates using an event proposal selection algorithm (e.g., non-maximum suppression), or generate event proposals directly through box prediction and binary classification mechanisms, similar to object detection. Despite these efforts, such approaches tend to fail to localize overlapping events into different stories, hindering high-quality captioning. In this paper, we propose SBS, a dense video captioning framework with a gradual approach that addresses the challenge of localizing overlapping events and ultimately constructs high-quality captions. SBS accurately estimates the number of explicit events for each video snippet and then detects the boundaries of contexts/activities, which provide the details needed to generate event proposals. Based on both the number of events and the boundaries, SBS generates the event proposals. SBS then encodes the context of the event sequence and finally generates sentences describing the event proposals. Our framework is highly effective in localizing multiple, overlapping events, and experimental results show state-of-the-art performance compared to existing methods.

INDEX TERMS Dense video captioning, event captioning, event localization, event proposal generation,
video captioning.

The associate editor coordinating the review of this manuscript and approving it for publication was Khursheed Aurangzeb.

I. INTRODUCTION
The advent of large-scale video activity datasets [1], [2], [3] has enabled significant progress in video captioning, which aims to generate sentences describing the activity in short videos [4], [5], [6], [7], [8], [9], [10], [11], [12], [13], [14], [15], [16]. However, real-world videos, such as video surveillance footage, are generally untrimmed and contain multiple stories that overlap in time; describing such a video in a single sentence is therefore insufficient and does not provide all the necessary information. To overcome these shortcomings, dense video captioning has recently emerged to describe the video in complexity and detail.

The dense video captioning task is conceptually very challenging because it requires detecting multiple salient temporal regions (i.e., events) that overlap in an untrimmed video. Most of the existing work divides the captioning task into two sub-tasks, event localization and event captioning, and then performs them sequentially on the encoded video, either bottom-up [17], [18], [19], [20], [21] or top-down [22]. Despite these efforts, there are still limitations in accurately localizing overlapping events in video.

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 License. For more information, see https://creativecommons.org/licenses/by-nc-nd/4.0/
FIGURE 1. Comparison between the existing methods and ours for dense video captioning. Similar to human interpretation, the proposed algorithm first scans a given video and explicitly determines the number of events. It then generates events by specifying the start and end timestamps of the salient region based on the number of events. Finally, it describes a specific sentence for each event region. On the other hand, existing methods generate a large number of event proposals and then remove duplicates with a hand-crafted algorithm such as NMS. This makes it difficult to detect different events (i.e., actions) in the same time period.

In particular, the bottom-up approach (i.e., localize-then-describe) generates numerous event proposals [23] from the extracted temporal video features and then eliminates duplicates with a hand-crafted algorithm such as non-maximum suppression (NMS). A simple temporal feature is insufficient to detect overlapping events, and the NMS algorithm tends to eliminate overlapping events even when they correspond to different activities. In contrast, the top-down approach, which localizes events after generating a paragraph from the video, usually struggles to generate multiple sentences for the same temporal region. This inability to detect overlapping events hinders multi-story video description.

Figure 1 shows an example of the difference between the existing methods and our method. Existing methods are an example of the widely used bottom-up approach. They generate a number of proposals from the video and then consider only their start and end timestamps to remove duplicates, regardless of the video's content. In other words, when there are multiple actions in the same temporal region, they cannot distinguish them and leave only one. Our method, on the other hand, takes the video content into account. We first scan the video and determine the number of actions in each temporal region. We then estimate the exact boundary of each event and generate the final event proposals. Consequently, ours is a flexible framework for overlapping proposals.

We observed several key factors in the dense video captioning task that affect both event localization and event captioning. Unlike simple temporal action localization [24], [25], [26], [27], [28] without event overlapping, a video frame in dense video captioning can contain multiple overlapping events, so the event localization module needs to detect an explicit event count. In addition, algorithms used for event proposal selection, such as NMS, use only the start and end timestamps without considering the content, and hence they unnecessarily remove information. In the captioning subtask, understanding the context between events at a high level is essential to ensure concise, relevant, and coherent sentence generation.

Inspired by the observations mentioned above, we propose SBS, a framework for dense video captioning that takes a gradual approach from abstract to detail. As depicted in Figure 1, SBS proceeds in five steps. First, it estimates the number of events in the temporal domain from the encoded features. Unlike a binary classification of actionness (positive or negative), the explicit number of events is effective for localizing overlapping events, as it provides the maximum number of overlapping events in the corresponding temporal region. Second, SBS detects the boundaries of events in the temporal domain as additional information for generating event proposals. This step detects changes in the video scene (e.g., the appearance of an object or the start and end of an action) to accurately determine the start and end of an event. Third, SBS generates the set of event proposals contained in the video from the number of events and their boundaries by introducing an actionness-weighted tIoU algorithm, without any proposal selection algorithms.


Fourth, SBS encodes the context of the generated event proposals to ensure coherence between sentences before generating a sentence for each event. Finally, it constructs sentences using the encoded context of the event proposals. In other words, SBS's captioning network consists of two-level RNNs (i.e., a context encoder and a sentence decoder), which guarantees sentence fluency and coherence between the sentences in a video.

The rest of the paper is organized as follows. We first discuss related work in Section II. The proposed method and its training methodology are described in Sections III and IV, respectively. We evaluate our framework in Section V and conclude the paper in Section VI.

II. RELATED WORK
A. VIDEO CAPTIONING TASKS
1) VIDEO CAPTIONING
The traditional video captioning task aims to generate a single sentence that describes the most important event in a short (i.e., trimmed) video. Video captioning frameworks [4], [16] have been proposed, inspired by sequence-to-sequence tasks [29], [30]. Their architecture consists of two parts: an input video encoder and a sentence generator. To this end, they employ Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs) such as Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) networks. The video encoder first extracts visual features from the input frames and then uses RNNs to model the temporal dependencies. The sentence generator uses RNNs to generate words one by one. For example, the methods in [4] and [16] extract visual features from a video using CNNs, encode the video features, and then use LSTMs to decode them into natural language. To further improve captioning quality, subsequent work extends the encoder-decoder structure by incorporating temporal attention mechanisms [5], hierarchical RNNs [6], [8], LSTMs with visual-semantic embedding [7], semantic decoders [9], [10], [11], spatial-hard attention [31], a reconstruction network [12], and reinforced adaptive attention [32]. Despite these efforts, these methods are limited to generating a single sentence from a trimmed video. To address this limitation, several works [33], [34], [35] have emerged that describe the video in one detailed paragraph.

B. PROS AND CONS OF EXISTING METHODS
We establish several classification criteria for existing dense video captioning frameworks to clearly identify their strengths and weaknesses.

1) END-TO-END TRAINABLE
End-to-end trainable models [20], [36] are typically preferred because they are simple and highly reproducible. They have a low inductive bias, allowing the loss for the final output to propagate back to the input. This results in high performance given sufficient support from a large dataset. However, for the dense video captioning task, where the dataset is limited compared to the complexity of the output, a higher inductive bias may be a better choice for achieving high performance. As such, most models [17], [18], [19], [21], [37] are not designed to be end-to-end trainable for dense video captioning. Furthermore, end-to-end trainable models are not naturally interpretable.

2) ENCODING METHOD FOR TEMPORAL DEPENDENCY
To encode sequential data such as video, the encoder must capture temporal dependency. This is particularly important in dense video captioning, which deals with long videos. Many frameworks adopt LSTMs [17], [18], [19], [21], [37] or Transformers [20], [22], [36] based on their proven effectiveness in the NLP field. Prior work has shown that Transformers have advantages over LSTMs. First, Transformers capture long-term dependencies through self-attention mechanisms, whereas LSTMs can lose important information in long sequences. Second, Transformers can process input sequences in parallel, leading to faster training and inference, whereas the computation time of LSTMs grows proportionally to the length of the input sequence. However, Transformers generally require more memory than LSTMs for the same performance.

3) DESIGN PARADIGM
There are various design paradigms for dense video captioning, including bottom-up, top-down, and parallel decoding. Bottom-up and top-down approaches divide dense video captioning into two subtasks, event localization and captioning, and process them sequentially. Bottom-up approaches [17], [18], [19], [21], [37] describe each event after event localization, while top-down approaches [22] perform event localization based on multiple sentences generated about the video. Bottom-up and top-down designs share the limitation that the final result is influenced by the preceding module because the subtasks are processed sequentially. In addition, the bottom-up approach requires an event proposal selection module, such as NMS. To address this, the parallel decoding approach performs event localization and captioning simultaneously. However, this approach has the disadvantage that the number of events needs to be predefined by an additional event counter module, on which it strongly depends.

C. DESIGN CHOICE
Considering the observations mentioned earlier, we adopt a bottom-up approach for high performance within a limited dataset while overcoming its challenges. Firstly, we focus more on the event localization subtask to achieve more accurate event localization. As a result, this minimizes error propagation to subsequent modules and eliminates the need for an event proposal selection algorithm. We adopt a transformer as the encoder, considering long-term dependencies.


FIGURE 2. Summary of procedures in SBS. (1) Video encoder: Given a video, SBS first extracts the spatiotemporal video features using the C3D networks.
The transformer encoder encodes the C3D features with long-range temporal dependency. (2) Temporal event counter: The first step for event localization
is to estimate the explicit number of events by taking the encoded video as input. (3) Temporal boundary classifier: To accurately localize the start and
end of the event, the next step is to classify the boundaries of the events in the timeline. (4) Event proposal generation: The algorithm generates event
proposals from the acquired event count and boundaries without an event selection algorithm. (5) Context-level event encoder: A context-level event
encoder composed of a single LSTM takes a sequence of event proposals as input and outputs the context-encoded hidden state. (6) Sequential event
captioner: Finally, the sequential event captioner outputs each word together with the context feature.

Our method is superior to existing methods in three aspects. First, SBS explicitly estimates the number of actions, rather than a binary actionness value. Most methods that adopt actionness for event localization evaluate it as a probability (i.e., binary), such as in [24]. This makes it difficult to detect multiple events in the same temporal region. Second, SBS does not rely on hand-crafted algorithms such as NMS. Since methods [7], [17], [19], [20], [21], [38] that employ NMS ignore the content of the video, necessary events may be eliminated, which may omit an important description of the video. Finally, a two-step caption generation network allows the generation of fluent and coherent sentences. In dense video captioning, one event is not only related to one sentence; the sentences of a video are also related to each other in context. Thus, a sentence generation module with a two-level structure following this concept makes the sentences fluent and coherent.

III. METHOD
A. OVERVIEW
An overview of SBS is depicted in Figure 2. SBS consists of six modules: (1) video encoder, (2) temporal event counter, (3) temporal boundary classifier, (4) event proposal generator, (5) context-level event encoder, and (6) sequential event captioner. Given a video V, it first composes a sequence of snippets with the video encoder. It then uses a temporal event counter network to estimate the number of overlapping events in the timestamp corresponding to each snippet. For detailed localization of video segments, it detects the start and end boundaries of each event and generates event proposals along with the number of events. The next step is to generate a sentence for each event proposal. To improve coherence between sentences, we construct the captioning network in two steps: it converts the video segments corresponding to each proposal into context-level representations, and finally generates sentences using the context-level representations.
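To make the data flow concrete, the following is a minimal sketch of how these six modules could be chained at inference time. It is illustrative only: the callables `encoder`, `counter`, `boundary_clf`, `proposal_gen`, and `captioner` are hypothetical placeholders that simply mirror the steps described above, not the implementation used in the experiments.

```python
def sbs_inference(snippet_frames, encoder, counter, boundary_clf,
                  proposal_gen, captioner):
    """Sketch of the SBS pipeline on one untrimmed video (hypothetical interfaces)."""
    h = encoder(snippet_frames)            # (1) snippet-level hidden states H_v
    count_table = counter(h)               # (2) per-snippet event-count probabilities P_c
    boundary_table = boundary_clf(h)       # (3) per-snippet start/end probabilities P_b
    proposals = proposal_gen(count_table, boundary_table)   # (4) [(start, end), ...]
    captions, context = [], None
    for start, end in proposals:           # (5)+(6) context-aware captioning, in order
        context, sentence = captioner(h[start:end + 1], context)
        captions.append(((start, end), sentence))
    return captions
```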


B. VIDEO ENCODER
For a given video V, we divide the video into non-overlapping, fixed-length (i.e., 8-frame) snippets V = {v_n}. The goal of the video encoder is to extract the hidden states H_v for all snippets. Our video encoder consists of a CNN backbone and a sequential data encoder (e.g., an RNN). Specifically, we adopt the C3D network [39] and the transformer encoder [40] with multi-head attention (MA) in consideration of their efficiency and performance. The C3D network is a widely used pre-trained video feature extractor, and the transformer encoder allows encoding of long-range dependencies (the longest video in the dataset corresponds to 2,827 snippets of 8 frames; details are described in Section V-A). In SBS, the C3D network takes each snippet as input and extracts features F_v ∈ R^{T×d_f}, where T is the number of snippets and d_f is the dimension of the features. We apply up/down sampling with the nearest-neighbor interpolation algorithm to feed the video feature F_v of dimension d_f to the transformer encoder of dimension d_m:

H_v^0 = I_nearest(F_v)    (1)

Then, starting with the input H_v^0, the encoder iteratively feeds the output of each layer into the next, for as many layers as the encoder has (each using multi-head attention):

H_v^{l+1} = FFN(Ψ(H_v^l + MA(H_v^l, H_v^l, H_v^l)))    (2)
FFN(x) = max(0, xW_1 + b_1)W_2 + b_2,    (3)

where Ψ(·) represents the layer normalization function. Finally, the output of the video encoder is the output of the last layer, H_v.
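As a rough sketch (not the released implementation), the encoder of Eqs. (1)–(3) can be approximated with off-the-shelf PyTorch components: nearest-neighbor interpolation maps the C3D features from d_f to d_m, and a standard Transformer encoder provides the stacked multi-head attention and feed-forward blocks. The layer sizes follow Section V-C; the C3D feature dimension of 4096 and the exact normalization placement are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SnippetEncoder(nn.Module):
    """Sketch of the SBS video encoder: C3D features -> hidden states H_v."""
    def __init__(self, d_f=4096, d_m=512, n_heads=8, n_layers=6, d_ff=2048, dropout=0.1):
        super().__init__()
        self.d_m = d_m
        layer = nn.TransformerEncoderLayer(d_model=d_m, nhead=n_heads,
                                           dim_feedforward=d_ff, dropout=dropout,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, f_v):                          # f_v: (B, T, d_f) C3D features
        # Eq. (1): nearest-neighbor interpolation from d_f to d_m.
        h0 = F.interpolate(f_v, size=self.d_m, mode="nearest")      # (B, T, d_m)
        # Eqs. (2)-(3): stacked multi-head self-attention and feed-forward layers.
        return self.encoder(h0)                      # H_v: (B, T, d_m)

# Toy usage: 100 snippets of 4096-dimensional C3D features.
h_v = SnippetEncoder()(torch.randn(1, 100, 4096))    # -> torch.Size([1, 100, 512])
```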
C. TEMPORAL EVENT COUNTER
The goal of the temporal event counter is to accurately estimate the number of events contained in the given video snippets. Toward this, we employ multiple temporal 1D convolutional layers to capture dependencies over different temporal ranges. A temporal convolutional layer is denoted as Conv(c_n, c_k, c_s), where c_n, c_k, and c_s are the number of filters, the kernel size, and the stride of the layer, respectively. As shown in Figure 2, we define three types of temporal convolutional layers to cover different ranges of snippets: (i) Conv(512, 2, 1), (ii) Conv(512, 10, 5), and (iii) Conv(512, 20, 10). One snippet has a duration of about 0.27 seconds, so the kernel sizes correspond to temporal extents of 0.53, 2.67, and 5.33 seconds. We stack the outputs of each layer in temporal alignment at the channel level. An average pooling layer then outputs the probability of the number of events corresponding to each snippet. We call this probability distribution an event counter table. The event counter table can be denoted as P_c = {p^c_{l,n}}^{L, N-1}_{l=1, n=0}, where L and N are the length of the input snippet sequence and the maximum number of overlapping events, respectively. The whole process can be represented as:

F_narrow = Conv_(512,2,1)(H_v)    (4)
F_mid = Conv_(512,10,5)(H_v)    (5)
F_wide = Conv_(512,20,10)(H_v)    (6)
F_all = Stack(F_narrow, F_mid, F_wide)    (7)
P_c = AvgPool(F_all, N),    (8)

where Stack(·) is a function that concatenates vectors with temporal alignment at the channel level, and AvgPool(·) is an average pooling function over the input vectors with a given output size. We set N to 15 by adding the no-event probability (i.e., background snippet) to the maximum number of overlapping events, which is 14.

Our temporal event counter has clear advantages in localizing multiple overlapping events. Specifically, it eliminates the need for the non-maximum suppression (NMS) algorithm by explicitly detecting the number of events (i.e., salient regions or intervals) while taking into account the spatiotemporal context of the video. The NMS algorithm enforces a tradeoff between recall and precision because it is fully hand-crafted and operates with a single, fixed threshold [41]. Our temporal event counter also helps to determine the boundaries of events later.

D. TEMPORAL BOUNDARY CLASSIFIER
In the previous step, we detected snippets containing salient actions and estimated their number with the temporal event counter; however, the number of events is not sufficient to generate complete events (i.e., there are multiple possible cases). To converge to a single case, we introduce an additional module, the temporal boundary classifier. Its goal is to detect the boundary (i.e., start, end, or both) of an event given a snippet. To this end, we adopt a two-stream architecture (shown in Figure 2), where the first stream reuses the convolutional layers of the temporal event counter with fixed parameters, and the second stream employs multiple dense (i.e., small kernel size and stride) temporal 1D convolutional layers. Our intuition is that the temporal boundary classifier extracts actionness and boundary features from a sequence of snippets to localize the exact start and end timestamps of an event. The subsequent process is the same as for the temporal event counter, but for each snippet it outputs the start and end probabilities P_b = {(p^b_{l,s}, p^b_{l,e})}^L_{l=1}. We refer to P_b as the event boundary table.

E. EVENT PROPOSAL GENERATION
We have obtained the event counter table and the event boundary table in Sections III-C and III-D. Next, SBS uses these two tables to generate event proposals. We recall them as follows:

P_c = {p^c_{l,n}}^{L, N-1}_{l=1, n=0}
P_b = {(p^b_{l,s}, p^b_{l,e})}^L_{l=1}

Then, we create a 1D actionness table T_a from P_c. The actionness score of each snippet slot is the sum, over all event counts, of the count multiplied by its probability (i.e., the expected event count):

T_a = [Σ_{n=0}^{N-1} n · p^c_{l,n}]^L_{l=1}    (9)
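The counting head of Eqs. (4)–(8) and the actionness table of Eq. (9) could be sketched as follows. This is an illustrative reconstruction rather than the actual code: the paper does not specify how the three temporal resolutions are re-aligned before stacking or how the pooled features are mapped to the N count classes, so the interpolation and the linear head below are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TemporalEventCounter(nn.Module):
    """Sketch: encoded snippets H_v (B, L, d_m) -> event counter table P_c (B, L, N)."""
    def __init__(self, d_m=512, n_max=15):
        super().__init__()
        # Eqs. (4)-(6): three temporal 1D convolutions with different receptive fields.
        self.conv_narrow = nn.Conv1d(d_m, 512, kernel_size=2, stride=1)
        self.conv_mid = nn.Conv1d(d_m, 512, kernel_size=10, stride=5)
        self.conv_wide = nn.Conv1d(d_m, 512, kernel_size=20, stride=10)
        self.head = nn.Linear(3 * 512, n_max)     # assumed projection to N count classes

    def forward(self, h_v):
        x = h_v.transpose(1, 2)                   # (B, d_m, L)
        length = x.size(-1)
        # Eq. (7): re-align each branch to length L and stack at the channel level.
        feats = [F.interpolate(conv(x), size=length, mode="nearest")
                 for conv in (self.conv_narrow, self.conv_mid, self.conv_wide)]
        f_all = torch.cat(feats, dim=1)           # (B, 3*512, L)
        # Eq. (8): per-snippet distribution over 0..N-1 overlapping events.
        return self.head(f_all.transpose(1, 2)).softmax(dim=-1)

def actionness_table(p_c):
    """Eq. (9): expected event count per snippet, T_a[l] = sum_n n * p_c[l, n]."""
    n = torch.arange(p_c.size(-1), dtype=p_c.dtype, device=p_c.device)
    return (p_c * n).sum(dim=-1)                  # (B, L)

p_c = TemporalEventCounter()(torch.randn(1, 200, 512))   # 200 snippets
t_a = actionness_table(p_c)                               # 1D actionness table
```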


Next, we generate candidate event proposals by constructing all possible start-end pairs from the event boundary table P_b:

Ê = {(ι_s, ι_e) | ι_s < ι_e, p^b_{ι_s,s} ≥ 0.5, p^b_{ι_e,e} ≥ 0.5}    (10)

where ι_s and ι_e are the start and end indices of an event proposal, respectively; Ê contains all possible pairs of a start and an end. We then select meaningful proposals from the candidate set Ê. To do this, we utilize the previously computed actionness table T_a. Specifically, we keep a candidate as a final event proposal if its average actionness score over all slots covered by the proposal is greater than 1. We calculate the actionness score of a candidate proposal using the following equation:

S = (Σ_{i=ι_s}^{ι_e} T_a[i]) / (ι_e − ι_s)

Our event proposal generation algorithm has several advantages over the traditional method of generating a large number of proposals and then removing duplicates. First, it selects event proposals considering the content, ensuring that meaningful events are maintained. In contrast, algorithms that ignore content, such as NMS, simply determine duplicates based on tIoU, potentially removing important proposals. Second, it is computationally efficient, as it does not generate a large number of event proposals. Furthermore, it generates event proposals with just a single forward pass.
F. SENTENCE GENERATION
Our sentence generation module consists of two steps: encoding the context and captioning the sequential events. Before generating words/sentences directly from the event proposals, we encode the context of the event proposals with the context-level event encoder to obtain concise and coherent sentences. The sequential event captioner then generates sentences with the encoded context. As shown in Figure 2, we employ two LSTMs, denoted LSTM_c and LSTM_s, for the context-level event encoder and the sequential event captioner, respectively. The context-level event encoder takes the video feature F_v ∈ R^{T×d_f} corresponding to an event proposal and encodes the context for the sequence of events, where T is the duration of the event proposal. The sequential event captioner generates word by word while maintaining the context of the other sentences in the same video by leveraging the encoded context. Given the t-th event proposal e_t, this process is formulated as follows:

r_t = LSTM_c(F_v(e_t), g_{t−1}, r_{t−1})    (11)
g_t = LSTM_s(F_v(e_t), r_t, g_{t−1}),    (12)

where r_t is the encoded context feature of the t-th event proposal, and F_v(·) is a function that outputs the video features corresponding to the input event proposal.
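The two-level recurrence of Eqs. (11)–(12) can be sketched with two LSTM cells, concatenating the extra conditioning input into each cell's input vector. This is only one plausible reading of the equations: the mean-pooling of the proposal's features, the greedy word loop without word-embedding feedback, and the feature dimension are simplifications or assumptions, not the implementation used in the experiments.

```python
import torch
import torch.nn as nn

class TwoLevelCaptioner(nn.Module):
    """Sketch of Eqs. (11)-(12): a context-level LSTM over event proposals feeding a
    sentence-level LSTM, so each caption stays aware of the previous events."""
    def __init__(self, d_feat=4096, d_hid=512, vocab_size=10000):
        super().__init__()
        self.ctx_cell = nn.LSTMCell(d_feat + d_hid, d_hid)    # LSTM_c
        self.sent_cell = nn.LSTMCell(d_feat + d_hid, d_hid)   # LSTM_s
        self.word_head = nn.Linear(d_hid, vocab_size)

    def forward(self, proposal_feats, max_words=20):
        # proposal_feats: list of (T_t, d_feat) tensors, one per event proposal e_t.
        d = self.ctx_cell.hidden_size
        r = (torch.zeros(1, d), torch.zeros(1, d))   # context state  r_t
        g = (torch.zeros(1, d), torch.zeros(1, d))   # sentence state g_t
        captions = []
        for feats in proposal_feats:
            f = feats.mean(dim=0, keepdim=True)      # pooled F_v(e_t) -- an assumption
            # Eq. (11): r_t = LSTM_c(F_v(e_t), g_{t-1}, r_{t-1})
            r = self.ctx_cell(torch.cat([f, g[0]], dim=-1), r)
            words = []
            for _ in range(max_words):
                # Eq. (12): g_t = LSTM_s(F_v(e_t), r_t, g_{t-1})
                g = self.sent_cell(torch.cat([f, r[0]], dim=-1), g)
                words.append(self.word_head(g[0]).argmax(dim=-1))
            captions.append(torch.stack(words, dim=1))           # greedy word ids
        return captions

caps = TwoLevelCaptioner()([torch.randn(12, 4096), torch.randn(30, 4096)])
```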
IV. TRAINING
The training process for our model involves three stages. First, we jointly train the video encoder and the temporal event counter to estimate the number of actions for each snippet in the given video. We then jointly train the video encoder and the temporal boundary classifier to estimate whether each snippet is background, the start of an event, or the end of an event (possibly both). Finally, we use the estimated number of actions for each snippet and the estimated starts and ends of events to generate event proposals, and jointly train the context-level encoder and the sequential event captioner to generate sentences for each event using the event proposals.

A. TEMPORAL EVENT COUNTER
To train the temporal event counter, we first need to construct the target event counter table from the ground-truth events. A ground-truth event is denoted as e* = [t*_s, t*_e], where t*_s and t*_e are the start and end timestamps, respectively. To rescale the timestamps to the range of snippets, we use e* = [m · t_s/d, m · t_e/d], where m and d are the number of snippets in the video and the duration of the video, respectively. Then, we can obtain the ground-truth label

O* = [o_1, o_2, . . . , o_L]

for the counter table by counting the number of overlapping events for every snippet, where o_l is the number of overlapping events at the l-th snippet. To handle imbalanced samples in the event counter table (where most samples are 1 to 3), we adopt the focal loss [42]. Finally, we train the temporal event counter by minimizing the negative log-likelihood loss for the ground-truth label O*:

L_TEC = − Σ_{l=1}^{L} log p^c_{l,o_l}    (13)
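Because the counter's negative log-likelihood (Eq. 13) is combined with the focal loss [42] to handle the imbalanced count labels, one plausible per-snippet loss is sketched below (γ = 2.0 as in Section V-C). How exactly the two terms are combined is not spelled out in the text, so this is an assumption.

```python
import torch
import torch.nn.functional as F

def counter_focal_loss(logits, target_counts, gamma=2.0):
    """Focal-weighted negative log-likelihood over the event counter table.

    logits:        (L, N) unnormalized scores for 0..N-1 overlapping events per snippet
    target_counts: (L,)   ground-truth number of overlapping events o_l per snippet
    """
    log_p = F.log_softmax(logits, dim=-1)                 # log P_c
    log_p_true = log_p.gather(1, target_counts.unsqueeze(1)).squeeze(1)
    p_true = log_p_true.exp()
    # The focal term (1 - p)^gamma down-weights the easy, majority count labels.
    return -(((1.0 - p_true) ** gamma) * log_p_true).sum()

# Toy example: 4 snippets, at most 3 overlapping events (N = 4 classes incl. background).
logits = torch.randn(4, 4, requires_grad=True)
targets = torch.tensor([0, 1, 1, 2])
counter_focal_loss(logits, targets).backward()
```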


B. TEMPORAL BOUNDARY CLASSIFIER
To train the temporal boundary classifier, we create an event boundary label similar to the label of the event counter table. We obtain all ground-truth events E* = {e*_k}^K_{k=1} = {(t*_{k,s}, t*_{k,e})}^K_{k=1} of the video. Then, we create a ground-truth label for the temporal boundary classifier by indicating with 0 or 1 whether the start and end points of an event are included in the l-th snippet. Finally, we obtain the following ground-truth labels from the start and end timestamps of the events contained in the dataset:

B* = [(c*_{1,s}, c*_{1,e}), . . . , (c*_{L,s}, c*_{L,e})]

where c*_{l,s} is 1 if the l-th snippet contains the start of an event and 0 otherwise, and likewise for c*_{l,e}.

We use a binary cross-entropy loss (i.e., multi-label classification) because overlapping events can cause a snippet to contain both a start and an end. Although this is similar to the temporal event counter, we employ an asymmetric loss to handle the imbalanced labels (i.e., most labels are negative samples). The loss function is defined as follows:

L_TBC = − (1/2) Σ_{l=1}^{L} [(α^+ c*_{l,s} log p^b_{l,s} + α^− (1 − c*_{l,s}) log(1 − p^b_{l,s})) + (α^+ c*_{l,e} log p^b_{l,e} + α^− (1 − c*_{l,e}) log(1 − p^b_{l,e}))]    (14)

where α^+ and α^− are hyper-parameters for positive and negative samples, respectively.
negative samples, respectively. network to a single layer of 512. To prevent overfitting and
improve generalization, we use PRELU [47] and GELU [48]
C. EVENT ENCODER AND EVENT CAPTIONER as activation functions for the CNNs and fully-connected
We use teacher forcing [43] scheme to train the event encoder layers for the temporal event counter and temporal boundary
and event captioner. Given the ground-truth sentence S ∗ , the classifier, which experimentally outperforms other options.
loss function is to minimize the negative log-likelihood of the We set the epochs for each stage of SBS to 20, 20, and 30,
ground-truth words w∗t,j for t-th event as follows: and train with an adamW optimizer [49] and a batch size
of 1. The focal loss hyperparameter, γ , is set to 2.0 (more
X Jt
T X   information on γ settings is in Section V-F). Several recently
LC = − log p w∗t,j | w∗t,<j , Fv (et ) (15) proposed methods [21], [22], [38] use reinforcement learning
t=1 j=1 (RL) to further improve the captioning module. To ensure fair
comparison, we also fine-tune the context-level event encoder
where j and w∗t,<j are the current position in the sequence and sequential event captioner using RL (based on [50]) with
for prediction and the preceding ground-truth words, respec- the reward function METEOR.
tively. Jt is the length of ground-truth words for the t-th event.
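Eq. (15) is the standard teacher-forcing cross-entropy over ground-truth words. Assuming the captioner exposes per-step vocabulary logits (an interface assumption, not taken from the paper), the loss reduces to the following.

```python
import torch
import torch.nn.functional as F

def caption_loss(step_logits, gt_words):
    """Sketch of Eq. (15): negative log-likelihood of ground-truth words under
    teacher forcing, summed over all events and word positions.

    step_logits: list over events of (J_t, vocab) logits, produced while feeding the
                 ground-truth prefix w*_{t,<j} and the event feature F_v(e_t)
    gt_words:    list over events of (J_t,) ground-truth word indices
    """
    return sum(F.cross_entropy(logits, words, reduction="sum")
               for logits, words in zip(step_logits, gt_words))

loss = caption_loss([torch.randn(7, 10000), torch.randn(12, 10000)],
                    [torch.randint(0, 10000, (7,)), torch.randint(0, 10000, (12,))])
```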

V. EXPERIMENT
A. DATASET
We evaluated the performance of SBS on the ActivityNet Captions dataset [17], which contains 19,994 YouTube videos. The dataset is divided into three subsets for training, validation, and testing, consisting of 10,009, 4,917, and 4,885 videos, respectively. The videos range in length from short to long, with minimum, average, and maximum lengths of 1.58, 117.60, and 755.11 seconds, respectively. The number of events per video ranges from 2 to 27, with an average of 3.66. The lengths of the sentences in the dataset range from 17 to 409 words, with an average of 67.7 words.

B. METRICS
To evaluate the event localization performance of SBS, we compare the temporal intersection over union (tIoU) between ground-truth events and predicted events. We measure recall and precision at thresholds of 0.3, 0.5, 0.7, and 0.9. Specifically, we consider a prediction to be true if the tIoU between the two is above the threshold. For captioning performance evaluation, we use three metrics: METEOR [44], CIDEr [45], and BLEU [46]. We use the publicly available evaluation code1 provided by the ActivityNet Captions Challenge. Given an event and sentence pair, we calculate a captioning score by comparing against the corresponding ground-truth sentence if the tIoU between the predicted event and any ground-truth event is greater than the threshold. Otherwise, the score is set to 0.

1 https://github.com/ranjaykrishna/densevid_eval
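For reference, the temporal IoU used for these thresholds (and for matching predicted events to ground-truth sentences) is the standard interval IoU; a minimal sketch is given below. It is not taken from the evaluation toolkit.

```python
def tiou(pred, gt):
    """Temporal IoU between two (start, end) intervals, in seconds or snippets."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return inter / union if union > 0 else 0.0

# A prediction counts as true at threshold 0.7 if it overlaps some ground truth enough.
print(tiou((2.0, 9.5), (3.0, 10.0)))   # ~0.81 -> true positive at tIoU >= 0.7
```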
C. IMPLEMENTATION DETAILS
The transformer in SBS's video encoder follows the original paper [40]. Specifically, we set the hidden size d_m of the multi-head attention to 512, the number of attention heads to 8, and the number of encoder layers to 6. The feed-forward network has 2,048 nodes. The dropout rate for the residual blocks and attention is set to 0.1. Following previous works [21], [36], we set the hidden size of the event context encoder and the sequential captioner in the captioning network to a single layer of size 512. To prevent overfitting and improve generalization, we use PReLU [47] and GELU [48] as activation functions for the CNNs and the fully-connected layers of the temporal event counter and temporal boundary classifier, which experimentally outperforms other options. We set the number of epochs for the three stages of SBS to 20, 20, and 30, and train with the AdamW optimizer [49] and a batch size of 1. The focal loss hyperparameter γ is set to 2.0 (more information on the γ setting is in Section V-F). Several recently proposed methods [21], [22], [38] use reinforcement learning (RL) to further improve the captioning module. To ensure a fair comparison, we also fine-tune the context-level event encoder and sequential event captioner using RL (based on [50]) with METEOR as the reward function.

D. PERFORMANCE COMPARISON
1) EVENT LOCALIZATION
We compared SBS against several state-of-the-art dense video captioning methods. First, we present the event localization performance in Table 1. Above all, SBS shows the best F1 score among the compared methods. MFT and SDVC generate candidate event proposals and then use an event selection network, such as ESGN, to remove less significant or overlapping ones. On the other hand, PDVC and PPVC generate proposals directly, in parallel, from a localization head composed of box prediction and classification. In contrast, SBS takes a gradual approach: it explicitly estimates the number of events, detects event boundaries, and then generates event proposals. This approach is more effective in localizing overlapping or short events. Moreover, SBS can detect multiple context events in a video, enabling detailed video descriptions (details in Section V-E). SBS outperforms MFT by a large margin and is superior to both SDVC and PDVC in terms of overall localization performance and F1 score.

2) DENSE CAPTIONING
Table 2 shows the performance of state-of-the-art methods in dense video captioning. When using ground-truth event proposals, SBS achieves a remarkable improvement over other methods in terms of the METEOR score, which is a commonly used evaluation metric in the ActivityNet Captions Challenge. When using predicted proposals, SBS achieves performance comparable to state-of-the-art algorithms on CIDEr and METEOR. Specifically, it exceeds SGR by 5.8 on CIDEr and is only 0.02 below SGR on METEOR.


TABLE 1. Performance comparison of event localization with respect to four temporal intersection over union (tIoU) thresholds on the ActivityNet Captions validation set.

TABLE 2. A summary of the performance comparison using BLEU, CIDEr, and METEOR on the ActivityNet Captions validation set. We present the performance obtained from both learned and ground-truth events. An asterisk (*) indicates methods evaluated on an incomplete dataset (e.g., 80%) due to download issues. CE and RL stand for cross-entropy and reinforcement learning, respectively.

In particular, SBS outperforms MV-GPT and Vid2Seq, which use pre-trained models. These results show that the context-level event encoder and sequential captioner, which consist of two LSTMs, provide excellent captioning quality.

TABLE 3. Ablation results on the ActivityNet Captions validation set. TEC, TBC, and NMS stand for the temporal event counter, the temporal boundary classifier, and the non-maximum suppression algorithm, respectively. R, P, and M stand for recall, precision, and F1 score, respectively.

3) INFERENCE TIME
We compare the inference time of SBS with that of other methods in Table 5. The inference time of SBS is 1.13 seconds, faster than TDA-CG and MT but slower than PDVC. The speedup over the former can be explained by the fact that SBS does not generate unnecessary event proposals, unlike previous works; moreover, SBS generates event proposals with only one feed-forward pass, without repetitive inferences for event proposal generation. However, compared to PDVC, which decodes captioning and localization in parallel, SBS is inherently slower.

E. QUALITATIVE RESULTS
We visualize the results for two videos from the ActivityNet Captions validation set in Figure 3 to examine the output of SBS in more detail. The examples illustrate two main features of SBS. First, SBS is highly effective at detecting events that occur at the same time. For instance, in the first video, the red and blue events have a temporal Intersection over Union (tIoU) of over 0.8, but SBS recognizes them as distinct events and retains both; algorithms that do not take content into account, like NMS, would remove one of them. In the second video, SBS creates temporally overlapping events and provides different captions, demonstrating its ability to caption multiple stories that unfold simultaneously in the video. Second, the 1D actionness table approximately reflects the number of scene transitions and overlapping events in the video story. In the first video, it is evident that the black event interval (where three events overlap) has the highest actionness. In the second video, we observe that the actionness rapidly decreases whenever the scene changes and remains consistent while the scene is maintained.

FIGURE 3. Examples of qualitative results on the ActivityNet Captions dataset. Sentences corresponding to the same event are matched with the same color. We also show the 1D actionness table obtained from the event counter table.

F. ABLATION STUDY
We conduct several ablation studies on the ActivityNet Captions validation set to verify the effectiveness of each of SBS's modules. We compare three combinations of modules: (i) SBS without the temporal boundary classifier, (ii) SBS without the temporal event counter, and (iii) the full model. Specifically, model (i) generates a large number of valid proposals from the event counter map and removes duplicates with a non-maximum suppression algorithm, because the information around the boundaries is ambiguous. Model (ii) generates event proposals from a binary actionness map without inferring an explicit number of events. Model (iii) is the full SBS using all modules. The results are summarized in Table 3. Model (i) detects fewer events per video due to the elimination of event proposals by NMS, which eventually leads to low recall. Furthermore, ambiguous event boundaries hinder high-quality localization. Since model (ii) generates event proposals using an actionness map composed of positive or negative elements (with a threshold of 0.5), it has difficulty detecting events in overlapping or different contexts. The full model successfully localizes events and leads to improved performance in terms of METEOR. These results demonstrate that both the temporal event counter and the temporal boundary classifier, the localization modules of SBS, are effective in detecting the multiple overlapping events of dense video captioning.

We also vary the focal loss hyperparameter γ ∈ {0.5, 1.0, 2.0, 5.0} used in [42] and report the F1 score of event localization for the four trained models. The results are presented in Table 4. With an F1 difference of at most 2.6 over this wide range of γ, SBS shows stable results. Based on these results, we set γ to 2.0 for the best performance.


TABLE 4. Ablation study on the focal loss weight γ. We report F1 scores for tIoU thresholds of 0.3, 0.5, 0.7, and 0.9.

TABLE 5. Comparison of SBS's inference time with existing methods. We measure the average inference time per video on the ActivityNet Captions dataset with a single RTX 3090 GPU.

VI. CONCLUSION
We propose SBS, a framework for dense video captioning that takes a gradual approach, mimicking the way humans interpret a video, from abstract to detail. SBS is designed based on several key observations in the dense video captioning task that strongly affect both event localization and event captioning. Our event localization is unique in that it uses explicit event counts and temporal boundaries, maintaining all necessary information even for highly overlapping events. Unlike existing methods, SBS does not require any proposal selection algorithm. As a result, SBS identifies events close to the ground truth in terms of the average number of events per video. This results in 7% higher recall than state-of-the-art methods. SBS also uses the encoded context to construct sentences while ensuring coherence between them. Experimental results on the ActivityNet Captions dataset show that SBS improves METEOR by 35% (with ground-truth proposals) while providing the same captioning performance. Furthermore, qualitative results show that SBS can detect highly overlapping events (with more than 0.9 tIoU) belonging to different stories, leading to a rich description.

REFERENCES
[1] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei, ''Large-scale video classification with convolutional neural networks,'' in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2014, pp. 1725–1732.
[2] F. C. Heilbron, V. Escorcia, B. Ghanem, and J. C. Niebles, ''ActivityNet: A large-scale video benchmark for human activity understanding,'' in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2015, pp. 961–970.
[3] W. Kay, J. Carreira, K. Simonyan, B. Zhang, C. Hillier, S. Vijayanarasimhan, F. Viola, T. Green, T. Back, P. Natsev, M. Suleyman, and A. Zisserman, ''The kinetics human action video dataset,'' 2017, arXiv:1705.06950.
[4] S. Venugopalan, H. Xu, J. Donahue, M. Rohrbach, R. Mooney, and K. Saenko, ''Translating videos to natural language using deep recurrent neural networks,'' in Proc. Conf. North Amer. Chapter Assoc. Comput. Linguistics, Hum. Lang. Technol., 2015, pp. 1494–1504.
[5] L. Yao, A. Torabi, K. Cho, N. Ballas, C. Pal, H. Larochelle, and A. Courville, ''Describing videos by exploiting temporal structure,'' in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Dec. 2015, pp. 4507–4515.
[6] P. Pan, Z. Xu, Y. Yang, F. Wu, and Y. Zhuang, ''Hierarchical recurrent neural encoder for video representation with application to captioning,'' in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 1029–1038.
[7] Y. Pan, T. Mei, T. Yao, H. Li, and Y. Rui, ''Jointly modeling embedding and translation to bridge video and language,'' in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 4594–4602.
[8] L. Baraldi, C. Grana, and R. Cucchiara, ''Hierarchical boundary-aware neural encoder for video captioning,'' in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jul. 2017, pp. 1657–1666.
[9] Z. Gan, C. Gan, X. He, Y. Pu, K. Tran, J. Gao, L. Carin, and L. Deng, ''Semantic compositional networks for visual captioning,'' in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 5630–5639.
[10] Y. Pan, T. Yao, H. Li, and T. Mei, ''Video captioning with transferred semantic attributes,'' in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 6504–6512.
[11] Y. Yu, H. Ko, J. Choi, and G. Kim, ''End-to-end concept word detection for video captioning, retrieval, and question answering,'' in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 3165–3173.
[12] B. Wang, L. Ma, W. Zhang, and W. Liu, ''Reconstruction network for video captioning,'' in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 7622–7631.
[13] F. Zhu, J. Hwang, Z. Ma, G. Chen, and J. Guo, ''Understanding objects in video: Object-oriented video captioning via structured trajectory and adversarial learning,'' IEEE Access, vol. 8, pp. 169146–169159, 2020.
[14] R. S. Bhooshan and K. Suresh, ''A multimodal framework for video caption generation,'' IEEE Access, vol. 10, pp. 92166–92176, 2022.
[15] S. Li, B. Yang, and Y. Zou, ''Adaptive curriculum learning for video captioning,'' IEEE Access, vol. 10, pp. 31751–31759, 2022.
[16] S. Venugopalan, M. Rohrbach, J. Donahue, R. Mooney, T. Darrell, and K. Saenko, ''Sequence to sequence—Video to text,'' in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Dec. 2015, pp. 4534–4542.
[17] R. Krishna, K. Hata, F. Ren, L. Fei-Fei, and J. C. Niebles, ''Dense-captioning events in videos,'' in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Oct. 2017, pp. 706–715.
[18] Y. Li, T. Yao, Y. Pan, H. Chao, and T. Mei, ''Jointly localizing and describing events for dense video captioning,'' in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 7492–7500.
[19] J. Wang, W. Jiang, L. Ma, W. Liu, and Y. Xu, ''Bidirectional attentive fusion with context gating for dense video captioning,'' in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 7190–7198.
[20] L. Zhou, Y. Zhou, J. J. Corso, R. Socher, and C. Xiong, ''End-to-end dense video captioning with masked transformer,'' in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 8739–8748.
[21] J. Mun, L. Yang, Z. Ren, N. Xu, and B. Han, ''Streamlined dense video captioning,'' in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2019, pp. 6588–6597.
[22] C. Deng, S. Chen, D. Chen, Y. He, and Q. Wu, ''Sketch, ground, and refine: Top-down dense video captioning,'' in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2021, pp. 234–243.
[23] S. Fujita, T. Hirao, H. Kamigaito, M. Okumura, and M. Nagata, ''SODA: Story oriented dense video captioning evaluation framework,'' in Proc. Eur. Conf. Comput. Vis., 2020, pp. 517–531.
[24] T. Lin, X. Zhao, H. Su, C. Wang, and M. Yang, ''BSN: Boundary sensitive network for temporal action proposal generation,'' in Proc. Eur. Conf. Comput. Vis., 2018, pp. 3–19.
[25] Z. Yuan, J. C. Stroud, T. Lu, and J. Deng, ''Temporal action localization by structured maximal sums,'' in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 3684–3692.
[26] F. Long, T. Yao, Z. Qiu, X. Tian, J. Luo, and T. Mei, ''Gaussian temporal awareness networks for action localization,'' in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2019, pp. 344–353.
[27] T. Lin, X. Liu, X. Li, E. Ding, and S. Wen, ''BMN: Boundary-matching network for temporal action proposal generation,'' in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2019, pp. 3889–3898.
[28] P. Zhao, L. Xie, C. Ju, Y. Zhang, Y. Wang, and Q. Tian, ''Bottom-up temporal action localization with mutual regularization,'' in Proc. Eur. Conf. Comput. Vis., 2020, pp. 539–555.
[29] I. Sutskever, O. Vinyals, and Q. V. Le, ''Sequence to sequence learning with neural networks,'' in Proc. Adv. Neural Inf. Process. Syst., 2014, pp. 3104–3112.
[30] K. Cho, B. van Merrienboer, Ç. Gülçehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio, ''Learning phrase representations using RNN encoder–decoder for statistical machine translation,'' in Proc. Conf. Empirical Methods Natural Lang. Process. (EMNLP), 2014, pp. 1724–1734.


[31] A. Liu, Y. Qiu, Y. Wong, Y. Su, and M. Kankanhalli, ''A fine-grained spatial–temporal attention model for video captioning,'' IEEE Access, vol. 6, pp. 68463–68471, 2018.
[32] H. Xiao and J. Shi, ''Video captioning with adaptive attention and mixed loss optimization,'' IEEE Access, vol. 7, pp. 135757–135769, 2019.
[33] H. Yu, J. Wang, Z. Huang, Y. Yang, and W. Xu, ''Video paragraph captioning using hierarchical recurrent neural networks,'' in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 4584–4593.
[34] Y. Xiong, B. Dai, and D. Lin, ''Move forward and tell: A progressive generator of video descriptions,'' in Proc. Eur. Conf. Comput. Vis., 2018, pp. 468–483.
[35] J. Lei, L. Wang, Y. Shen, D. Yu, T. Berg, and M. Bansal, ''MART: Memory-augmented recurrent transformer for coherent video paragraph captioning,'' in Proc. 58th Annu. Meeting Assoc. Comput. Linguistics, 2020, pp. 2603–2614.
[36] T. Wang, R. Zhang, Z. Lu, F. Zheng, R. Cheng, and P. Luo, ''End-to-end dense video captioning with parallel decoding,'' in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2021, pp. 6847–6857.
[37] V. Iashin and E. Rahtu, ''A better use of audio-visual cues: Dense video captioning with bi-modal transformer,'' in Proc. 31st Brit. Mach. Vis. Virtual Conf., 2020, pp. 1–22.
[38] T. Wang, H. Zheng, M. Yu, Q. Tian, and H. Hu, ''Event-centric hierarchical representation for dense video captioning,'' IEEE Trans. Circuits Syst. Video Technol., vol. 31, no. 5, pp. 1890–1900, May 2021.
[39] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri, ''Learning spatiotemporal features with 3D convolutional networks,'' in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Dec. 2015, pp. 4489–4497.
[40] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. Gomez, L. Kaiser, and I. Polosukhin, ''Attention is all you need,'' in Proc. 31st Int. Conf. Neural Inf. Process. Syst., 2017, pp. 6000–6010.
[41] J. Hosang, R. Benenson, and B. Schiele, ''Learning non-maximum suppression,'' in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 4507–4515.
[42] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár, ''Focal loss for dense object detection,'' in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Oct. 2017, pp. 2980–2988.
[43] R. J. Williams and D. Zipser, ''A learning algorithm for continually running fully recurrent neural networks,'' Neural Comput., vol. 1, no. 2, pp. 270–280, Jun. 1989.
[44] S. Banerjee and A. Lavie, ''METEOR: An automatic metric for MT evaluation with improved correlation with human judgments,'' in Proc. ACL Workshop Intrinsic Extrinsic Eval. Measures Mach. Transl. Summarization, 2005, pp. 65–72.
[45] R. Vedantam, C. L. Zitnick, and D. Parikh, ''CIDEr: Consensus-based image description evaluation,'' in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2015, pp. 4566–4575.
[46] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu, ''BLEU: A method for automatic evaluation of machine translation,'' in Proc. 40th Annu. Meeting Assoc. Comput. Linguistics, 2001, pp. 311–318.
[47] K. He, X. Zhang, S. Ren, and J. Sun, ''Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification,'' in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Dec. 2015, pp. 1026–1034.
[48] D. Hendrycks and K. Gimpel, ''Gaussian error linear units (GELUs),'' 2016, arXiv:1606.08415.
[49] I. Loshchilov and F. Hutter, ''Decoupled weight decay regularization,'' 2017, arXiv:1711.05101.
[50] S. J. Rennie, E. Marcheret, Y. Mroueh, J. Ross, and V. Goel, ''Self-critical sequence training for image captioning,'' in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 7008–7024.
[51] W. Choi, J. Chen, and J. Yoon, ''Parallel pathway dense video captioning with deformable transformer,'' IEEE Access, vol. 10, pp. 129899–129910, 2022.
[52] A. Yang, A. Nagrani, P. H. Seo, A. Miech, J. Pont-Tuset, I. Laptev, J. Sivic, and C. Schmid, ''Vid2Seq: Large-scale pretraining of a visual language model for dense video captioning,'' 2023, arXiv:2302.14115.
[53] M. Suin and A. Rajagopalan, ''An efficient framework for dense video captioning,'' in Proc. AAAI Conf. Artif. Intell., 2020, pp. 12039–12046.
[54] S. Chen and Y.-G. Jiang, ''Towards bridging event captioner and sentence localizer for weakly supervised dense event captioning,'' in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2021, pp. 8425–8435.
[55] P. H. Seo, A. Nagrani, A. Arnab, and C. Schmid, ''End-to-end generative pretraining for multimodal video captioning,'' in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2022, pp. 17959–17968.

WANGYU CHOI received the B.S. degree in computer science and engineering from the Tech University of Korea, in 2017. He is currently pursuing the Ph.D. degree with the Department of Computer Science and Engineering, Hanyang University, South Korea. His research interests include video streaming, mobile networking, and computer vision.

JIASI CHEN (Member, IEEE) received the B.S. degree from Columbia University, New York, NY, USA, with internships at AT&T Labs Research, Florham Park, NJ, USA, and NEC Labs America Inc., Princeton, NJ, USA, and the Ph.D. degree from Princeton University, Princeton. She is currently an Associate Professor with the Department of Computer Science and Engineering, University of California at Riverside, Riverside, CA, USA. Her current research interests include edge computing, wireless and mobile systems, and multimedia networking, with a recent focus on machine learning at the network edge to aid augmented reality (AR)/virtual reality (VR) applications. She was a recipient of the Hellman Fellowship and the UCR Regents Faculty Fellowship.

JONGWON YOON (Member, IEEE) received the B.S. degree in computer science from Korea University, in 2007, and the M.S. and Ph.D. degrees from the University of Wisconsin–Madison, in 2012 and 2014, respectively. He is currently an Associate Professor with the Department of Computer Science and Engineering, Hanyang University, South Korea. He is also the Founding Director of the Intelligent Machines Laboratory, which broadly focuses on research in machine learning, artificial intelligence, mobile systems, and HCI. He was a recipient of the Lawrence Landweber Fellowship during his Ph.D. study.
