
Neural Computing and Applications (2021) 33:8335–8354
https://doi.org/10.1007/s00521-020-05587-y

ORIGINAL ARTICLE

Student Class Behavior Dataset: a video dataset for recognizing, detecting, and captioning students' behaviors in classroom scenes
Bo Sun1 • Yong Wu1 • Kaijie Zhao1 • Jun He1 • Lejun Yu1 • Huanqing Yan1 • Ao Luo1

Received: 9 May 2020 / Accepted: 11 December 2020 / Published online: 30 January 2021
© The Author(s), under exclusive licence to Springer-Verlag London Ltd. part of Springer Nature 2021

Abstract
The massive increase in classroom video data enables the possibility of utilizing artificial intelligence technology to
automatically recognize, detect and caption students’ behaviors. This is beneficial for related research, e.g., pedagogy and
educational psychology. However, the lack of a dataset specifically designed for students’ classroom behaviors may block
these potential studies. This paper presents a comprehensive dataset that can be employed for recognizing, detecting, and
captioning students’ behaviors in a classroom. We collected videos of 128 classes in different disciplines and in 11
classrooms. Specifically, the constructed dataset consists of a detection part, recognition part, and captioning part. The
detection part includes a temporal detection data module with 4542 samples and an action detection data module with 3343
samples, whereas the recognition part contains 4276 samples and the captioning part contains 4296 samples. Moreover, the
students’ behaviors are spontaneous in real classes, rendering the dataset representative and realistic. We analyze the
special characteristics of the classroom scene and the technical difficulties for each module (task), which are verified by
experiments. Due to the particularity of classrooms, our dataset places increased demands on existing methods.
Moreover, we provide a baseline for each task module in the dataset and make a comparison with the current mainstream
datasets. The results show that our dataset is viable and reliable. Additionally, we present a thorough performance analysis
of each baseline model to provide a comprehensive comparison for models using our presented dataset. The dataset and
code are available to download online: https://github.com/BNU-Wu/Student-Class-Behavior-Dataset/tree/master.

Keywords Student Class Behavior · Video dataset · Recognition dataset · Detection dataset · Caption dataset · Baseline

Corresponding author: Jun He, hejun@bnu.edu.cn
1 School of Artificial Intelligence, Beijing Normal University, Xinjiekouwai Street No. 19, Beijing, China

1 Introduction

The rapid development of artificial intelligence (AI), especially deep learning, has driven significant progress in the field of computer vision and has enabled a wide range of applications, such as object recognition [1–3], object detection [4–7], object tracking [8–10], video retrieval [11–16], and visual question answering (VQA) [17]. Algorithms generally perform well on simple datasets, such as the UCF101 [18], HMDB [19] and MSVD [20] datasets. However, owing to the complexity and diversity of real life, models that perform well on simple datasets cannot meet real-world needs. Such needs and the challenges associated with the algorithms used have attracted the interest of researchers. Therefore, establishing datasets that can provide a basis for research has gained importance. Currently, remarkable progress has been made in building increasingly complex and realistic datasets, such as the AVA [21] and VATEX [22] datasets. However, there still exists a lack of datasets in education, and this strongly limits AI development in this field. Moreover, since education scenarios are complex, algorithms face many new challenges. Based on these findings, this study establishes a comprehensive dataset for the field of education.

Evaluations of education quality have attracted an increasing amount of attention from researchers in fields such as pedagogy and psychology. As a basic teaching form, classroom teaching has always been the core of


education. As part of a certain scenario, students' behaviors in a classroom are significant and should not be disregarded. Acquiring information about student behaviors is not only helpful for mastering students' learning, personality and psychological traits but also worthy of inclusion in evaluations of education quality.

While many studies have addressed behavior recognition or detection, they are time-consuming or inefficient. With the advent of the era of big data, classroom video data are often available; e.g., at Beijing Normal University, 3.4 TB of videos are recorded every day in 244 classrooms, thereby enabling advanced AI technology to automatically recognize, detect and caption students' behaviors. While research on education using AI is still developing, research on and prospective application of this technology are worthwhile. Currently, a specific dataset of students' classroom behaviors, which is necessary and primal for deep learning, is not available.

As the main goal of computer vision, understanding visual scenes involves many tasks, including recognition, detection, and captioning. Although several datasets exist for each task, few datasets serve all of these tasks simultaneously, especially in the field of education. Clearly, a class contains multiple subjects, i.e., students often simultaneously perform different types of actions that are often visually similar. The tasks of automatically recognizing, detecting, and even captioning these actions clearly pose strong challenges to most traditional approaches. Thus, special research via deep learning has considerable potential.

Since a dataset is primal for deep learning research, we present a student class behavior dataset in this paper. This dataset is a video collection of students' spontaneous actions in real classes obtained from 4.5 K student behavior video clips from different disciplines and different educational stages. The actions are divided into 11 typical behaviors, including taking notes, using computers, and discussing. Each behavior has a set of four corresponding descriptive sentences that are manually captioned. The dataset contains three submodules that depend on different tasks, namely, (1) trimmed videos for recognition; (2) untrimmed videos for detection; and (3) captioning tasks based on the recognition task. In addition to providing the baseline algorithms, we conduct a comparative experiment with current mainstream datasets to validate its challenges.

Aimed at further education research, describing students' behaviors is necessary, for which detecting subjects and their actions is always helpful. A video in a classroom always contains multiple students who perform different types of actions. The spatial detection of subjects and the temporal detection of actions prior to multisubject behavior recognition are necessary. Thus, the student class behavior dataset proposed in this paper contains three parts, namely, a detection part, recognition part and caption part. The former two parts are set up via action annotation, and the latter is set up with caption annotation. The detection part contains two data modules, while each of the other two parts contains one module, as shown in Fig. 1.

The dataset has the following six characteristics: (1) Special scenario. The dataset considers the core educational scene, i.e., a classroom, as the collection scenario, which has excellent prospects for research and application. (2) Real scene. We collect students' spontaneous actions in real classes using teaching videos. The videos show real teaching scenes, and the students' behaviors occur spontaneously. Thus, the dataset is representative and realistic. (3) Human-centric annotation. Rather than labeling an entire video with atomic actions or temporal segmentations, this dataset annotates students' actions, each of which is composed of several atomic actions, e.g., drinking. These actions reflect the students' states, which facilitates further research on educational topics. (4) Comprehensive and multitask. This dataset is a comprehensive dataset for multiple tasks, namely, recognizing, detecting, and captioning students' behaviors. A number of tasks are employed to automatically detect and caption students' states by using AI technology. Because these collections are closely related, multiple data modules are set up simultaneously. (5) More complex caption annotations. We provide more detailed captions than those of other captioned datasets, and this allows us to better understand students' actions. (6) Challenging. The spontaneity of students' actions poses many challenges to existing methods. Because classroom scenes may contribute to low video quality, with issues such as occlusion, the presence of multiple subjects, large differences in subject sizes, and few differences among action categories, an increased requirement is placed on algorithmic research for highly robust algorithms.

The remainder of this paper is organized as follows: Sect. 2 reviews existing datasets for action recognition, detection, and captioning. Section 3 describes the student class behavior dataset. Section 4 introduces the characteristics of the dataset and the challenges faced by existing methods. Section 5 presents the baseline and analyzes the difference between this dataset and other datasets, and Sect. 6 concludes the paper.

2 Related work

2.1 Action recognition and detection datasets

The amazing results in the field of action recognition are largely due to several mainstream behavioral recognition datasets, such as the KTH [23], Weizmann [24],


Fig. 1 Structure of the student class behavior dataset

Hollywood-2 [25], HMDB, and UCF101 datasets. The limitations of the KTH and Weizmann datasets are obvious: they have simple backgrounds, a lack of camera movement, few kinds of actions, and only one person performing a single movement in each video. The scenes of these datasets are highly different from scenes in reality. In the video samples of the Hollywood-2 dataset, factors such as the actors' expressions, postures, clothes, camera movements, illumination changes, occlusions, and backgrounds vary greatly but are similar to situations in real scenes, so analysis and recognition on this dataset are highly challenging. The numbers of action categories in the UCF101 and HMDB51 datasets are 101 and 51, respectively. Their uncertainties, which are induced by camera motion, cluttered backgrounds, different lighting conditions, occlusion, and videos with low quality, have made research challenging and have attracted the attention of many researchers. In addition to the previously mentioned datasets with short clips of manually tailored, individual behavioral activities, some automatically labeled datasets, such as the TrecVid MED [26], YouTube-8M [27], something-something [28], Sports-1M [29], SLAC [30], Moments in Time [31] and Kinetics [32] datasets, have emerged. The application of automatic labels arose due to the high cost of manual investment. Clearly, this investment is performed at the expense of noisy labels, which inevitably degrade the detection and recognition performance of a model compared with that obtained by the same model on completely manually labeled datasets.

With further research, both datasets and methods have begun to utilize temporal localization, e.g., the ActivityNet [33], THUMOS [34], MultiTHUMOS [35] and Charades [36] datasets, which use a large number of untrimmed videos, each of which contains at least one action. ActivityNet is currently the largest dataset of this type, with more than 648 h of untrimmed videos, approximately 20,000 videos, and 200 different daily activities. This dataset provides temporal (but not spatial) localization for each action of interest.

In recent years, some datasets based on spatiotemporal annotations, such as the CMU [37], MSR Actions [38], UCF Sports [39], JHMDB [40] and AVA datasets, have emerged. Some datasets, such as the UCF101, DALY [41] and Hollywood2Tubes [42] datasets, have been upgraded from their original contents to spatiotemporal localization in untrimmed videos.

2.2 Video caption datasets

It has been a fundamental yet evolving challenge for computer vision to automatically caption visual content with natural language. In particular, there has been enormous interest in this area due to the presence of increasingly large datasets. Caption datasets can be divided into specific fields and open fields based on their application backgrounds.

The specific field has two main categories: cooking and movies. Cooking datasets mainly include the MP-II Cooking [43], YouCook [44], TACoS [45], TACoS Multilevel [46] and YouCook-II [47] datasets. The MP-II Cooking dataset contains 65 cooking activities, in which data are recorded in the same kitchen by a camera mounted on the ceiling. The YouCook dataset contains 88 YouTube cooking videos about the processes by which different people cook a variety of recipes. Compared with the MP-II Cooking dataset, the YouCook dataset is more challenging and has different backgrounds and multiview shots.


TACoS contains 26 fine cooking events in 127 videos; each video contains 20 different text captions. TACoS Multilevel was upgraded based on TACoS; it provides three levels of captioning for each video in the TACoS corpus, including a detailed caption, a short caption, and a brief caption of a single sentence. Typical movie-based datasets are the MPII-Movie Description Corpus [48] and the Montreal Video Annotation Dataset (M-VAD) [49]. The former contains transcribed audio captions extracted from 94 Hollywood movies; the average clip length is 3.9 s, with roughly one sentence per clip. The latter contains 48,986 video clips extracted from 92 different movies. Each clip spans an average of 6.2 s, and the total number of sentences is 55,904.

Datasets with open fields mainly include the Microsoft Video Description (MSVD), MSR-Video to Text (MSR-VTT) [50], ActivityNet Captions [51], ANet-Entities [52], Charades [36] and VideoStory [53] datasets. The MSVD dataset contains an average of 41 manually labeled sentences per video. The videos are obtained from YouTube, and the length of each video, which mainly shows one event, ranges between 10 and 25 s. The MSR-VTT dataset is divided into 20 different categories; each category is captioned by 20 Amazon Mechanical Turk (AMT) workers. In addition to visual content, this dataset contains audio information that may be employed for multimodal research. Charades contains 9848 daily indoor family activity videos. The average duration of each video, which mainly depicts activities of daily living, is 30 s. Different from other datasets, the ActivityNet Captions dataset describes a certain activity in a video, that is, it describes a certain piece of visual content in the video; therefore, a corresponding start time and end time exist. The task requires that the corresponding event be detected and then captioned, which is clearly the more challenging part. The ANet-Entities dataset, which was the first dataset with grounded entity annotations, is based on the ActivityNet Captions dataset. VideoStory was designed to solve the narrative or caption generation problem of videos. Each paragraph contains an average of 4.67 sentences, with an average of 13.32 words per sentence. Table 1 [54] shows information obtained from current mainstream video caption datasets.

3 Data collection

3.1 Action vocabulary generation

Since students may act randomly and arbitrarily, many types of student behaviors are observed in classrooms. We obtained a reasonable label vocabulary based on classroom rules and regulations and the related literature [56–58]. We collected candidate action class sets with obvious visual characteristics in mass classroom data. From these sets, we selected typical classroom behaviors. Eventually, an action vocabulary with 11 typical behaviors, e.g., taking notes, using mobile phones, and carefully listening to the teacher, became available.

3.2 Educational video selection

At Beijing Normal University, 244 classrooms are monitored by 1920 × 1080 high-definition cameras (brand: HIKVISION; type: DS-2DF6223), which are deployed by the Public Resources Management Center. Figure 2 shows the deployment of one such camera. Each day, these cameras produce 3.4 TB of data, i.e., daily classroom videos, which provide the foundation for our data collection.

Considering the illumination of and camera placement in a classroom, videos of 11 classrooms were adopted. A total of 128 untrimmed videos from different disciplines and 13 courses (including computer science, mathematics, and geography) were collected. The students in the videos, who comprise undergraduates, master's students, and doctoral candidates, are at different stages of education. The number of participants in each class varies from 7 to 34.

3.3 Action annotation

As previously mentioned, describing a classroom video often entails describing different types of actions performed by multiple students and requires multisubject behavior recognition, which involves not only the recognition of certain actions but also the detection of their temporal and spatial locations. Thus, action annotation is related to the detection and recognition parts of our proposed dataset. The former includes two data modules: a module for temporal detection and a module for action detection.

3.3.1 Recognition part

Students' classroom behaviors are always considered one of the external manifestations of student states. Automatically obtaining these behaviors and manually recording them are difficult tasks. For further research on recognition algorithms for students' classroom behaviors, the recognition part of our proposed dataset is provided for the recognition task.

Recognition data module

The recognition part is one module that consists of many video clips of single behaviors of a single subject. By


Table 1 Existing video caption datasets

Dataset        Domain    Classes   Videos   Clips    Sent     Words      Vocab
MP-II Cook     Cooking   65        44       –        5609     –          –
YouCook        Cooking   6         88       –        2688     42,457     2711
TACoS          Cooking   26        127      7206     18,227   146,771    28,292
TACosMlevel    Cooking   1         185      14,105   52,593   –          –
YouCook-II     Cooking   89        2000     15.4 k   15.4 k   –          2600
MPII-MD        Movie     –         94       68,337   68,375   653,467    24,549
M-VAD          Movie     –         92       48,986   55,904   519,933    17,609
VTW [55]       Open      –         18,100   –        44,613   –          –
MSVD           Open      218       1970     1970     70,028   607,339    13,010
MSR-VTT        Open      20        7180     10 k     200 k    1,856,523  29,316
ActyNet Cap    Open      –         20 k     –        100 k    1348 k     –
ANet-Entities  Open      –         14,281   52 k     –        –          –
Charades       Open      157       9848     –        27,847   –          –
VideoStory     Open      –         20 k     123 k    123 k    –          –

Dataset: dataset name; Domain: the area of the video content; Classes: the number of video categories; Videos: the number of videos collected; Clips: the number of video clips cropped out from the original videos; Sent: the number of captions in the dataset; Words: the number of words in the dataset; Vocab: the vocabulary size of the dataset

Fig. 2 Model diagrams of a classroom from different views. a Side view, b top view

manually annotating the categories of subjects' actions, subjects' positions, and the start/end times for whole videos, we can obtain the clips.

First, we numbered the original classroom videos. Second, the 128 videos were distributed to 12 professionally trained annotators who watched and labeled the videos according to our student action classes. Since the videos contained all the students in the class, each spatial location was marked with the coordinates of the upper left corner and lower right corner, together with the start and end times of each action. Last, with software named Format Factory, we cut out the partial clips based on the annotations and then converted them into single-behavior video clips for a single subject.

Therefore, the annotation for each clip contains nine pieces of information, namely, the video clip number, action number, action category label, start time, end time, x-coordinate of the upper left corner of the action subject, y-coordinate of the upper left corner of the action subject, x-coordinate of the lower right corner of the action subject, and y-coordinate of the lower right corner of the action subject. For instance, 47-f-03-20-discuss-00_01_10-00_01_35-263-160-784-533 corresponds to the time period from 00:01:10 to 00:01:35 in Video No. 47-f-03. An action of discussion is conducted by the subject with the coordinates (263, 160) in the upper left corner and the coordinates (784, 533) in the lower right corner.
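To make this nine-field layout concrete, the short Python sketch below parses a clip name of the form shown above. The field names, and the assumption that the video number always consists of the first three dash-separated tokens, are our own illustrative choices and are not part of the released format.

```python
from dataclasses import dataclass

@dataclass
class ClipAnnotation:
    video_id: str   # e.g., "47-f-03" (assumed to be the first three tokens)
    action_no: int  # running action number within the video
    label: str      # action category, e.g., "discuss"
    start: str      # start time, hh_mm_ss
    end: str        # end time, hh_mm_ss
    x1: int         # upper-left corner of the subject box
    y1: int
    x2: int         # lower-right corner of the subject box
    y2: int

def parse_annotation(name: str) -> ClipAnnotation:
    """Split a clip name such as '47-f-03-20-discuss-00_01_10-00_01_35-263-160-784-533'."""
    parts = name.split("-")
    video_id = "-".join(parts[:3])          # assumed video-number layout
    action_no, label, start, end = parts[3], parts[4], parts[5], parts[6]
    x1, y1, x2, y2 = map(int, parts[7:11])
    return ClipAnnotation(video_id, int(action_no), label, start, end, x1, y1, x2, y2)

print(parse_annotation("47-f-03-20-discuss-00_01_10-00_01_35-263-160-784-533"))
```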
3.3.2 Detection part

Automatically detecting students' classroom behaviors in original videos is essential for automatically understanding students' learning states and evaluating education quality. For further research, the detection part of the dataset, which is composed of two data modules, i.e., a data module for temporal action detection and a data module for action detection, is provided.

Detecting a student's behavior state and its changes by performing temporal action detection is helpful for analyzing his/her learning situation and psychological changes


to personalize his/her education. This dataset provides a temporal detection data module.

Detecting the states of students in the classroom by action detection is helpful for evaluating teaching quality and investigating students' interest in different teaching methods or topics, adjusting the teaching plan and improving teaching quality. This dataset provides an action detection data module.

Temporal detection data module

Collection of videos and clip selection

Each collected original video is a 45-min video of the entire class. We acquired the space coordinates, as well as the starting and ending times of each action, based on the annotation of the recognition module. To obtain the samples for the temporal detection task, we cropped out the videos of each student according to their coordinates. As direct detection on the 45-min-long video was challenging, we divided the original video into several video clip samples. Considering that approximately 90% of the annotated actions were within 40 s, we segmented the videos into samples of 90 s. Once a cutting point fell inside an action, we extended it to the next clip to guarantee that each clip sample contained at least one annotated action. The average length of the clip samples was 96.6 s.
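The following minimal sketch illustrates one possible reading of this segmentation rule (our own illustration, not the released preprocessing script): the per-student video is cut into nominal 90-s windows, and a cut point that falls inside an annotated action is pushed past that action so the action is not split.

```python
def segment_video(duration, actions, window=90.0):
    """Split [0, duration] (seconds) into clip samples of roughly `window` seconds.

    `actions` is a list of (start, end) annotations in seconds. If a nominal
    cut point lands inside an action, the cut is moved past that action so the
    action stays whole within a single clip sample.
    """
    clips, start = [], 0.0
    while start < duration:
        cut = min(start + window, duration)
        for a_start, a_end in actions:
            if a_start < cut < a_end:        # cut point falls inside an action
                cut = min(a_end, duration)   # extend the clip past the action
                break
        clips.append((start, cut))
        start = cut
    return clips

# toy example: a 200-s video with one action crossing the 90-s boundary
print(segment_video(200.0, [(80.0, 110.0), (150.0, 160.0)]))
```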
Annotations

The above processing provided us with video clip samples; we used the action category, together with the starting and ending points of the annotated action in the recognition module, as the label of each sample. As a result, a total of 4542 samples was collected to build the temporal detection data module. Figure 3 shows an example.

Fig. 3 Example of the temporal detection data module that shows a clip of 5 frames extracted from the video at a medium distance. The clip shows that the subject has a reading action between 42 and 84 s

Action detection data module

Collection of videos and clip selection

To obtain the action detection samples, we divided the original video into several video clip samples based on the starting and ending times of the annotated actions in the recognition module. There existed at least one annotated action in each video clip sample.

Annotation

The above processing provided us with video clip samples; we used the action category and the coordinates of the annotated action in the recognition module as the label of each sample. As a result, a total of 3343 samples was collected to build the action detection data module.

3.4 Caption annotation

A caption may precisely express the details of an activity. Although a student's behavior is available with the label, his/her mental state and action details cannot be observed from the label alone. They are the focus of our study. Based on the recognition data module, we captioned each video clip to build a caption data module.

3.4.1 Caption data module

Collection of videos and clip selection

Because the recognition module only recognizes a single topic of the video, much other information cannot be extracted at all. However, more specific descriptions of the visual content are demanded in the educational scene, especially regarding student behaviors. Therefore, based on the samples from the recognition module, we built the caption module by deleting low-quality clips and adding some good clips, resulting in a total of 4296 samples.

Sentence annotation

Generally, most caption datasets describe a video in only a simple way, but we needed more detailed descriptions. Therefore, we were strict with the annotators who participated in the caption annotation process. The whole procedure can be described as follows: First, we recruited and trained a group of annotators; second, we asked different annotators to describe the same video; and finally, we asked reviewers to review the descriptions. The details are as follows:

During annotator training, we conducted a qualification review (mainly to test their English level/ability) and


professional training for all the annotators. All the qualified people were divided into two groups, i.e., annotators and reviewers. The former group was in charge of captioning the videos, and the latter was responsible for checking the caption quality, e.g., spelling and syntax.

Four annotators were asked to caption each labeled and trimmed video with one sentence. The captioning process had four requirements [59]:

1. Use actions and nouns that appear frequently in a class, for example, talk, hands up, and blackboard.
2. Do not describe any "predicted" information, for example, it seems that the class will be over.
3. Do not use special nouns, such as people's names.
4. Do not use overly simple descriptions, for example, a group of people who are sitting.

After the captioning annotation, the reviewers examined the spelling and sentence syntax of each annotation.

3.5 Quality control and privacy protection

Regarding privacy protection, the annotators were required to sign an agreement before participating in our work, in which they agreed not to disseminate or modify the original videos. In addition, all personal information of the subjects, including names and addresses, was deleted or hidden during the whole database construction process.

Three measures were taken to ensure the quality of the annotations: (1) Annotator training. A set of annotator training programs was developed and conducted for each volunteer. Only volunteers who passed the pre-annotating appraisal could become annotators. (2) Pre-annotating and annotator screening. After the training process was completed, small batches of videos were distributed to the volunteers for pre-annotating, and their annotation work was supervised and guided. Based on this evaluation, annotators were screened. (3) Annotation and review. The annotations were reviewed by dedicated staff after the annotated videos were collected back.

3.6 Dataset splits

Referring to the division principle of current mainstream datasets, we divided the samples of each module randomly into a training set, a validation set and a test set at a ratio of 3:1:1, as shown in Table 2.
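A minimal sketch of such a random 3:1:1 split is given below (illustrative only; for comparability, the split released with the dataset should be used).

```python
import random

def split_samples(sample_ids, ratios=(3, 1, 1), seed=0):
    """Randomly partition sample ids into train/val/test at the given ratio."""
    ids = list(sample_ids)
    random.Random(seed).shuffle(ids)
    total = sum(ratios)
    n_train = len(ids) * ratios[0] // total
    n_val = len(ids) * ratios[1] // total
    return ids[:n_train], ids[n_train:n_train + n_val], ids[n_train + n_val:]

train, val, test = split_samples(range(4276))   # recognition-module size
print(len(train), len(val), len(test))          # sizes approximately in the 3:1:1 ratio
```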
4 Student behavior dataset

The paucity of data has restricted the application of AI in education. Therefore, we constructed a student behavior dataset to provide a basis for such research. Given the complexity and specificity of educational scenes, this dataset has some distinctive characteristics, which could bring new challenges to researchers. The details of the dataset are introduced in the following sections.

4.1 Characteristics of the dataset

4.1.1 The classroom as the collection scenario

As the basic form of teaching, classroom teaching always has a core role in education. Acquiring student behavior is the key, which is not only helpful for mastering students' learning, personality and psychological traits but also worthy of inclusion in evaluations of education quality. With the help of the massive amounts of classroom teaching videos generated in the era of big data, students' classroom behaviors can be automatically recognized, detected and captioned by AI technology to evaluate and improve the quality of classroom teaching. However, the required dataset has not been developed for the educational scene, thereby hindering related research in this respect. Therefore, the establishment of a student behavior dataset is urgently needed.

4.1.2 Real videos as annotated material

In movie datasets, such as the MPII-Movie dataset, the behavior of people is artificially limited. In reality, people's behavior is more complicated and abundant. To reflect reality, our video materials were collected directly from actual classroom scenes. Student behaviors in the videos were spontaneous, which guarantees the authenticity and representativeness of our dataset.

4.1.3 Human-centric annotation

The objects of our research are students. We aim to reflect the students' learning situations and psychological states through their actions. Therefore, unlike other datasets that caption the entire video in general, our dataset is human-centric, with annotations based on students, whether in the recognition, detection, or caption modules, and the dataset effectively satisfies the requirements of the educational field.

4.1.4 Comprehensive and multitask dataset

At present, the lack of comprehensive datasets causes some problems for AI research in the field of education. To alleviate this problem, we built a comprehensive dataset that can support various tasks in this field.


Table 2 Partition sizes of the Student Class Behavior Dataset

               Recognition data module   Temporal detection data module   Action detection data module   Caption data module
Total sample   4276                      4542                             3343                           4296
Training       2559                      2725                             2014                           2602
Validation     858                       908                              659                            862
Test           859                       909                              670                            832

4.1.5 More complex caption annotations

The use of richer language, such as adjectives, adverbs, and more complex sentence structures, to caption a behavior can better reflect the state and details of a subject. Therefore, we provide more detailed captions than other caption datasets, and this allows us to better understand the students' actions. Figure 4 shows an example from different caption datasets. Obviously, our captions are more detailed, and the sentence structure is more complicated than those of other datasets.

Fig. 4 Examples of three different datasets. a and b Captions from the MSVD dataset and MSR-VTT dataset, respectively, and c is a caption from our dataset

4.1.6 Challenging

Because classroom scenes always contribute to low-quality and complex videos, our dataset is highly challenging, and this is addressed in the next section.

4.2 Challenges of the dataset

4.2.1 Multiple subjects

A classroom is a multisubject environment in which different subjects perform different actions; for example, a maximum of 34 subjects simultaneously appear in our


dataset. This situation creates great challenges for the detection task.

4.2.2 Large differences in subject sizes (pixels)

In a classroom, students sit in different positions; therefore, their picture sizes vary in the video, thereby creating extra challenges for existing methods. Figure 5 shows an example of a large difference in subject sizes in a video.

Fig. 5 Examples of multisubject situations. The sizes of the subjects in the two red boxes in the video differ substantially

4.2.3 Less difference between action classes (similar actions)

Limited by the characteristics of specific actions, some behavior classes are visually similar, which undoubtedly creates challenges for recognition and detection. Figure 6 shows the visual similarity between two different action classes. We observe that their actions are very similar visually.

Fig. 6 Examples of the visual similarity between two different action classes. a and b show the "taking notes" behavior and the "using phone" behavior, respectively

4.2.4 Less difference between the foreground and background

Due to the characteristics of classroom teaching, students' activities are always restricted to a certain extent, and this reduces the difference between the foreground and the background. This restriction, which occurs for different subjects and in different periods for the same subject, hinders temporal and spatial action detection. Figure 7 shows an example of a similar foreground and background.

4.2.5 Occlusion

Due to the limitations of the scene and the deployment of cameras, some of the students' behaviors are occluded to varying degrees, especially when many students are in a classroom. This finding indicates the need for higher requirements in algorithmic research than the current requirements. Figure 8 shows two examples of occlusion.

4.3 Annotation statistics

4.3.1 Recognition

The actions are divided into 11 classes, namely, listening carefully, taking notes, using mobile phones, yawning, eating or drinking, reading, discussing, looking around, using computers, sleeping or snoring, and raising hands. Figure 9 shows the number of samples for each class. The class with the highest frequency is "listening carefully", while the class with the lowest frequency is "raising hands". Since the collected samples depict students with high levels of education, the behavior of "raising hands" is relatively rare. Thus, this class is removed from the baseline for recognition and action detection.

4.3.2 Action detection

A total of 3343 video samples was collected in this module; each sample contains multiple subjects, from 7 to 19 people. Figure 10 shows the distribution of the number of subjects in each video clip sample. The average resolution of a video sample is 1450 × 870, and the average length of a video is 17.2 s.

4.3.3 Temporal detection

A total of 4542 untrimmed videos was collected in this module, with a total capacity of 9.64 GB. Each video contains at least one annotation. The longest untrimmed video is 187 s, the shortest is 15 s, and the average length is 96.6 s. Figure 11 shows the distribution of the duration of each action. As shown, 95% of the


action durations are less than 50 s, among which 71% are less than 20 s.

Fig. 7 Examples of the "taking notes" behavior. A clip of 5 frames was extracted from a video at a medium distance

Fig. 8 Examples of occluded behaviors. a and b The "using phone" behavior and the "reading book" behavior, respectively

4.3.4 Video captioning

A total of 4295 videos was captioned in this module; each video is paired with four descriptive sentences in both Chinese and English versions. In the English version, a total of 261,582 words drawn from a vocabulary of 1850 terms are included, with an average of 15.3 words per sentence. Compared with several mainstream datasets, such as the MSVD dataset with 8.7 words per sentence and the MSR-VTT dataset with 9.3 words per sentence, our dataset has more detailed captions and more complicated sentence structures.

Fig. 9 Distribution of the numbers of behaviors in all action classes (y-axis: number of samples in each class)

Fig. 10 Distribution of the number of subjects in this dataset (x-axis: number of people, 7–19; y-axis: number of samples)


Fig. 11 Distribution of the duration of each action (x-axis: duration of action in 10-s bins from 1–10 to 91–100 s; y-axis: number of samples)

5 Baseline and analysis

5.1 Recognition

5.1.1 Model and setting

The model employed for the baseline is the two-stream network [60], which utilizes the notion that the appearance of a subject in a static frame and the motion information between successive frames can complement each other in the recognition of a specific behavior. The two-stream network takes ResNet-101 as the backbone for both the spatial and temporal streams. The spatial stream captures the spatial information of behaviors from static video frames, while the temporal stream captures the motion information of behaviors from dense optical flow.

The training procedure is generally the same for both the spatial and temporal networks. The network weights are learned using mini-batch stochastic gradient descent with momentum (set to 0.9). At each iteration, a mini-batch of 8 samples is constructed by sampling 8 training videos (uniformly across the classes); a single frame is randomly selected from each video. In spatial network training, a 224 × 224 sub-image is randomly cropped from the selected frame; it then undergoes random horizontal flipping and RGB jittering. In temporal network training, we compute the optical flow volume I for the selected training frame. From this volume, a fixed-size 224 × 224 × 2L image is randomly cropped and flipped. For this task, L is set to 10. The learning rate is initially set to 10⁻³ and then decreased according to a fixed schedule, which is kept constant for all training sets. In the fine-tuning step, when training the ConvNet from scratch, the rate is changed to 10⁻³ after 10 K iterations, changed to 10⁻⁴ after 14 K iterations, and training is stopped after 20 K iterations. The whole network is trained on an Nvidia GTX 1080Ti GPU using the PyTorch deep learning framework.
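Read literally, the optimizer settings above correspond roughly to the following PyTorch sketch. This is our own illustration rather than the authors' released code: the dataloader and training loop are omitted, the color-jitter values are placeholders, and the step schedule only approximates the iteration counts in the text.

```python
import torch
import torchvision
from torchvision import transforms

# spatial stream: ResNet-101 over single RGB frames (10 behavior classes kept)
spatial = torchvision.models.resnet101(num_classes=10)

# temporal stream: same backbone, but the first conv takes a stacked optical-flow
# volume of 2L = 20 channels (L = 10 flow frames, x and y components)
temporal = torchvision.models.resnet101(num_classes=10)
temporal.conv1 = torch.nn.Conv2d(20, 64, kernel_size=7, stride=2, padding=3, bias=False)

# random 224x224 crop, horizontal flip and colour jitter stand in for the
# "RGB jittering" described in the text
augment = transforms.Compose([
    transforms.RandomCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(0.2, 0.2, 0.2),
    transforms.ToTensor(),
])

optimizer = torch.optim.SGD(
    list(spatial.parameters()) + list(temporal.parameters()),
    lr=1e-3, momentum=0.9)                      # mini-batch SGD, momentum 0.9

# step schedule approximating the 10 K / 14 K iteration drops in the text
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[10_000, 14_000], gamma=0.1)
```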
5.1.2 Datasets and metrics

We compare two standard datasets—HMDB51 and UCF101—which are commonly employed for behavior recognition. The HMDB51 dataset is mainly derived from the video network site YouTube and digital cinema, and it contains a total of 6676 videos of 51 categories with a spatial resolution of 320 × 240 pixels. The UCF101 dataset contains a total of 13,320 videos in 101 categories. Each category of video is divided into 25 groups and contains at least 100 videos, with video lengths from 29 to 1776 frames and a spatial resolution of 320 × 240 pixels. For these two datasets, we apply the standard training/testing splits and protocols provided by the datasets. For the HMDB51 and UCF101 datasets, we report the average accuracy over the three splits. The formula for average accuracy is shown below:

Average Acc = (number of videos classified correctly) / (number of all test videos)

5.1.3 Evaluation and analysis

Table 3 Mean accuracy on UCF-101, HMDB-51 and our dataset

                  UCF101 (%)   HMDB-51 (%)   Our dataset (%)
Spatial stream    73           40.5          73.5
Temporal stream   83.7         54.6          51
Two-stream        86.9         58            61

Table 3 shows the experimental results of the three datasets for the two streams. As mentioned in Sect. 4.2, due to the particularity of the classroom scene, the movements involved in students' classroom behaviors are relatively small, so they generate relatively little motion information in the optical flow. The motion information of the "listening carefully" sample in Fig. 12 is very small, and this causes


poor performance on the temporal stream, thereby affecting the overall effect of the two-stream network. Figure 13 shows some failure cases of the baseline model in the recognition module. We can see that some different behaviors of students in the classroom are visually very similar. This kind of interclass similarity brings great challenges to the algorithm. We enumerate the comparison results of the three datasets using the two-stream network, as shown in Table 3. In addition, Table 4 shows the performance of the model, which provides a benchmark for subsequent research.

Fig. 12 Example of the "listening carefully" behavior. A clip of 5 frames was extracted from the "listening carefully" video at equal distances

Fig. 13 Failure cases of recognition by the two-stream model on our dataset (ground-truth label "playphone" predicted as "sleep" and "note")

5.2 Action detection

5.2.1 Model and setting

The models utilized for the baseline are the ACAM [61] and the MOC [62] models.

ACAM model

The ACAM model generates attention maps conditioned on each actor from contextual and actor features. Attention maps are generated for each feature dimension, and they determine the relation of the actor to every spatiotemporal context location. This attention mechanism allows us to focus on the actor without cropping, as in RoIPooling, while capturing the spatiotemporal structure of the scene. We initialize the models with I3D weights trained on the Kinetics-400 dataset and train our models with an Adam optimizer and a cosine learning rate between a maximum of 0.02 and a minimum of 0.0001 for 40 epochs. We employ a batch size of 2 per GPU, and we have 2 Nvidia 1080 Ti's (total batch size of 4).


Table 4 Performance of the two-stream model on our dataset

                  Parameters    Train time (h)   Test time (s)   Results (%)
Spatial stream    –             14.7             4.1             73.5
Temporal stream   –             15.6             12.5            51
Two-stream        42,707,109    –                9.2             61

MOC model

The MOC model presents an action tubelet detection framework, termed the moving center detector (MOC-detector), by treating an action instance as a trajectory of moving points, based on the insight that movement information could simplify and assist action tubelet detection. It is composed of three crucial head branches: (1) a center branch for instance center detection and action recognition, (2) a movement branch for movement estimation at adjacent frames to form the trajectories of moving points, and (3) a box branch for spatial extent detection by directly regressing the bounding box size at each estimated center. These three branches work together to generate the tubelet detection results, which can be further linked to yield video-level tubes with a matching strategy.

We choose DLA34 [63] as our backbone. The frame is resized to 900 × 900. The spatial downsampling ratio R is set to 4, and the size of the resulting feature map is 225 × 225. We use Adam with a learning rate of 5e-4 to optimize the overall objective. The learning rate is adjusted for convergence on the validation set, and it decreases by a factor of 10 when the model performance saturates. The maximum number of iterations is set to 20 epochs. We employ a batch size of 2 per GPU, and we have 2 Nvidia 1080 Ti's (total batch size of 4).
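The "decrease by a factor of 10 when performance saturates" policy corresponds to a standard plateau scheduler; the PyTorch fragment below sketches it. The patience value and the stand-in model and validation function are our own illustrative choices, not taken from the paper.

```python
import random
import torch

def validate() -> float:
    """Placeholder for evaluation on the validation split (returns frame-mAP)."""
    return random.random()

model = torch.nn.Linear(10, 10)              # stand-in for the detector backbone
optimizer = torch.optim.Adam(model.parameters(), lr=5e-4)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="max", factor=0.1, patience=2)

for epoch in range(20):                      # the paper caps training at 20 epochs
    # ... one training epoch would go here ...
    scheduler.step(validate())               # decrease LR x10 when val mAP saturates
```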
5.2.2 Datasets and metrics

The J-HMDB dataset consists of 928 videos in 21 action classes. Each video consists of 15–40 frames that span 1–2 s, and the time span of one action covers the entire video clip. Thus, each video is only marked with an action class. The atomic visual actions (AVA) dataset consists of 386 K video clips taken from 430 15-min movie clips at 1 FPS. The dataset uses a 55:15:30 split, which yields 211 K training segments, 57 K validation segments and 118 K test segments. Each clip may contain multiple subjects, and each subject has at least one of the 80 total action classes.

The evaluation metric is the frame-level mean average precision (frame-mAP) with an IoU threshold of 0.5, as described in [21]. The frame-mAP measures the area under the precision-recall curve of the detections for each frame. A detection is correct if its intersection-over-union with the ground truth at that frame is greater than the threshold and the action label is correctly predicted [64].
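The correctness criterion can be written down directly; the helper below is an illustrative sketch, not the official evaluation code, and assumes boxes given in the (x1, y1, x2, y2) convention used in the annotations.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def is_true_positive(det_box, det_label, gt_boxes, gt_labels, thr=0.5):
    """A detection counts if IoU > thr with a ground-truth box of the same class."""
    return any(label == det_label and iou(det_box, box) > thr
               for box, label in zip(gt_boxes, gt_labels))

print(is_true_positive((263, 160, 784, 533), "discuss",
                       [(250, 150, 800, 540)], ["discuss"]))   # True
```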
Table 5 Comparison between the two datasets and our dataset with the ACAM model and the MOC model in terms of frame-mAP

        J-HMDB (%)   AVA (%)   Our dataset (%)
MOC     74           –         11.7
ACAM    78.90        22.67     15.84

5.2.3 Evaluation and analysis

Table 5 shows the experimental results of the three datasets for the ACAM model and the MOC model. The results show that detecting actions with our dataset is challenging. As mentioned in Sect. 4.2, due to the particularities of classrooms, such as occlusions, the presence of multiple subjects, large differences in subject sizes, and small differences among action categories, detecting actions with these algorithms is highly difficult, which indicates the need for increased requirements in algorithmic research. In addition, Table 6 shows the performance comparison between the two models, which provides a benchmark for subsequent research.

Table 6 Performance comparison between the MOC model and the ACAM model on our dataset

        Parameters    Train time (h)   Test time (s)   Results (%)
MOC     20,544,180    78.1             1980            11.7
ACAM    12,290,234    23.3             900             15.84

5.3 Temporal detection

5.3.1 Model and setting

The models utilized for the baseline are the BSN [65] and the DBG [66] models.

BSN model

The BSN model consists of three modules: a temporal evaluation module, a proposal generation module, and a proposal evaluation module. Based on the image feature sequence extracted by the two-stream network, the temporal evaluation module (TEM) in the BSN uses three temporal convolution layers to generate the action start probability sequence, the action end probability sequence, and the action probability sequence. From the TEM results, candidate time nodes can be obtained. Then, the candidate start time nodes and the candidate end time nodes are combined to retain the fragments with the required durations as temporal action proposals and to generate a proposal-level feature (the boundary-sensitive proposal feature) in the proposal generation module (PGM). In the proposal evaluation module (PEM), a multilayer perceptron (MLP) model is


used to estimate the confidence score of each proposal. Finally, the soft-NMS algorithm is employed to suppress overlapping results.

For visual feature encoding, we use the same strategy as that employed in [65]. During training, the temporal evaluation module is trained with a batch size of 16 and a learning rate of 0.001 for 10 epochs and then trained with a learning rate of 0.0001 for another 10 epochs. The PEM is trained with a batch size of 256 and the same learning rates. For the soft-NMS algorithm, we set the threshold θ to 0.8 and set the parameter ε in the Gaussian function to 0.75 for both datasets. We implement the algorithms using TensorFlow on a pair of Nvidia GTX 1080Ti GPUs.
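The Gaussian soft-NMS step for temporal proposals can be sketched as follows. This is a simplified illustration of one common variant, using the score threshold θ = 0.8 and Gaussian parameter ε = 0.75 from the text; the BSN and DBG implementations may differ in detail.

```python
import math

def t_iou(a, b):
    """Temporal IoU of two segments (start, end)."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def soft_nms(proposals, theta=0.8, eps=0.75):
    """Gaussian soft-NMS over proposals given as [(start, end, score), ...].

    The highest-scoring proposal is kept in turn, and the scores of proposals
    overlapping it by more than `theta` are decayed by exp(-tIoU^2 / eps).
    """
    props = sorted(proposals, key=lambda p: p[2], reverse=True)
    kept = []
    while props:
        best = props.pop(0)
        kept.append(best)
        rescored = []
        for s, e, score in props:
            overlap = t_iou((best[0], best[1]), (s, e))
            if overlap > theta:
                score *= math.exp(-(overlap ** 2) / eps)
            rescored.append((s, e, score))
        props = sorted(rescored, key=lambda p: p[2], reverse=True)
    return kept

print(soft_nms([(10, 40, 0.9), (12, 41, 0.8), (60, 90, 0.7)]))
```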
DBG model

The DBG model mainly includes the following steps. First, the dual-stream BaseNet is used to generate rich, discriminative features at two different levels. Then, a temporal boundary classification (TBC) module is adopted to provide two temporal boundary confidence maps and predict precise temporal boundaries. An action-aware completeness regression (ACR) module is designed to generate an action completeness score map and provide reliable action completeness confidence. Finally, the soft-NMS algorithm is used to suppress overlapping results.

For visual feature encoding, we use the same strategy as that employed in [65]. During training, the DBG module is trained with a batch size of 8 and a learning rate of 0.001 for 10 epochs and then trained with a learning rate of 0.0001 for another 2 epochs. For the soft-NMS algorithm, we set the threshold θ to 0.8 and set the parameter ε in the Gaussian function to 0.75 for both datasets. We implement the algorithms using PyTorch on a pair of Nvidia GTX 1080Ti GPUs.

5.3.2 Datasets and metrics

ActivityNet-1.3 is a large-scale video action understanding dataset that can be utilized for various video tasks, such as action recognition, temporal detection, proposal generation, and dense captioning. For temporal proposal generation, the ActivityNet-1.3 dataset contains 19,994 temporally annotated, untrimmed videos with 200 action categories, and the videos are divided into training, validation and testing sets at a ratio of 2:1:1.

A proposal is regarded as a true positive if it has a temporal intersection-over-union (tIoU) with a ground-truth segment that is greater than or equal to a given threshold. The average recall (AR) is defined as the mean of all recall values at IoU thresholds between 0.5 and 0.95 with a step size of 0.05. The average number of proposals (AN) is defined as the total number of proposals divided by the number of videos in the testing subset. We consider 100 bins for the AN. We evaluate the AR against the AN on the datasets to assess the relation between recall and the number of proposals, which is represented as AR@AN. Moreover, the area under the AR versus AN curve (AUC) is also used as a metric, where the AN varies from 0 to 100 with a step size of 1.
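Under these definitions, AR@AN and the AUC can be computed as sketched below. This is an illustrative implementation, not the official ActivityNet evaluation toolkit, and it normalizes the AUC simply as the mean of the AR curve expressed as a percentage.

```python
import numpy as np

def t_iou(a, b):
    """Temporal IoU of two segments (start, end)."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def recall_at(proposals, ground_truth, tiou_thr):
    """Fraction of ground-truth segments matched by at least one proposal."""
    hit = sum(1 for gt in ground_truth
              if any(t_iou(p, gt) >= tiou_thr for p in proposals))
    return hit / max(len(ground_truth), 1)

def ar_at_an(videos, an):
    """Average recall when each video keeps its top `an` proposals.

    `videos` is a list of (ranked_proposals, ground_truth_segments) pairs;
    recall is averaged over tIoU thresholds 0.5:0.05:0.95 and over videos.
    """
    thresholds = np.arange(0.5, 1.0, 0.05)
    recalls = [np.mean([recall_at(props[:an], gts, t) for t in thresholds])
               for props, gts in videos]
    return float(np.mean(recalls))

def auc_of_ar(videos, max_an=100):
    """Area under the AR-versus-AN curve, with AN varying from 1 to `max_an`."""
    curve = [ar_at_an(videos, an) for an in range(1, max_an + 1)]
    return float(np.mean(curve)) * 100.0   # reported as a percentage
```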
5.3.3 Evaluation and analysis

By calculating the average recall with different numbers of proposals (AR@AN) and the area under the curve, we evaluate the abilities of the BSN model and the DBG model to generate proposals with high recall on the two datasets. We enumerate the comparison results between ActivityNet-1.3 and our dataset using the BSN model and the DBG model in Table 7. The results show that generating proposals on our dataset is more difficult than on ActivityNet-1.3. As mentioned in Sect. 4.2, our foreground and background are visually similar, and this hinders the algorithms in generating proposals and places high requirements on algorithmic research. We show some failure cases of the baseline models in the temporal detection module in Fig. 14. As analyzed above, because the foreground and background are very similar, the algorithms are prone to errors when detecting boundaries. In addition, Table 8 shows the performance comparison between the two models, which provides a benchmark for subsequent research.

Table 7 Comparison between ActivityNet-1.3 and our dataset on the BSN model and the DBG model in terms of the AR@AN and AUC (%)

        ActivityNet-1.3           Our dataset
        AR@100     AUC (val)      AR@100     AUC (val)
BSN     74.38      66.18          54.67      40.80
DBG     76.65      68.57          54.77      41.77


Fig. 14 The failure cases of the BSN model and the DBG model on our dataset

Table 8 Performance comparison of the BSN model and the DBG model on our dataset

        Parameters   Train time (h)   Test time (s)   AR@100   AUC (val)
BSN     1,412,100    4.3              175.43          54.67    40.80
DBG     2,874,758    4.1              447.83          54.77    41.77

5.4 Video captioning

5.4.1 Model and setting

The models utilized for the baseline are the reconstruction network (RecNet) [67] and HACA [68].

RecNet model

The model consists of three parts: an encoder, a decoder, and a reconstructor. The main purpose of the encoder is to extract the characteristics of videos. The decoder employs long short-term memory (LSTM) to convert video features into sentences. To further exploit the global temporal information of the video, a temporal attention mechanism is employed to encourage the decoder to select the most important information. The reconstructor is built on top of the encoder-decoder combination to reproduce the video features from the sentences generated by the decoder. Via this reconstruction process, the decoder is encouraged to embed more information from the video. Therefore, the relationship between the video sequence and the caption can be further enhanced, thereby improving the video captioning performance.

For the encoder, we extract 28 equally spaced features from each video; if the number of features is less than 28, we apply zero vectors to pad them. The frames are resized to 299 × 299 and fed into Inception-V4, which is pretrained on ILSVRC-2012-CLS, and 1536-dimensional semantic features of each frame are extracted from the last pooling layer. The maximum length of a sentence is limited to 30 words. Each word is embedded into a 468-dimensional vector. For the decoder, the input dimension is 468, while the hidden layer contains 512 units. For the reconstructor, the inputs are the hidden states of the decoder and thus have a dimension of 512, and the dimension of the hidden layer is set to 1536. During training, AdaDelta is employed for optimization. During testing, a beam search with a size of 5 is performed for the final caption generation. We implement the algorithms using PyTorch on a single Nvidia GTX 1080Ti GPU.
mation of the video, a time attention mechanism is
employed to encourage the decoder to select the most HACA model
important information. The reconstructor is built on top of The HACA model is an encoder-decoder framework
the encoder-decoder combination to reproduce the video comprising multiple hierarchical recurrent neural networks.
features from the sentences generated by the decoder. Via Specifically, in the encoding stage, the model has one
this reconstruction process, the decoder is encouraged to hierarchical attentive encoder for each input modality, and
embed much information from the video. Therefore, the the encoder learns and outputs both the local and global
relationship between the video sequence and the caption representations of the modality.
can be further enhanced, thereby improving the video In the decoding stage, the model employs two cross-
captioning performance. modal attentive decoders: the local decoder and the global
For the encoder, we extract the 28 equally spaced fea- decoder. The global decoder attempts to align the global
tures from one video; if the number of features is less than contexts of different modalities and learn the global cross-
28, we apply zero vectors to pad them and then feed them modal fusion context. Correspondingly, the local decoder
into Inception-V4, which is pretrained on the ILSVRC- learns a local cross-modal fusion context, combines it with


the output from the global decoder, and predicts the next Compared with the other two datasets, the description in
word. our dataset is more reflective of the content in the video.
The maximum number of frames is 40. For the visual We enumerate the comparison results of the three datasets
hierarchical attentive encoders, the low-level encoder is a using the RecNet model and the HACA model in Table 9.
bidirectional LSTM with a hidden layer dimension of 512, Our dataset is more complex and realistic than other
and the high-level encoder is an LSTM with a hidden layer datasets, and this makes the algorithm more demanding.
dimension of 256 and a chunk size s of 10. The global Figure 16 shows some failure cases of the models on our
decoder is an LSTM with a hidden layer dimension of 256, dataset. The first video is about playing with mobile
and the local decoder is an LSTM with a hidden layer phones, but it is described as ‘‘taking notes’’, and the
dimension of 1024. The maximum step size of the decoders second video does not include yawning, but it is described
is 16. We use word embeddings of size 512. Moreover, we as containing this behavior. Due to the special character-
adopt the dropout procedure with a value of 0.5 for regu- istics of the classroom scene and the length of the caption,
larization. The AdaDelta optimizer is used with a batch the algorithm is likely to lose important information during
size of 64. The learning rate is initially set as 1 and then the captioning process. In addition, Table 10 shows the
reduced by a factor of 0.5 when the current CIDEr score performance comparison between the two models, which
does not surpass the previous best for the previous 4 provides a benchmark for subsequent research.
epochs. The maximum number of epochs is set as 50. We
implement the algorithms using Pytorch on a single Nvidia
GTX 1080Ti GPU. 6 Conclusion

5.4.2 Datasets and metrics

In this subsection, our dataset is compared with two extensively employed video captioning benchmark datasets: the MSVD dataset and the MSR-VTT dataset. MSVD is an open-domain dataset for video captioning that covers various topics, including sports, animals, and music. The dataset contains 1970 YouTube video clips, each labeled with approximately 40 English descriptions collected by AMT. We split the dataset into contiguous groups of videos by index number: 1200 videos for training, 100 videos for validation, and 670 videos for testing. The MSR-VTT dataset is a set of 10,000 video clips (6513 video clips for training, 497 video clips for validation, and the remaining 2990 video clips for testing) from a commercial video search engine. The dataset spans twenty categories, such as music, sports, and movies, and each clip has 20 human-annotated reference captions.

Metrics: We use several standard automated evaluation metrics: METEOR [69], BLEU-4 [70], CIDEr-D [71], and ROUGE-L [72]. All the metrics are computed by using the Microsoft COCO evaluation server [59]. Due to the limited space in this paper, the specific definitions and calculation methods of these metrics can be found in [59].
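For reference, the sketch below shows how these four metrics can be computed offline with the open-source pycocoevalcap package, which provides the same scorers as the COCO evaluation server. The function name and input format are illustrative conventions for this example rather than part of our released code, and the captions are assumed to be pre-tokenized.

```python
# Sketch of offline metric computation with the pycocoevalcap package
# (the paper reports scores obtained from the Microsoft COCO evaluation server [59]).
from pycocoevalcap.bleu.bleu import Bleu
from pycocoevalcap.meteor.meteor import Meteor
from pycocoevalcap.rouge.rouge import Rouge
from pycocoevalcap.cider.cider import Cider

def evaluate_captions(references, hypotheses):
    """references, hypotheses: dicts mapping a video id to a list of caption strings
    (one generated caption per id, possibly several references per id)."""
    scorers = [
        (Bleu(4), ["BLEU-1", "BLEU-2", "BLEU-3", "BLEU-4"]),
        (Meteor(), "METEOR"),
        (Rouge(), "ROUGE-L"),
        (Cider(), "CIDEr"),
    ]
    results = {}
    for scorer, name in scorers:
        score, _ = scorer.compute_score(references, hypotheses)
        if isinstance(name, list):        # Bleu returns one score per n-gram order
            results.update(dict(zip(name, score)))
        else:
            results[name] = score
    return results

# example: evaluate_captions({"vid1": ["a student is raising a hand"]},
#                            {"vid1": ["a student raises a hand"]})
```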
5.4.3 Evaluation and analysis

Figure 15 provides some example results for the test sets of the three datasets. An excessively short description may neither accurately describe a video nor express its content in rich language. The results show that our dataset produces more complex sentence descriptions that are closer to the human level than those of the other datasets. Compared with the other two datasets, the descriptions in our dataset are more reflective of the content of the videos. We enumerate the comparison results of the three datasets using the RecNet model and the HACA model in Table 9. Our dataset is more complex and realistic than the other datasets, which places greater demands on the algorithms. Figure 16 shows some failure cases of the models on our dataset. The first video shows a student playing with a mobile phone, but it is described as "taking notes", and the second video does not include yawning, but it is described as containing this behavior. Due to the special characteristics of the classroom scene and the length of the captions, the algorithms are likely to lose important information during the captioning process. In addition, Table 10 shows the performance comparison between the two models, which provides a benchmark for subsequent research.

6 Conclusion

Analyzing the learning and mental states of students through their behavioral information is important for teaching evaluations and for designing improved instruction plans. However, it is tedious and costly to record students' in-class behaviors through traditional manual methods, such as field observations and questionnaire surveys. It would be truly remarkable if AI technology could provide automatic and accurate detection and description of students' classroom behaviors to provide a basis for teaching evaluations and improving instruction quality. Unfortunately, the lack of corresponding datasets limits the study and application of AI technology in the education field. Based on these findings, this study constructed a comprehensive dataset that can be used to recognize, detect, and caption student classroom behaviors, which can provide a data foundation for AI research in the field of teaching. Moreover, we also provided a baseline for each module of the dataset.

Compared with other datasets, our work makes four main contributions: (1) To the best of our knowledge, the proposed dataset fills the gap in student classroom behavior research under teaching scenarios. (2) Classroom teaching is a complex scenario that introduces many new challenges to the algorithms and thus encourages people to explore improved approaches for meeting real-world needs. (3) The proposed dataset is a comprehensive dataset that covers recognition, detection, and captioning tasks, which can meet the research requirements of various tasks. (4) We provided a comprehensive analysis of the dataset in the paper and presented a baseline to verify our analysis. In addition, each baseline model is evaluated and compared in detail, thereby providing a multidimensional comparison for algorithms using this dataset.


Fig. 15 Examples of sentence generation using the RecNet method with the three different datasets

Table 9 Performance evaluation of different datasets with the RecNet model and the HACA model in terms of BLEU-4, METEOR, ROUGE-L, and CIDEr scores (%)

          MSVD                                  MSR-VTT                               Our dataset
          BLEU-4  METEOR  ROUGE-L  CIDEr        BLEU-4  METEOR  ROUGE-L  CIDEr        BLEU-4  METEOR  ROUGE-L  CIDEr
HACA      –       –       –        –            39.6    27.4    59.7     45.8         28.4    23.1    48.1     43.1
RecNet    52.3    34.1    69.8     80.3         39.1    26.6    59.3     42.7         34.7    25.1    50.3     43.5

Fig. 16 Some failure cases of the HACA model and the RecNet model on our dataset. Generated captions: "a girl in a grey jacket and glasses is taking notes with her head down."; "a girl in a black coat is listening and yawns."


Table 10 Performance comparison of the HACA model and the RecNet model on our dataset

          Parameters    Train time (h)   Test time (s)   BLEU-4   METEOR   ROUGE-L   CIDEr
HACA      25,103,345    4.9              468             28.4     23.1     48.1      43.1
RecNet    11,698,968    3                294             34.7     25.1     50.3      43.5
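For new baselines added to this comparison, the parameter counts and test times in Table 10 can be measured with a generic routine such as the following PyTorch sketch; model and test_loader are placeholders for a user's own objects rather than artifacts released with our dataset.

```python
# Generic PyTorch sketch for reporting the "Parameters" and "Test time" columns of Table 10.
import time
import torch

def count_parameters(model):
    # number of learnable parameters, as listed in the "Parameters" column
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

@torch.no_grad()
def test_time_seconds(model, test_loader, device="cuda"):
    model.eval().to(device)
    start = time.time()
    for batch in test_loader:            # one full pass over the test split
        model(batch.to(device))
    return time.time() - start
```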

Our plan for future research is as follows. With regard to the datasets, we will discuss and analyze how to provide a research basis for using AI to solve educational problems, such as by analyzing and building related datasets. Regarding the research methods, the baseline experiment results show that neither the efficiency nor the effectiveness of current algorithms can meet the requirements of practical applications. Therefore, we will actively study and address the current problems of algorithms using this dataset such that it can be applied to improve the quality of education.

Compliance with ethical standards

Conflict of interest The authors declare that they have no conflicts of interest with regard to this work. We declare that we do not have any commercial or associative interest that represents a conflict of interest in connection with the work submitted.
References

1. Wang L, Qiao Y, Tang X (2015) Action recognition with trajectory-pooled deep-convolutional descriptors. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 4305–4314
2. Wang X, Girshick R, Gupta A, He K (2018) Non-local neural networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 7794–7803
3. Wang Y, Long M, Wang J, Yu PS (2017) Spatiotemporal pyramid network for video action recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 1529–1538
4. Liu W, Anguelov D, Erhan D, Szegedy C, Reed S, Fu C-Y, Berg AC (2016) SSD: single shot multibox detector. In: European conference on computer vision. Springer, pp 21–37
5. Lu X, Li B, Yue Y, Li Q, Yan J (2019) Grid R-CNN. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 7363–7372
6. Redmon J, Divvala S, Girshick R, Farhadi A (2016) You only look once: unified, real-time object detection. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 779–788
7. Ren S, He K, Girshick R, Sun J (2016) Faster R-CNN: towards real-time object detection with region proposal networks. IEEE Trans Pattern Anal Mach Intell. arXiv:1506.01497
8. Berclaz J, Fleuret F, Fua P (2006) Robust people tracking with global trajectory optimization. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 744–750
9. Breitenstein MD, Reichlin F, Leibe B, Koller-Meier E, Gool L (2009) Robust tracking-by-detection using a detector confidence particle filter. In: Proceedings of the IEEE international conference on computer vision, pp 1515–1522
10. Defferrard M, Bresson X, Vandergheynst P (2016) Convolutional neural networks on graphs with fast localized spectral filtering. In: Advances in neural information processing systems, pp 3844–3852
11. Ma L, Lu Z, Shang L, Li H (2015) Multimodal convolutional neural networks for matching image and sentence. In: Proceedings of the IEEE international conference on computer vision, pp 2623–2631
12. Wang J, Liu W, Kumar S, Chang S (2016) Learning to hash for indexing big data: a survey. Proc IEEE 104(1):34–57
13. Liu W, Zhang T (2016) Multimedia hashing and networking. IEEE Multimed 23:75–79
14. Song J, Gao L, Liu L, Zhu X, Sebe N (2018) Quantization-based hashing: a general framework for scalable image and video retrieval. Pattern Recogn 75:175–187
15. Wang J, Zhang T, Song J, Sebe N, Shen H (2018) A survey on learning to hash. IEEE Trans Pattern Anal Mach Intell 40(4):769–790
16. Haijun Z, Yuzhu J, Wang H, Linlin L (2019) Sitcom-star-based clothing retrieval for video advertising: a deep learning framework. Neural Comput Appl. https://doi.org/10.1007/s00521-018-3579-x
17. Ma L, Lu Z, Li H (2016) Learning to answer questions from image using convolutional neural network. Assoc Adv Artif Intel 3:16
18. Soomro K, Zamir AR, Shah M (2012) UCF101: a dataset of 101 human actions classes from videos in the wild. arXiv:1212.0402
19. Kuehne H, Jhuang H, Garrote E, Poggio TA, Serre T (2011) HMDB: a large video database for human motion recognition. In: Proceedings of the IEEE international conference on computer vision, pp 2556–2563
20. Chen D, Dolan WB (2011) Collecting highly parallel data for paraphrase evaluation. In: Proceedings of the 49th annual meeting of the association for computational linguistics: human language technologies, pp 190–200
21. Gu C, Sun C, Ross DA, Vondrick C, Pantofaru C, Li Y, Vijayanarasimhan S, Toderici G, Ricco S, Sukthankar R (2018) AVA: a video dataset of spatio-temporally localized atomic visual actions. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 6047–6056
22. Wang X, Wu J, Chen J, Li L, Wang YF, Wang WY (2019) VATEX: a large-scale, high-quality multilingual dataset for video-and-language research. In: Proceedings of the IEEE international conference on computer vision, pp 4580–4590
23. Schuldt C, Laptev I, Caputo B (2004) Recognizing human actions: a local SVM approach. In: International conference on pattern recognition, pp 32–36
24. Gorelick L, Blank M, Shechtman E, Irani M, Basri R (2007) Actions as space-time shapes. IEEE Trans Pattern Anal Mach Intell 29(12):2247–2253


25. Marszalek M, Laptev I, Schmid C (2009) Actions in context. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 2929–2936
26. Over P, Fiscus J, Sanders G, Joy D, Quénot G (2013) TRECVID 2013: an overview of the goals, tasks, data, evaluation mechanisms and metrics. In: TRECVID 2013 workshop participants notebook papers. http://www-nlpir.nist.gov/projects/tvpubs/tv13.papers/tv13overview.pdf. Accessed 29 Dec 2020
27. Abu-El-Haija S, Kothari N, Lee J, Natsev P, Toderici G, Varadarajan B, Vijayanarasimhan S (2016) YouTube-8M: a large-scale video classification benchmark. arXiv:1609.08675
28. Goyal R, Kahou SE, Michalski V, Materzynska J, Westphal S, Kim H, Haenel V, Fruend I, Yianilos P, Mueller-Freitag M, Hoppe F, Thurau C, Bax I, Memisevic R (2017) The "something something" video database for learning and evaluating visual common sense. In: Proceedings of the IEEE international conference on computer vision, pp 5843–5851
29. Karpathy A, Toderici G, Shetty S, Leung T, Sukthankar R, Fei-Fei L (2014) Large-scale video classification with convolutional neural networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 1725–1732
30. Zhao H, Torralba A, Torresani L, Yan Z (2017) SLAC: a sparsely labeled dataset for action classification and localization. In: Proceedings of the IEEE conference on computer vision and pattern recognition. arXiv:1712.09374
31. Monfort M, Andonian A, Zhou B, Ramakrishnan K, Bargal SA, Yan T, Brown L, Fan Q, Gutfruend D, Vondrick C (2018) Moments in time dataset: one million videos for event understanding. IEEE Trans Pattern Anal Mach Intell. arXiv:1801.03150
32. Kay W, Carreira J, Simonyan K, Zhang B, Zisserman A (2017) The kinetics human action video dataset. In: Proceedings of the IEEE conference on computer vision and pattern recognition. arXiv:1705.06950
33. Caba Heilbron F, Escorcia V, Ghanem B, Carlos Niebles J (2015) ActivityNet: a large-scale video benchmark for human activity understanding. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 961–970
34. Idrees H, Zamir AR, Jiang Y-G, Gorban A, Laptev I, Sukthankar R, Shah M (2017) The THUMOS challenge on action recognition for videos "in the wild". Comput Vis Image Underst 155:1–23
35. Yeung S, Russakovsky O, Jin N, Andriluka M, Mori G, Fei-Fei L (2018) Every moment counts: dense detailed labeling of actions in complex videos. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 375–389
36. Sigurdsson GA, Varol G, Wang X, Farhadi A, Laptev I, Gupta A (2016) Hollywood in homes: crowdsourcing data collection for activity understanding. In: European conference on computer vision. Springer, pp 510–526
37. Ke Y, Sukthankar R, Hebert M (2005) Efficient visual event detection using volumetric features. In: Proceedings of the IEEE international conference on computer vision, pp 166–173
38. Yuan J, Liu Z, Wu Y (2009) Discriminative subvolume search for efficient action detection. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 2442–2449
39. Rodriguez MD, Ahmed J, Shah M (2008) Action MACH: a spatio-temporal maximum average correlation height filter for action recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 1–8
40. Jhuang H, Gall J, Zuffi S, Schmid C, Black MJ (2013) Towards understanding action recognition. In: Proceedings of the IEEE international conference on computer vision, pp 3192–3199
41. Weinzaepfel P, Martin X, Schmid C (2016) Towards weakly-supervised action localization. arXiv:1605.05197
42. Mettes P, Van Gemert JC, Snoek CG (2016) Spot on: action localization from pointly-supervised proposals. In: European conference on computer vision. Springer, pp 437–453
43. Rohrbach M, Amin S, Andriluka M, Schiele B (2012) A database for fine grained activity detection of cooking activities. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 1194–1201
44. Das P, Xu C, Doell RF, Corso JJ (2013) A thousand frames in just a few words: lingual description of videos through latent topics and sparse object stitching. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 2634–2641
45. Rohrbach M, Regneri M, Andriluka M, Amin S, Pinkal M, Schiele B (2012) Script data for attribute-based recognition of composite activities. In: European conference on computer vision. Springer, pp 144–157
46. Rohrbach A, Rohrbach M, Qiu W, Friedrich A, Pinkal M, Schiele B (2014) Coherent multi-sentence video description with variable level of detail. In: German conference on pattern recognition. Springer, pp 184–195
47. Zhou L, Xu C, Corso J (2018) Towards automatic learning of procedures from web instructional videos. In: Association for the advancement of artificial intelligence, pp 7590–7598
48. Rohrbach A, Rohrbach M, Tandon N, Schiele B (2015) A dataset for movie description. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 3202–3212
49. Torabi A, Pal C, Larochelle H, Courville A (2015) Using descriptive video services to create a large data source for video annotation research. arXiv:1503.01070
50. Xu J, Mei T, Yao T, Rui Y (2016) MSR-VTT: a large video description dataset for bridging video and language. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 5288–5296
51. Krishna R, Hata K, Ren F, Fei-Fei L, Carlos Niebles J (2017) Dense-captioning events in videos. In: Proceedings of the IEEE international conference on computer vision, pp 706–715
52. Zhou L, Kalantidis Y, Chen X, Corso JJ, Rohrbach M (2019) Grounded video description. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 6578–6587
53. Gella S, Lewis M, Rohrbach M (2018) A dataset for telling the stories of social media videos. In: Proceedings of the 2018 conference on empirical methods in natural language processing, pp 968–974
54. Aafaq N, Mian A, Liu W, Gilani SZ, Shah M (2019) Video description: a survey of methods, datasets, and evaluation metrics. ACM Comput Surv 52(6):1–37
55. Zeng K-H, Chen T-H, Niebles JC, Sun M (2016) Title generation for user generated videos. In: European conference on computer vision. Springer, pp 609–625
56. Wei Q, Sun B, He J, Yu LJ (2017) BNU-LSVED 2.0: spontaneous multimodal student affect database with multi-dimensional labels. Signal Process Image Commun 59:168–181
57. Wang Z, Pan X, Miller KF, Cortina KS (2014) Automatic classification of activities in classroom discourse. Comput Educ 78:115–123
58. Sun B, Wei Q, He J, Yu L, Zhu X (2016) BNU-LSVED: a multimodal spontaneous expression database in educational environment. In: Optics and photonics for information processing X, International Society for Optics and Photonics, p 997016
59. Lin TY, Maire M, Belongie S, Hays J, Perona P, Ramanan D, Dollár P, Zitnick CL (2014) Microsoft COCO: common objects in context. In: European conference on computer vision. Springer, pp 740–755
60. Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. In: Advances in neural information processing systems, pp 568–576


61. Ulutan O, Rallapalli S, Srivatsa M, Torres C, Manjunath B (2020) Actor conditioned attention maps for video action detection. In: The IEEE winter conference on applications of computer vision, pp 527–536
62. Li Y, Wang Z, Wang L, Wu G (2020) Actions as moving points. In: Proceedings of the European conference on computer vision. arXiv:2001.04608
63. Yu F, Wang D, Shelhamer E, Darrell T (2018) Deep layer aggregation. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 2403–2412
64. Gkioxari G, Malik J (2015) Finding action tubes. In: Proceedings of the IEEE international conference on computer vision, pp 759–768
65. Lin T, Zhao X, Su H, Wang C, Yang M (2018) BSN: boundary sensitive network for temporal action proposal generation. In: Proceedings of the European conference on computer vision, pp 3–19
66. Lin C, Li J, Wang Y, Tai Y, Luo D, Cui Z, Wang C, Li J, Huang F, Ji R (2020) Fast learning of temporal action proposal via dense boundary generator. In: Association for the advancement of artificial intelligence. https://doi.org/10.22648/ETRI.2020.J.350303
67. Wang B, Ma L, Zhang W, Liu W (2018) Reconstruction network for video captioning. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 7622–7631
68. Wang X, Wang YF, Wang WY (2018) Watch, listen, and describe: globally and locally aligned cross-modal attentions for video captioning. In: Proceedings of the 2018 conference of the North American chapter of the association for computational linguistics: human language technologies, pp 795–801
69. Denkowski M, Lavie A (2014) Meteor universal: language specific translation evaluation for any target language. In: Proceedings of the ninth workshop on statistical machine translation, pp 376–380
70. Papineni K, Roukos S, Ward T, Zhu W-J (2002) BLEU: a method for automatic evaluation of machine translation. In: Proceedings of the 40th annual meeting of the association for computational linguistics, pp 311–318
71. Vedantam R, Lawrence Zitnick C, Parikh D (2015) CIDEr: consensus-based image description evaluation. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 4566–4575
72. Lin C-Y (2004) ROUGE: a package for automatic evaluation of summaries. In: Text summarization branches out, pp 74–81

Publisher's Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
