Rollback Based Active Semi-Supervised Learning for Action Recognition
August 2019
Advisor: Professor Phill Kyu Rhee
Doctoral Thesis
August 2019
A Dissertation
Submitted to the Department of Computer Engineering and the Graduate School
of Inha University in partial fulfillment of the requirements for the degree of
Doctor of Philosophy
This certifies that the dissertation of
MINHAZ UDDIN AHMED is approved
August 2019
Dissertation Committee
Abstract
This thesis studies the problem of recognizing human actions from video and still images. Human
action recognition from video covers two scenarios: single-person action recognition and actions
involving multiple people. A human action can be a physical body movement, such as walking or
reading, or an interaction with the environment or with an object for a specific purpose, such as
using a computer or taking photographs. In addition, human action recognition identifies the
present state of a particular person and location. This research is motivated by many emerging
real-world applications.
To recognize human actions, we present a method called the rollback based active semi-supervised
learning algorithm, which focuses mainly on single-person action classification and two-person
interaction recognition. Labeling human actions in a large volume of video frames is expensive and
time consuming. Active learning combined with semi-supervised learning can reduce the labeling time
when classifying human actions in video and images. Human action recognition is challenging because
of complex backgrounds, object scale, variations in pose, and the misinterpretation of actions. To
overcome these challenges, we prepare our dataset in two categories: tentatively labeled data, which
exploits a number of pre-trained network models, and batch labeled data for training and testing.
Our proposed human action recognition method uses an ensemble of multiple deep convolutional neural
network (CNN) models for training, as discussed in detail in Chapter-3 and Chapter-4. The integrated
ensemble learning, together with the active semi-supervised learning method, helps to improve the
overall human action recognition accuracy.
We study a limited number of actions for both single-person action recognition and two-person
interaction recognition, such as using a computer, jumping, reading, phoning, taking a photograph,
hugging, fighting, linking arms, talking, and kidnapping, in two environments that range from simple
to complex. From the Intelligent Technology Laboratory (ITLab), Inha University, we gather two types
of data: simple environment data (with clutter-free backgrounds) and complex environment data (with
cluttered, complex backgrounds). We achieve 95.6% accuracy for simple environment data and 81%
accuracy for complex environment data. We also conduct extensive experiments on human action
recognition benchmark datasets and obtain better performance than other state-of-the-art
approaches. We show that our approach is a state-of-the-art method that requires the joint
processing of several learning algorithms for successful human action recognition. The detailed
experimental results are presented in Chapter-5.
Acknowledgements
It has been my very great privilege to know and work with my advisor Professor Phill Kyu Rhee in
various capacities over the past seven years. I would like to thank Professor Phill Kyu Rhee for his
great guidance and timely support. I am grateful to have had Professor Phill Kyu Rhee as an advisor.
I would like to thank Dr. Md Rezaul Bashar, who has given me excellent guidance throughout my
Ph.D. studies. I would like to thank Dr. Musarrat Hasan, Dr. S. M. Riazul Islam, Dr. Abol Ghasem, and
Chad Pozsgay for their encouragement and support during my time of study.
I am grateful to Kim Jin Woo, Kim Yeong Hyeon, Han Shibo and other Intelligent Technology Lab
members, Department of Computer Engineering, Inha University for providing me support on many
occasions. I would like to express my appreciation to all of my friends and lab members for their
support.
I am indebted to my parents for their sacrifices for me. The unconditional love and support of my
younger brothers and sister have helped me to move forward during my Ph.D. studies in Korea.
I have to thank many people, especially the Incheon Talk House members and the Juan Rotary Club
members for their love and kindness towards me. They have made my experience at Inha so pleasant.
Table of Contents
Abstract
Acknowledgements
Table of Contents
List of Tables
List of Figures
1 Introduction
1.3 Motivations
2 Related Work
2.8.2 Inception
2.9 Optimizer
2.9.1 Adam
2.9.2 RMSprop
2.11 Summary
3.7 Summary
4.3.1 Person Detection
5.1.6 Evaluation
5.2.5 Evaluation
5.3.2 Comparison with the state-of-the-art methods on the KTH action dataset
5.3 Summary
6 Conclusion and Future Research
List of Tables
5.1 Number of images used for training on the IT Lab Action dataset
5.2 Performance with five different actions from the PASCAL VOC 2012 dataset
5.3 Comparison between the IT Lab dataset and the Stanford 40 Actions dataset
5.4 Recent methods and performance results on the Stanford 40 dataset
List of Figures
Figure 2.1 A sample human action classification where a person is using a computer
Figure 2.2 A convolutional neural network. The network takes a single input image and computes N
outputs. Each output is linked with a diverse loss function
Figure 5.1 Five example action categories: (a) to (e), respectively, jumping, phoning, taking a
photo, using a computer, and reading, from the PASCAL VOC 2012 action recognition dataset. These
classes illustrate human actions strongly trained on a pose
Figure 5.2 Five example action categories: (a) to (e), respectively, jumping, phoning, reading,
taking a photo, and using a computer, from the Stanford 40 action dataset. Amongst 40 actions, five
actions are selected
Figure 5.5 Human action recognition performance on the IT Lab dataset, which uses both
semi-supervised learning (SSL) and active learning (AL) steps
Figure 5.6 The number of epochs with training and validation errors
Figure 5.8 Human action performance analysis comparing the proposed Effective Adaptive Deep Learning
with the baseline results on the PASCAL VOC action dataset
Figure 5.10 Example from the HMDB dataset where the action categories are punching and hugging
Figure 5.11 Two-person action in simple environments of the ITLab action dataset
Figure 5.13 Two-person action in the ITLab dataset with a cluttered environment
Figure 5.17 The information flow of the rollback based EER-ASSL and SSL
Figure A.3 Using the active learning tool for human action recognition. Here the action is using a
computer
Figure A.4 Training a human action dataset using the matconvnet library [38] based ASSL tool. Here
the first dataset shows poor performance
Figure A.5 Training performance of four sets of data, where set 1 and set 2 produce the same result
and set 4 produces a poor result, leading to a rollback to set 3
Figure A.6 Training performance after incremental learning. Here the performance reaches saturation
Chapter 1
Introduction
Human action recognition is an important problem because it can identify real-life personal interactions,
personal movement, and particular tasks carried out by a person, which facilitates better decision
making. Still pictures or video contain a significant amount of information on human actions.
Interpreting and understanding these actions from images helps to resolve real-life problems, such as
surveillance video analysis for safety, elderly patient monitoring, sports video analysis, detection of
can easily understand the activity of a person simply by observing it with our own eyes. However, it
is very expensive to use human labor to monitor human actions in real-world scenarios for long
periods, for example in visual surveillance. Therefore, the main objective of many researchers is to
develop a machine that can recognize human actions and serve us. Human actions can be classified
into several types: simple body movements, such as running, reading, walking, and jumping;
interactions with the environment or with an object for a specific purpose, such as using a
computer, taking a photograph, phoning, or playing an instrument; and complex body movements
involving multiple people. Human action recognition is a very challenging research area because of a
number of factors, such as pose variation, complex backgrounds, object scale, occlusion, camera
motion, misinterpretation of actions, and the large number of unlabeled action datasets. Currently,
human action recognition is a very popular topic because of numerous real-world applications, such
as security surveillance at airports, bus stations, and railway stations; content-based browsing,
such as fast-forwarding to find important scenes; monitoring elderly people, especially in
hospitals; and video recycling, such as removing video content that is harmful to children.
Image classification problems such as human action recognition have gained substantial attention
from researchers over the past two decades. During this period, classical machine learning
approaches have been proposed to solve these problems, such as probabilistic modeling (Naïve Bayes),
early neural networks, kernel methods (Support Vector Machine), and decision trees. From the
beginning, these algorithms exhibited state-of-the-art performance for simple classification tasks [1].
Before the ImageNet challenge [1] emerged, these handcrafted methods were sufficient to handle the
early classification problems. However, these methods have proved hard to scale up to large
datasets. After a few years of ImageNet challenges [1], deep learning methods replaced many of these
algorithms, such as SVM and the decision tree, in a wide area of applications. Many image
classification challenges, such as MS COCO [2] and ImageNet, provide a large amount of training,
testing, and validation data to solve large-scale image classification problems.
Recognizing human actions from video in real time requires faster strategies, mainly in security
surveillance, the sports industry, and healthcare systems. Accurate video data analysis has emerged
as an important research area with a number of real-time applications. Human action recognition from
video data consists of two tasks. The first is human detection and tracking. The second is
classifying the action of the detected person. Once the first task is done and the person is
successfully detected, other information such as pose and motion determines the particular action
that the person might be going to take. Thus the combination of multiple pieces of information helps
us to determine the particular action
Previously, human activity recognition was mostly based on a fixed camera positioned at a specific
angle. In that situation the viewpoint of the person and the background were usually fixed, so the
background subtraction method worked well because of its simplicity. Usually, the background image
is subtracted from each frame of the video sequence in order to obtain the foreground information.
However, the major problem of background subtraction is that it is sensitive to illumination
changes. Moreover, with a moving camera, object segmentation is more challenging because of the
motion of both the target object and the camera. Because the background changes frequently, the
background model is not appropriate for segmentation. One of the popular methods for a moving camera
is to find the temporal difference between image frames [3]. It is also necessary to calculate the
pixel-level motion between two images. Optical flow, which approximates the pixel-level motion
between two images, can be used to estimate the motion of a moving camera [4]. In order to segment
the person from video data, a number of features are considered, such as shape, color, motion,
space-time information, frequency transformations, local descriptors, and body modeling [5]. One of
the major problems of space-time information is that it is only available for non-periodic human
activity recognition. Moreover, a space-time volume usually considers the entire image; therefore,
its robustness to viewpoint change and occlusion is limited [6]. On the other hand, the
scale-invariant feature transform (SIFT) [7] and the histogram of oriented gradients (HOG) [8] are
ideally invariant to background clutter, appearance, occlusion, rotation, and scale in certain
cases, but do not fully capture whole-body actions. As a result, many researchers proposed human
modeling methods in which 2D and 3D body models are measured. For body modeling, conventional
optical systems commonly capture human motion using optical markers worn by the user. The markers
help to locate the positions of the human body parts to which they are attached. In order to remove
the effects of occlusion, a number of additional cameras are installed at different locations to
ensure full coverage around the human body. Here, the total number of cameras can reach several
hundred, which is expensive [9].
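The background subtraction and temporal differencing steps described above can be illustrated with
a minimal sketch. It is not the implementation used in this work: the function name
difference_mask, the threshold of 25, and the toy 4x4 frames are assumptions chosen only for
illustration. The sketch thresholds the per-pixel absolute difference between two grayscale frames
and shows why a global illumination change defeats a fixed background model.

```python
def difference_mask(frame_a, frame_b, threshold=25):
    """Return a binary mask marking pixels whose absolute intensity
    difference between two grayscale frames exceeds the threshold."""
    return [
        [1 if abs(a - b) > threshold else 0 for a, b in zip(row_a, row_b)]
        for row_a, row_b in zip(frame_a, frame_b)
    ]


# Toy 4x4 grayscale frames (illustrative values only).
background = [[100] * 4 for _ in range(4)]

# A bright 2x2 object enters the scene.
frame = [row[:] for row in background]
frame[1][1] = frame[1][2] = frame[2][1] = frame[2][2] = 200

# Background subtraction: compare the frame with a fixed background image.
foreground = difference_mask(frame, background)

# A uniform illumination change of +40 flags every pixel as foreground.
brighter = [[pixel + 40 for pixel in row] for row in background]
false_foreground = difference_mask(brighter, background)

# Temporal differencing: compare consecutive frames instead of a fixed
# background. Here the object has moved one pixel to the right.
next_frame = [row[:] for row in background]
next_frame[1][2] = next_frame[1][3] = next_frame[2][2] = next_frame[2][3] = 200
motion = difference_mask(next_frame, frame)
```

In this toy case, foreground marks only the four object pixels, false_foreground marks all sixteen
pixels after the lighting change, and motion marks the four pixels that the object has left or
entered.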
One of the common temporal classification methods is Dynamic Time Warping (DTW), which measures the
similarity between two temporal sequences. It is popular for its simplicity but is not suitable for
a large number of human action classes with many variations [10]. There are also probability based
methods, such as the Hidden Markov Model (HMM) [11] and the Dynamic Bayesian Network (DBN) [12], and
discriminative models, such as the Support Vector Machine (SVM), whose performance depends heavily
on an extensive training dataset [13]. A number of methods have been applied to recognize human
actions. However, most of these methods, such as the histogram of oriented gradients (HOG) [8], Bag
of Features (BOF) [14], and the space–time interest point (STIP) [15], are
recent years, the deep convolutional neural network (CNN) has been used extensively for computer
vision applications, especially human action detection [16], [17], because of its higher accuracy
and its applicability to real-life problems. A deep CNN learns features automatically, and these
features perform better than hand-crafted features. Building a deep CNN architecture from scratch is
a challenging task; however, the transfer learning [18] technique enables information transfer,
which reduces the burden and time of rebuilding the architecture. Nevertheless, producing positive
transfer between appropriately related tasks, while avoiding negative transfer between less related
tasks, is an important issue. To overcome this challenge, researchers select accurate
state-of-the-art models from the ImageNet [1] challenges, such as AlexNet [19] and VGGNet [20],
which are already trained on thousands of objects with high accuracy and low error rates. These
pre-trained models improve object recognition performance over previous approaches, such as the
Deformable Part Model [21].
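As an illustration of the Dynamic Time Warping distance mentioned above, the following minimal
sketch computes the classic dynamic-programming DTW cost between two one-dimensional sequences. The
function name dtw_distance and the use of the absolute difference as the local cost are assumptions
made for illustration; real action sequences would use per-frame feature vectors and a suitable
vector distance.

```python
def dtw_distance(seq_a, seq_b):
    """Dynamic time warping distance between two 1-D sequences,
    using the absolute difference as the local cost."""
    n, m = len(seq_a), len(seq_b)
    inf = float("inf")
    # cost[i][j] is the best alignment cost of seq_a[:i] and seq_b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(seq_a[i - 1] - seq_b[j - 1])
            # Extend the cheapest of the three allowed predecessor alignments.
            cost[i][j] = local + min(
                cost[i - 1][j],      # seq_b[j-1] is matched again
                cost[i][j - 1],      # seq_a[i-1] is matched again
                cost[i - 1][j - 1],  # both sequences advance
            )
    return cost[n][m]


same_action_slower = dtw_distance([1, 2, 3], [1, 2, 2, 3])
different_action = dtw_distance([0, 0], [1, 1])
```

Two sequences that differ only in timing, such as [1, 2, 3] and [1, 2, 2, 3], have a DTW distance
of zero, which is why the measure tolerates actions performed at different speeds. The quadratic
table also hints at why the method scales poorly to many classes with many variations.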
Human action recognition is a very challenging task because activities are complex and diverse, and
a particular event can involve a single person or multiple people [22]. Many activities are highly
correlated with motion; therefore, the precise point of visual interest plays an important role.
Another major challenge is recognizing human actions in real time using a deep learning
architecture, because this requires continuous fine-tuning of the network in order to learn the
changing appearance of the target [23]. With the deep learning approach, spatial and temporal
signals often become coupled with each other through 3D convolution, and consequently the network
becomes hard to optimize because of its dozens of 3D convolution layers. Moreover, the memory cost
of 3D convolution can be unaffordably high when building a 3D CNN [24]. Further challenges in human
action recognition include large datasets with long videos, highly variable actions, partially
visible objects, varying ages, and unpredictable behavior. A common human action differs from person
to person. These types of human action involve confusion, fear, or basic misunderstandings of
emotions and effects. Recognizing these verbal and non-verbal interactions is a challenging task
[25]. Another important challenge is that a model trained on a local dataset does not adapt well to
the richness of images in the real world [25]. Socially intelligent robotics is a new and growing
area in which a robot can read people’s emotions, motivation, and other factors that affect
behavior. People usually know how to talk to seniors, react to an angry person or a naughty child,
or cheer up a friend. For social robots to adapt to people’s emotions, human action recognition is
necessary, and a particular person may be thinking about something, e.g. searching for an
1.3 Motivations
The objective of this dissertation is to build an effective system that exploits active and
semi-supervised learning in order to accurately identify a single person’s action or a two-person
interaction. Understanding people’s actions is valuable because of numerous applications, such as
surveillance, human-computer interaction, health care, sports and entertainment, filtering out
immoral video content, and scene understanding. All these applications require efficient action
recognition from video. Complex environments (crowded places, such as markets and bus stations) make
it difficult to recognize human actions. Other factors in real time include pose dissimilarity (for
instance,
interactions (e.g. using a phone or a camera), low image quality, occlusions, noisy backgrounds, and
poor lighting conditions. As discussed in section 1.2, most of the previous approaches, including
generative and discriminative models, underperformed, which motivates us to apply state-of-the-art
deep learning methods. The deep learning based approach together with active learning (AL) has
little in common with previous methods. There are many occasions where few labeled training points
are available but a large amount of unlabeled data is given. In that scenario, semi-supervised
learning (SSL) can play a big role. In this regard, we investigate how efficiently we can use
pre-trained models such as VGG [28] and Inception [29], optimizers such as Adam [30] and RMSprop
[31], the sample selection procedure, and parameter choices such as the number of training epochs,
the learning rate, and threshold values. The other goal of our work is to implement a competent
system that works well in a variety of environments ranging from simple to complex. Finally, we
expect an improvement in the recognition rate for single human action recognition, and two
1.4 Proposed Rollback Based Active Semi-supervised Learning
In this dissertation we develop a rollback based active semi-supervised learning method that can
concurrently select the best model among a set of models by forward learning or rollback learning in
order to achieve improved performance for human action recognition. A large training dataset with
properly annotated data affects the quality of human action recognition performance. Labeled action
recognition data is hard to obtain because labeling is costly and very time consuming. We address
this problem by using semi-supervised learning (SSL), which utilizes both labeled data (as in
supervised learning) and unlabeled data (as in unsupervised learning). In our work we separate the
dataset into two categories, such as tentatively labeled
Later we apply the active learning (AL) method to each batch training dataset. Because our proposed
method combines active learning (AL) and semi-supervised learning (SSL), we name our framework
active semi-supervised learning (ASSL); it recognizes single-person actions and two-person
interactions from single images and videos. A convolutional neural network architecture cannot
completely exploit a limited number of images. We propose a more refined approach that combines
semi-supervised learning (SSL) [32] and active learning (AL) [32][33] to tackle this problem. Active
semi-supervised learning (ASSL) takes advantage of both SSL and AL. This training process repeats
until the predictor reaches saturation. The ASSL method allows us to accurately determine the
classification outcome from misleading unlabeled data and to improve the existing model’s
performance by incrementally combining previous data. Labeling a huge number of images is very time
consuming and computationally expensive. Hence, our method selects the best
We train our model batch by batch: the batch datasets are trained sequentially, and the result of
each batch carries forward to the next. The outcome of the first batch helps us to anticipate the
result of the next batch through comparison. If one batch performs poorly compared with the previous
batch, the training dataset is examined for erroneous data. After careful observation, our approach
exposes the misclassified data, and the wrongly posed or noisy images that increase the likelihood
of poor performance are eliminated. With the aid of active learning, we get rid of noisy images
either by labeling them correctly or by simply replacing them with good ones. Consequently, this
improves the batch training performance. Moreover, incremental learning with a particular threshold
value, together with semi-supervised learning, also helps to shorten the labeling time. The
combination of active learning and semi-supervised learning can be applied in several areas, such as
object tracking and object recognition. We observe that rollback based active semi-supervised
learning yields a lower error rate during training and works well with a small number of samples,
which is significant for incremental learning. Subsequently, as the model improves, the overfitting
problem is reduced. However, rollback has little effect when the model reaches saturation or when
the dataset is biased. We discuss the details of our technique in Chapter-3 and show the performance
results of various experiments in Chapter-4. We apply our rollback based active semi-supervised
learning to single-person action recognition and two-person interaction
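The forward-or-rollback decision described above can be sketched as follows. This is a simplified
illustration rather than the implementation evaluated in this thesis: train_on_batch and evaluate
are hypothetical callables supplied by the caller, and the acceptance rule (keep the new model only
if its score does not drop) is an assumption about how "performs poorly compared with the previous
batch" might be made concrete.

```python
def rollback_training(initial_model, batches, train_on_batch, evaluate):
    """Train batch by batch. A newly trained model is kept only if it does
    not score worse than the best model so far; otherwise the previous
    model is restored (rollback) and the batch is recorded as suspect."""
    best_model = initial_model
    best_score = evaluate(best_model)
    history = []
    for index, batch in enumerate(batches):
        candidate = train_on_batch(best_model, batch)
        score = evaluate(candidate)
        if score >= best_score:
            # Forward learning: carry the improved model to the next batch.
            best_model, best_score = candidate, score
            history.append((index, "forward", score))
        else:
            # Rollback: discard the candidate; the batch can then be
            # inspected for noisy or mislabeled samples (active learning).
            history.append((index, "rollback", score))
    return best_model, best_score, history


# Toy stand-ins (hypothetical): the "model" is just its accuracy in percent,
# and each batch shifts that accuracy by a fixed amount.
def toy_train(model, batch):
    return model + batch


def toy_evaluate(model):
    return model


model, score, history = rollback_training(70, [5, 3, -10, 2], toy_train, toy_evaluate)
```

In the toy run, the third batch lowers the score, so the model from the second batch is kept and
training continues from it with the fourth batch.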
In our proposed single-person action recognition, we include an ensemble learning based method that
exploits the VGG16 deep learning model (from the Visual Geometry Group at the University of Oxford)
with an incremental active learning framework that outperforms state-of-the-art technologies for
recognizing human actions. We deploy the VGG16 network in our research because of its promising
results. We consider five actions for single-person action recognition, namely using a computer,
jumping, reading, phoning, and taking a picture, in simple and complex environments. We apply our
method to two different benchmark datasets, such as the PASCAL VOC 2012 dataset and
In our proposed two-person interaction recognition method, we use a Faster R-CNN [34] based action
recognition framework built with the matconvnet [35] library and a trained CNN for each action.
Two-person interaction recognition is challenging because of the small training dataset. The
distinctive actions used in our experiment, such as kidnapping or linked arms, are not common in
benchmark datasets. We found two benchmark datasets, UT-Interaction and HMDB51, that contain actions
similar to those of the ITLab two-person interaction dataset, which are hugging, fighting, linked
arms, talking, and kidnapping. We use these two benchmark datasets in our experiments, which are
described in detail in Chapter-4.
1.5 Thesis Organization and Contribution
In this dissertation we develop a rollback based active semi-supervised learning method. In
particular, we develop a convolutional neural network based approach which uses active
semi-supervised learning
In Chapter 2 we provide an overview of the relevant background, such as supervised learning,
semi-supervised learning, active learning [36], convolutional neural networks, transfer learning, and
In Chapter 3 we give a detailed overview of our proposed method and its application to human action
recognition. The architecture is based on expected error reduction based active semi-supervised
learning.
In Chapter 4 we explain the detailed architecture of our rollback based ASSL system and its
application to human action recognition from images. The architecture is based on a convolutional
neural network combined with active learning and semi-supervised learning.
In Chapter 5 we present the experimental results that validate our proposed system for recognizing
human actions from image sequences and videos, including two-person interaction recognition. We also
describe
In Chapter 6 we conclude the thesis with our contributions and discuss the remaining challenges for
future research.
Chapter 2
Related Work
Human activity recognition in images and videos has received enormous attention in recent years.
This chapter provides a literature survey with the necessary background on human action recognition,
transfer learning, optimizers, and related work on single-person action recognition, double person
Image classification is a simple task for a person; however, the same task is not straightforward
for a computer and can be very challenging. When using a computer, we label a number of datasets
with a predefined set of categories, for example a set of furniture in which an object can be a
table, a chair, a desk, and so on. When we select a specific recognition domain we are restricted,
because other object domains could be things such as models of vehicles. The image classification
problem is called binary classification when only two classes are involved and multiclass
classification when several classes are involved. An example of binary classification is deciding
whether an image contains a person or not. The multiclass classification problem, on the other hand,
might be to determine the specific action performed by a person amongst hundreds of actions. The
second type of problem is denoted as the image categorization task. Selecting the right set of image
categories is an important task for practical applications. While classifying objects, we should
consider a few things, such as the type of object, non-objects, and inanimate things [37].
Fig. 2.1: A sample human action classification where a person is using a computer.
Research indicates that human-level performance has a five percent error rate on a scene
understanding task and also requires 50 milliseconds to classify the scene. In the early
classification challenges, shallow methods were used on benchmarks such as the Caltech 101 dataset,
in which most categories had about 50 images, collected in September 2003 [38]. Later, the Pascal
Visual Object Classes Challenge offered more realistic high-resolution images and became the
standard [39]. Nowadays the state-of-the-art database for classification is ImageNet [40], which is
organized according to the WordNet hierarchy. Each concept in WordNet, described by multiple words
or word phrases, is known as a “synset”. There are more than 100,000 synsets in the WordNet
hierarchy. ImageNet provided 1000 images for
We prepare our batch dataset with the help of supervised learning for labeling, as our objective is
to improve the performance of the pre-trained model. The objective is to learn a mapping from 𝑥 to
𝑦, where the training set consists of pairs (𝑥ᵢ, 𝑦ᵢ). These pairs are sampled from a distribution,
and the mapping is evaluated through its predictive performance. However, labeling is costly;
therefore, we keep the dataset small, with on average 10 images per action, and aim to achieve the
desired performance with less labeled data. Each batch dataset contains 50 images for five actions.
Of the two families of supervised learning, we lean on generative algorithms because of their
predictive nature [42]. In our framework we follow the generative procedure, which models the
class-conditional density 𝑃(𝑥|𝑦) with an unsupervised learning method. We discuss the necessity of
semi-supervised learning in detail
There are many occasions where few labeled training points are available but a large amount of
unlabeled data is given. In this scenario, SSL can play a big role. Mathematically, the knowledge of
𝑃(𝑥) gained through the unlabeled data has to carry information that is useful for inferring
𝑝(𝑦|𝑥). If this condition is not fulfilled, SSL offers no improvement over supervised learning. A
number of assumptions are useful for SSL, such as the smoothness assumption, the cluster assumption,
low-density separation, and the manifold assumption. Under the smoothness assumption, if two points
𝑥₁ and 𝑥₂ in a high-density region are close, then the corresponding outputs 𝑦₁ and 𝑦₂ should also
be close [43]. SSL seeks to exploit a large amount of unlabeled sample information alongside a set
of 𝑙 images, 𝑋, and uses the samples 𝑥 together with the available labels 𝑦 to improve
classification [22]. On the other hand, semi-supervised learning tries to train the model from
unlabeled samples [23]. Jones et al. developed a method called Feature Grouped Spectral Multigraph
(FGSM), which is produced by clustering [45]. Zhang et al. [46] proposed a boosting based multiclass
SSL method that formulates action recognition with
In our approach we consider the self-training (also known as self-learning) type of semi-supervised
learning. SSL training is a way of reducing the effort required to prepare the training set by
training the model with two types of data: a small number of fully labeled examples and an
additional set of tentatively labeled or unlabeled examples. A model trained in this way performs
better than one trained in the traditional manner. In tentatively labeled training data, the
labeling of each image area can take the form of a probability distribution over labels. This makes
it possible to capture a variety of information about the training examples, for example that a
specific image has a high likelihood of containing the
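A minimal sketch of the selection step in self-training is given below, assuming that each
tentatively labeled image comes with a probability distribution over labels. The function name
select_pseudo_labels, the threshold of 0.9, and the example probabilities are hypothetical; the
sketch only shows how confident predictions could be accepted as tentative labels while uncertain
ones are set aside, for example for manual labeling.

```python
def select_pseudo_labels(predictions, threshold=0.9):
    """Split predictions into confident pseudo-labels and uncertain samples.

    predictions maps a sample id to a dict of label -> probability.
    Returns (confident, uncertain): confident maps sample ids to the
    accepted label, uncertain lists the ids left for further labeling.
    """
    confident, uncertain = {}, []
    for sample_id, probabilities in predictions.items():
        # Most probable label and its probability for this sample.
        label, probability = max(probabilities.items(), key=lambda item: item[1])
        if probability >= threshold:
            confident[sample_id] = label
        else:
            uncertain.append(sample_id)
    return confident, uncertain


# Illustrative model outputs for three unlabeled images (made-up values).
predictions = {
    "img1": {"jumping": 0.95, "reading": 0.03, "phoning": 0.02},
    "img2": {"jumping": 0.50, "reading": 0.45, "phoning": 0.05},
    "img3": {"jumping": 0.01, "reading": 0.02, "phoning": 0.97},
}
confident, uncertain = select_pseudo_labels(predictions)
```

Raising the threshold admits fewer but cleaner tentative labels, while lowering it enlarges the
training set at the risk of propagating wrong labels.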
The specialty of this algorithm is that it can select the data from which it learns. Therefore, it
can perform better with less training data [33]. One theoretically motivated query strategy is known
as query-by-committee. Our framework is motivated by this strategy in order to achieve better
performance with less labeled data. Generally, AL achieves high performance using only a limited
number of labels and thus minimizes the labeling cost. Because of this advantage, many machine
learning applications use AL. Labels for supervised learning tasks, such as speech recognition,
document classification, and information extraction, are expensive, laborious, and difficult to
obtain, and AL aims to overcome this effort [47]. In active learning [33], batch-mode active
learning performs much better than single mode when handling a huge amount of information because
it is less costly [48]. It is a balanced strategy in which AL poses queries with the aim of
minimizing the labeling job [49]. Bernard et al. conduct a three-part study of labeling strategies:
they identify different Visual Interactive Labeling (VIL) approaches, investigate the strengths and
weaknesses of the labeling strategies, and determine the impact of using different encodings to
direct the visual interactive labeling process [50]. The VIL method needs fifty labeling iterations
and a cold start to tackle the AL problem, which can be intensive. In the future, the empirical
performance of VIL can rely on a user based information selection method. Huo et al. [51] present a
framework for the binary classification problem known as Second Order Active Learning (SOAL). SOAL
can use both first- and second-order information in an efficient query approach to predict incoming
unlabeled data and
We present the ASSL framework, which combines the deep features mentioned in Sections 2.3 and 2.4
of Chapter 2 with a batch-mode algorithm similar to [36]. We use batch-mode active learning, which
suits our image sets [36][33]: too little training data cannot produce satisfactory performance, whereas
10 images per action show a better result in a single batch of the dataset. Therefore, we adjusted the
number of images for each human action in order to obtain better performance. ASSL initially trains a
deep convolutional neural network from a pre-trained network and repeatedly retrains the next model
by adding a new batch of data. Merging the batch data with the deep feature model and incrementally
training the model continues until the model reaches saturation. In addition, too few iterations lead to
poor performance; we therefore set the minimum number of iterations to 50,000 for the Intelligent
Technology Laboratory (ITLab) dataset. Usually, a large number of iterations requires a lengthy
training time to update the weights. We discuss ASSL and EER-ASSL in more detail in Chapter 3.
Most previous work on active learning considers three main types of sampling methods for unlabeled
instances [52]. All of them assume that queries take the form of unlabeled instances, which are later
labeled by the oracle. Query synthesis allows AL to query any unlabeled instance. Essentially, the
learner synthesizes queries from scratch, using an algorithm that samples instances from a different
distribution. This query selection process is optimized for a number of advantages, such as improved
efficiency. The model that produces the data is sometimes not available, so it is not possible to check
ahead of time whether a synthesized query is suitable for the oracle.
data. Usually, data are sampled from the real world, where in many cases the input distribution is
uniform and unknown. Stream-based selective sampling is suitable for real-world human action
recognition, as the movements of a particular person in video samples are gathered at regular intervals.
In many cases, a large amount of data can be gathered at one time, and this is where pool-based learning
becomes handy. The learner selects data from the pool to query in a greedy manner. Pool-based learning
evaluates and ranks the whole dataset before selecting the best query. In contrast, stream-based
sampling scans the data sequentially and makes a query decision for each point separately. Pool-based
sampling is very popular and has been studied extensively in the literature.
to the number of advantages. In this sampling procedure, queries are selected by allowing the algorithm
to query the instances about which it is least certain. Least confident: the algorithm simply looks at the
posterior probability of each instance and chooses the one nearest to 0.5. In the case of several class
labels, selecting the
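As a minimal sketch of the least-confident rule described above, the following Python fragment picks the query from a small pool. The pool values and function names are hypothetical, and the multi-class rule shown (query the instance whose most likely label has the lowest posterior) is the commonly used variant, assumed here rather than taken from this text.

```python
def least_confident_binary(posteriors):
    """Index of the instance whose positive-class posterior is nearest 0.5."""
    return min(range(len(posteriors)), key=lambda i: abs(posteriors[i] - 0.5))


def least_confident_multiclass(prob_rows):
    """Index of the instance whose most likely label has the lowest posterior.

    Assumption: this is the usual multi-class least-confidence rule.
    """
    return min(range(len(prob_rows)), key=lambda i: max(prob_rows[i]))


# Hypothetical posterior probabilities for an unlabeled pool.
pool_binary = [0.95, 0.48, 0.10, 0.70]
pool_multi = [[0.8, 0.1, 0.1], [0.4, 0.35, 0.25], [0.6, 0.3, 0.1]]

query_binary = least_confident_binary(pool_binary)
query_multi = least_confident_multiclass(pool_multi)
print(query_binary, query_multi)
```

In both toy pools the second instance is selected, because the model is least sure about it.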
2.7 Convolutional Neural Network
Convolutional neural networks (CNNs) are a special kind of neural network for processing large data,
especially in computer vision tasks such as human action recognition. CNNs have played an important
role in practical applications of deep learning. A CNN consists of an input layer and an output layer with
multiple hidden layers in between, which comprise convolutional layers, pooling layers, and fully
connected layers. The input layer mainly holds the pixel intensities of the image. Generally, an input
example is a multi-dimensional array of height × width × three color channels (red, green, blue), e.g. a
256 × 256 × 3 color image. A convolutional layer takes images from the previous layer and applies a
specified number of filters to create images known as output feature maps. The number of output
feature maps equals the number of filters. Convolution is an operation on two functions of a real-valued
argument and can be viewed as a matrix multiplication [53]. Consider a weighting function 𝑤(𝑎), where
𝑎 is the age of a measurement. Applying this weighted average operation at every moment gives a new
function 𝑠(𝑡), which provides the position of the spaceship:
$$s(t) = \int x(a)\, w(t - a)\, da$$
Here, 𝑤 must be a valid probability density function; otherwise the output will not be a weighted
average. Generally, a convolutional layer applies the convolution operation and then sends the result to
the next layer. These connected networks are used to learn features and classify data. In convolutional
network terminology, the first argument is referred to as the input (e.g. the function 𝑥) and the second
argument as the kernel (e.g. the function 𝑤). The output is sometimes known as the feature map.
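The weighted-average view of convolution can be checked numerically. The sketch below is a discrete analogue of the operation, assuming sampled values in place of the continuous functions; the input readings and kernel weights are hypothetical.

```python
def convolve(x, w):
    """Full discrete convolution: s[t] = sum over a of x[a] * w[t - a]."""
    s = [0.0] * (len(x) + len(w) - 1)
    for a, x_a in enumerate(x):
        for k, w_k in enumerate(w):
            s[a + k] += x_a * w_k
    return s


x = [1.0, 2.0, 3.0, 4.0]  # input: hypothetical sampled measurements
w = [0.5, 0.3, 0.2]       # kernel: non-negative weights that sum to 1
feature_map = convolve(x, w)
print(feature_map)
```

Because the kernel weights sum to 1, each interior output is a weighted average of neighbouring inputs, and the total of the output equals the total of the input.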
Figure 2.2: A convolutional neural network. The network takes a single input image and computes N
A typical CNN layer consists of three stages. In the first stage, convolution operations produce a set of
linear activations. In the second stage, each linear activation is run through a nonlinear activation
function, such as the rectified linear activation function, also known as ReLU. In the third stage, a
pooling function is used, for example a max-pooling layer, which takes the maximum value from each of
Transfer learning [26] is a popular approach in deep learning that applies knowledge from one domain
to another. In this setting, transfer learning [26] improves performance significantly, without much
effort for data labeling. When the distribution changes, most machine learning models must be rebuilt
from scratch using newly gathered training data. Collecting new training data and preparing a new
model is expensive. Transfer learning has been applied successfully in many domains, such as Web
document classification [27]. Wu and Dietterich [28] apply transfer learning to an image classification
problem that requires a large amount of labeled information, which is very costly, but
2.8.1 VGG
VGG is a deep convolutional neural network architecture developed by the Oxford Visual Geometry
Group for large-scale image classification, in which the convolution filters are very small (3 × 3) and
the depth is 16 to 19 weight layers [28]. VGG achieved first and second place in the ImageNet
localization and classification tasks, respectively, in 2014. During training of this architecture, the
input size of the ConvNet is 224 × 224. To prepare the ConvNet input, images are randomly cropped and
rescaled. The network has three fully connected (FC) layers. The first two have 4096 channels each. The
third contains 1000 channels, one for each class in ImageNet [1] classification. The final layer is the
softmax layer.
2.8.2 Inception
Inception is a deep convolutional neural network architecture developed for the large-scale image
classification challenge; it improves the utilization of computing resources inside the network [58]. In
this architecture, the depth and width of the network are increased while the computational budget is
kept the same. Generally, most connections between the activations, i.e. between output channels and
input channels, are sparse. The Inception module has several versions, such as the naive version and the
dimension-reduction version. In the naive version, one problem is that even a modest number of 5 × 5
convolutions can be expensive when the number of filters is large. The problem gets bigger when a
pooling layer is added. Therefore, this design might cover the optimal sparse structure, but it does so
very inefficiently. This leads to the dimension-reduction version, where 1 × 1 convolutions are used to
compute reductions before the 3 × 3 and 5 × 5 convolutions. These 1 × 1 convolutions also use rectified
linear activation. For memory efficiency, it is helpful to use Inception modules only at the higher layers
and keep the lower layers as usual. This yields a significant quality gain at a small increase of
2.9 Optimizer
2.9.1 Adam
Adam is a simple and computationally efficient first-order gradient-based optimization algorithm [30].
It optimizes stochastic objective functions and requires only first-order gradients and little memory.
Adam combines the advantages of AdaGrad [60], which works well with sparse gradients, and
RMSprop [61], which works well in non-stationary settings. One advantage of Adam is that it does not
require a stationary objective. The efficiency of the algorithm can be improved by changing the order of
computation. One important feature of Adam is that its step sizes are chosen carefully. Adam updates
are calculated using running averages of the first and second moments of the gradient. Adam often
outperforms other methods on multilayer neural network models with non-convex objective functions.
Adam requires little memory, since it stores only the running averages of the first and second moments
for each parameter. Stochastic regularization methods such as dropout are an effective way of
preventing overfitting. Adam is robust and suitable for a wide variety of non-convex optimization
problems, and it can effectively solve practical deep learning problems. We therefore included Adam in
our rollback-based ASSL
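To make the running averages of the first and second moments concrete, the following is a minimal sketch of the standard Adam update applied to a toy quadratic. The objective, the starting point, the step count, and the hyperparameter values are illustrative assumptions, not the settings used in this work.

```python
import math


def adam_minimize(grad, theta, steps=3000, lr=0.01,
                  beta1=0.9, beta2=0.999, eps=1e-8):
    """Minimise a function given its gradient, using the standard Adam update."""
    m = [0.0] * len(theta)  # running average of the gradient (first moment)
    v = [0.0] * len(theta)  # running average of the squared gradient (second moment)
    for t in range(1, steps + 1):
        g = grad(theta)
        for k in range(len(theta)):
            m[k] = beta1 * m[k] + (1 - beta1) * g[k]
            v[k] = beta2 * v[k] + (1 - beta2) * g[k] ** 2
            m_hat = m[k] / (1 - beta1 ** t)  # bias-corrected first moment
            v_hat = v[k] / (1 - beta2 ** t)  # bias-corrected second moment
            theta[k] -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta


def quadratic_grad(theta):
    """Gradient of the toy objective (a - 3)^2 + (b + 1)^2."""
    return [2 * (theta[0] - 3.0), 2 * (theta[1] + 1.0)]


solution = adam_minimize(quadratic_grad, [0.0, 0.0])
print(solution)
```

The optimizer state is two numbers per parameter (`m` and `v`), which is why the memory overhead stays small.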
2.9.2 RMSprop
A common problem with large networks is that, if we start with a large learning rate, the weights of the
hidden units also become large in magnitude, either very positive or very negative. In this case, the
error derivatives for the hidden units become very small and the error does not decrease. There are
several ways to speed up mini-batch learning. One is RMSprop, an optimization algorithm similar to
Adam [30]. RMSprop with momentum generates its parameter updates using momentum on the
rescaled gradient. RMSprop lacks a bias-correction term, which matters when 𝛽2 is close to 1. For each
weight, RMSprop develops further the idea of using the sign of the gradient and adapting the step size
separately for each weight [61]. Sutskever et al. [61] combined RMSprop with Nesterov momentum,
which performs well. We adopted the RMSprop optimizer when training on the two-person interaction
dataset and obtained outstanding performance, as shown in Chapter 4.
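The per-weight rescaling in RMSprop can be sketched as follows. This is the basic update without momentum; the learning rate, decay factor, and gradients are hypothetical values chosen for illustration.

```python
import math


def rmsprop_step(theta, grad, cache, lr=0.01, decay=0.9, eps=1e-8):
    """One basic RMSprop update (no momentum, no bias correction)."""
    new_theta, new_cache = [], []
    for th, g, c in zip(theta, grad, cache):
        c = decay * c + (1 - decay) * g ** 2  # running average of squared gradient
        new_cache.append(c)
        new_theta.append(th - lr * g / (math.sqrt(c) + eps))
    return new_theta, new_cache


# Two hypothetical gradients that differ by a factor of 100.
small, _ = rmsprop_step([0.0], [4.0], [0.0])
large, _ = rmsprop_step([0.0], [400.0], [0.0])
print(small, large)
```

Both first steps have almost the same size, `lr / sqrt(1 - decay)`, which shows two points from the text: the update depends mainly on the sign of the gradient rather than its magnitude, and without bias correction the first step is larger than `lr`.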
2.10 Related work of Human Action Recognition
Video-based human action recognition has received significant attention from computer vision
researchers over the last two decades [15][62]. A number of algorithms have been developed to
recognize human actions from videos efficiently. Laptev et al. [63] used the Harris and Förstner interest
point operators and detected local structures that have substantial local variations in both space and
time. In that work, each event is classified with a jet descriptor, and the detected points correspond to
meaningful events.
Laptev and Pérez [64] applied boosted space–time window classifiers that use human motion and shape
within an action. They worked on realistic scenarios, such as movies with variations of actions in terms
of subject appearance, motion, surrounding scenes, viewing angles, and spatio-temporal extent.
Iwashita et al. [65] used global motion descriptors that first combine dense optical flows and then local
binary patterns (LBPs). Optical flows are separated into categories, and the number of flows in each
category is counted. Each scene is divided into an 𝑆 by 𝑆 grid, where 𝑆 can be 3, with eight motion
directions. As a result, a histogram of optical flow (HOF) with 𝑆 by 𝑆 by 8 bins is produced. The
descriptor in each grid cell is constructed from the optical flow over a short time interval of 0.2
seconds. Global motion descriptors of the optical flow can easily identify body shake and ball play.
LBP-based features capture the relations between pixel values in the neighborhood of a reference pixel;
a dimensionality reduction method is applied afterwards. Cuboid and STIP feature detectors were also
used. For cuboids, they proposed normalized pixel values for the STIPs [66] using HOG and HOF. They
also applied a dimensionality reduction method to compute local motion descriptors with 100
dimensions.
Considerable advances have been made in person detection in images, such as STIP, HOG, bag of words
(BOW) [67], the dense trajectory–based approach, and the HMM-based approach. Most recently, the
CNN-based approach [20][16] has gained popularity due to its outstanding performance in the
PASCAL VOC and ImageNet challenges [1]. These two benchmark datasets have contributed a lot to
improving object recognition in images and videos. Additionally, methods such as the Regional CNN
(RCNN) [68] have played important roles, whereas in our work we focus on semi-supervised learning
with an active learning approach. Most relevant to our method are poses in human–object interaction
activities [69] and recognizing actions through action-specific person detection [70]. The continuous
learning framework uses a dataset similar to the one used in our approach. The use of deep features for
person detection, where each image file has its object bounding-box information stored in a separate
XML file, has increased many-fold because of its flexibility. Consider object detection competitions
such as PASCAL VOC, ImageNet, and Microsoft Common Objects in Context (COCO). The bounding
box keeps the region
Human action recognition is a wide research area in which many algorithms exist. Previous work on
human action recognition includes manually engineered features, for instance histograms of oriented
gradients (HOG) [8], space-time interest points (STIP) [15], dense trajectories [71], Fisher vectors [72],
the deformable part model (DPM) [21], and bag of words (BOW) [73], which are popular for human
action recognition. With the popularity of deep models such as long-term temporal convolutions
(LTC-CNN) [63][74], which learn multiple layers and generate high-level classifications, ConvNets
[68][34] have gained a good reputation. Moreover, CNNs [75][58] have developed considerably in
recent years. We apply AL [33] on top of SSL, supported by a number of sampling strategies. The
substantial difference in our method is that it uses iterative forward learning, backward learning, and
sample selection, and combines both SSL and AL.
Many studies have found that action patterns can be learned by convolutional neural networks and
recurrent neural networks [76]. Both CNNs and RNNs have advantages and disadvantages: CNN-based
methods are good at learning appearance but not at long-term motion dynamics, whereas RNNs are
able to learn temporal motion dynamics. Lin et al. [76] proposed lattice-LSTM, which extends LSTM
with independent hidden-state transitions of memory cells. They enhance the LLSTM model and
provide a multi-model training process with diagonally time. Zhe Cao et al. [77] proposed a popular
method that can detect multiple people in videos and images in real time. They applied a nonparametric
representation, referred to as part affinity fields (PAFs), to learn the body parts. They also present a
body and foot keypoint detector that works for hand and facial keypoints as well [77]. Yun Han et al.
proposed a handcrafted cued LSTM model for human action recognition that works on 25 skeleton
joints in 3D coordinates. Their work is mainly based on human posture and human
2.11 Summary
In this chapter we have presented some of the state-of-the-art techniques used in our framework and a
typical workflow by which our framework takes input data for training. This involves supervised
learning, semi-supervised learning, active learning, transfer learning, pre-trained networks such as VGG
and Inception, optimizers such as Adam and RMSprop, and related work on classifying human actions.
Optimization is often beneficial during training on our datasets, such as UT Interaction [79] and
HMDB [80]. We have applied both the Adam and RMSprop optimizers in our experiments. We discuss
the details of our algorithms and methodologies for classifying different human actions in Chapter 3
and Chapter 4.
Chapter 3
In this chapter, we present the proposed method in detail. We explain the necessity of the
rollback-based ASSL framework. We analyze forward and rollback-based active semi-supervised
learning, where collaborative sampling and expected error reduction influence the model performance.
We show step by step the process of the Expected Error Reduction based Active Semi-supervised
Learning (EER-ASSL) method for human action classification. Finally, we discuss
Active learning is a powerful learning methodology that can produce an efficient classifier from a small
amount of labeled data [81]. In our work, we adopted active learning in a pool-based (batch) manner,
which selects more informative training samples with less redundancy. The large portion of our
unlabeled data gathered in the ITLab environment, as well as the benchmarks, is then exposed to an AL
framework similar to query by committee (QBC) in order to minimize the version space [82]. The main
objective is to improve learning performance with a minimum number of queries. Much previous
active learning research focuses on selecting a single unlabeled instance per iteration, which is not
efficient. Batch-mode active learning addresses this problem by training on a subset of images together
while building the classification model [83]. In our rollback-based ASSL framework, we prepared both
the training and test datasets in batch mode. On average, each batch contains 25 images from five
action categories. The details are shown in Chapter 5 together with the experimental results.
We consider a provisionally labeled training dataset $DB_{provisional} = \{x_i\}_{i=1}^{l}$ and a human
action model (softmax classifier) 𝑆, and we compute the prediction score over 𝑁 classes for each
bounding box as given below, similarly to Hasan et al. [17] and Rhee et al. [36]. The current classifier is
applied to each image in $DB_{provisional}$, and the probability that an image $x_i$ belongs to class 𝑞 is
defined as:
$$S(x_i) = p(y_i = q \mid x_i; WV) = \frac{\exp(WV_q^{T} x_i)}{\sum_{m=1}^{N} \exp(WV_m^{T} x_i)} \qquad (3.1)$$
where $q \in \{1, \dots, N\}$ is a class label and $WV_q$ is the corresponding weight vector of class 𝑞.
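A minimal sketch of the softmax score in Eq. (3.1) is given below. The weight vectors and the feature vector are hypothetical toy values; subtracting the largest logit before exponentiating is a standard numerical-stability step that does not change the result.

```python
import math


def softmax_scores(weights, x):
    """p(y = q | x; WV) for every class q, following Eq. (3.1)."""
    logits = [sum(w_k * x_k for w_k, x_k in zip(w_q, x)) for w_q in weights]
    shift = max(logits)  # stability shift; cancels in the ratio
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical weight vectors for N = 3 classes and one 2-D feature vector.
WV = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
x_i = [2.0, 1.0]

scores = softmax_scores(WV, x_i)
predicted_class = max(range(len(scores)), key=lambda q: scores[q])
print(scores, predicted_class)
```

The scores sum to 1, and the class with the largest inner product with the feature vector receives the highest probability.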
For each object class, a set of samples is sequentially selected and scored with 𝑆(𝑥𝑖) using the current
classifier. The region of interest (ROI) is preserved for both the provisional dataset
𝐷𝐵𝑝𝑟𝑜𝑣𝑖𝑠𝑖𝑜𝑛𝑎𝑙 and the batch dataset 𝐷𝐵𝑏𝑎𝑡𝑐ℎ; to increase productivity, the rest of the image is removed
as part of our pre-processing in order to obtain a noise-free sample.
We explore a basic AL technique proposed by Cohn et al. for nontrivial improvements [82]. We
followed a similar technique to eliminate noisy images from the batch dataset. Under this assumption,
we consider 𝑀 to be a classifier model trained with the provisionally labeled dataset. In this algorithm,
we request an image from each batch 𝐵𝐷 and, based on its evaluation score against 𝑀, eliminate the
noisy image 𝐸𝑖. Here, 𝑇𝐻 is the threshold and 𝐶𝐿 is the correctly labeled batch data. A formal
definition of the algorithm is given below:
Input: Unlabeled batch data, pre-trained model trained with provisionally labeled dataset
Output: Labeled batch data 𝐶𝐿
𝑩𝒆𝒈𝒊𝒏
𝐵𝐷 = 𝐶𝐿
𝑡 ← 𝑡 + 1
𝑒𝑙𝑠𝑒
𝐵𝐷 ← 𝐵𝐷 – 𝐸𝑖
𝐼𝑓 𝑡 = 𝑛
𝑬𝒏𝒅
𝑬𝒏𝒅
The algorithm describes the process of using AL to eliminate noisy images. It is part of the main
algorithm and removes unwanted images that are wrongly labeled and have a low evaluation score. One
of the challenges of working with AL is that the training data are very small and the learner has to rely
on feedback-driven queries. In addition, when selecting multiple examples in batch mode, it may select
similar images. Moreover, Dasgupta et al. [84] showed that exponential improvement is not always
achievable in AL.
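The elimination step can be sketched as a simple threshold filter. This is an illustration under stated assumptions, not the exact procedure: an image is treated as noisy when the score that the model 𝑀 gives to its tentative label falls below 𝑇𝐻. The score table, image identifiers, labels, and threshold value are all hypothetical.

```python
def filter_noisy_images(batch, score_fn, threshold):
    """Split a batch BD into kept items (CL) and eliminated noisy items (E)."""
    kept, eliminated = [], []
    for image_id, label in batch:
        # Assumption: a score below the threshold TH marks the image as noisy.
        if score_fn(image_id, label) >= threshold:
            kept.append((image_id, label))
        else:
            eliminated.append((image_id, label))
    return kept, eliminated


# Hypothetical evaluation scores a trained model M might assign.
toy_scores = {"img_01": 0.92, "img_02": 0.35, "img_03": 0.81}
batch_bd = [("img_01", "walking"), ("img_02", "reading"), ("img_03", "walking")]

correctly_labeled, noisy = filter_noisy_images(
    batch_bd, lambda image_id, label: toy_scores[image_id], threshold=0.5
)
print(correctly_labeled, noisy)
```

In practice `score_fn` would wrap the classifier's prediction score for the tentative label, and the threshold would be tuned on validation data.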
To address the above issues, we combined AL and SSL. While active learning focuses on exploring the
unknown aspects, semi-supervised learning exploits the unlabeled data. ASSL leverages the advantages
of both active learning and semi-supervised learning. Our rollback-based ASSL iteratively exploits the
unlabeled data via both active learning and semi-supervised learning. One of the main objectives of
using ASSL is that a smaller amount of labeled data can achieve prediction performance similar to that
obtained with a larger amount of data. Our experimental results show that ASSL achieves better
performance than conventional learning techniques. Because semi-supervised learning and active
learning exploit the unlabeled data in different ways, the
Finally, we evaluate each set of batch data 𝐷𝐵𝑏𝑎𝑡𝑐ℎ against the trained model 𝑀0 and check the
evaluation score, which we consider as 𝑇𝐻. Based on the evaluation score, we decide whether to retrain
on or eliminate the 𝐷𝐵𝑏𝑎𝑡𝑐ℎ image sample. We keep retraining with 𝐷𝐵𝑏𝑎𝑡𝑐ℎ until convergence. In this
way we obtain the optimally labeled images 𝐷𝐵𝑜𝑝𝑡𝑖𝑚𝑎𝑙 and the optimized model 𝑀𝑜𝑝𝑡𝑖𝑚𝑎𝑙. Moreover,
depending on the 𝐷𝐵𝑜𝑝𝑡𝑖𝑚𝑎𝑙 evaluation score, we decide when to use the rollback option and which
In this section we present the network architecture used in our experiments. The primary benefit of
using a deep convolutional network in our method is that, instead of training the network from scratch,
we can exploit a network model pre-trained on a large dataset. We use the VGG16 pre-trained model
from the Oxford Visual Geometry Group [20][33] and the Inception model [59] in our experiments.
The sample input images are passed through the CNN. Our network architecture follows principles
similar to those of Simonyan and Zisserman [28]. First, each pre-processed image of size (𝐻 × 𝑊) is
taken as the input of the network. The rescaled image is passed through a number of convolutional
layers with small 3 × 3 filters. The padding is 1 pixel for every 3 × 3 convolutional layer. Pooling is
performed by five max-pooling layers. Let 𝑦 denote the output obtained from the input 𝑥 by max
pooling, where 𝐻′ and 𝑊′ are the height and width of the pooling window [35].
$$y_{i''j''d} = \max_{1 \le i' \le H',\ 1 \le j' \le W'} x_{i''+i'-1,\ j''+j'-1,\ d} \qquad (3.2)$$
For max pooling, the stride is 2 and the window is 2 × 2 pixels. The first two fully connected layers have
4096 channels each, and the last fully connected layer has 1000 channels for classification [1]. The final
layer is the softmax layer. The softmax operator [35] is calculated as follows:
$$y_{ijk} = \frac{e^{x_{ijk}}}{\sum_{t=1}^{D} e^{x_{ijt}}} \qquad (3.3)$$
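The max-pooling operator of Eq. (3.2), with the 2 × 2 window and stride of 2 stated above, can be illustrated on a single feature channel. The 4 × 4 input values below are hypothetical.

```python
def max_pool2d(x, window=2, stride=2):
    """Max pooling over one channel given as a list of rows."""
    rows = (len(x) - window) // stride + 1
    cols = (len(x[0]) - window) // stride + 1
    return [
        [
            max(
                x[r * stride + i][c * stride + j]
                for i in range(window)
                for j in range(window)
            )
            for c in range(cols)
        ]
        for r in range(rows)
    ]


# Hypothetical 4 x 4 feature channel.
channel = [
    [1, 3, 2, 0],
    [4, 2, 1, 5],
    [0, 1, 7, 2],
    [3, 2, 1, 8],
]
pooled = max_pool2d(channel)
print(pooled)  # [[4, 5], [3, 8]]
```

Each output value is the maximum of one non-overlapping 2 × 2 patch, so the spatial resolution is halved in both directions.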
All hidden layers are equipped with the Rectified Linear Unit (ReLU) activation function. ReLU
3.3 EER-ASSL Learning
The proposed method uses collaborative sampling in EER-ASSL. EER-ASSL incorporates the rapid
modeling capability of the EER method [86][87] and the incremental learning capability of CNNs [88].
The method takes advantage of the AL algorithm and the bin-based SSL algorithm with rollback
functionality. It minimizes training time while maintaining a high-quality labeled dataset and a highly
accurate person detector in an adaptive learning process [89]. The collaborative sampling method can
select more informative and reliable samples with low redundancy. Since the uncertainty criterion can
trigger the selection of noisy or redundant samples, the diversity
This is a brief sketch of the EER-ASSL method used for rapid adaptive object detection in a
dynamically changing environment. A batch of samples, rather than a single image at a time, is selected
from an image stream based on uncertainty and diversity sampling. After collaborative sampling is
finished, the samples are divided into bins for the rollback SSL algorithm. Each bin consists of
unlabeled samples, which are pseudo-labeled by the bin-based SSL using both the CNN model and the
EER-based rollback learning method. In many cases the pseudo-labeled dataset contains incorrectly
labeled samples or biased labels. The noisy samples should be excluded from the confident samples,
since such samples have a harmful effect on building a better person detector. The proposed EER-based
learning method applies rapid forward and rollback learning, and this leads to more
We use a very limited amount of labeled data to build the EER model and the fine-tuned CNN model.
The ensemble network consists of the CNN detector and the EER model and conducts person
detection. Incremental ASSL is incorporated: a batch of data samples is collected from an input data
stream and selectively sampled by the collaborative sampling algorithm [88]. We partition the selected
samples into a number of bins. The initially labeled dataset and the dataset pseudo-labeled by the CNN
and EER ensemble are used to train new CNN and EER models, which are used to update the models
for the next bin cycle. The new EER model is involved in the rollback learning process, which consists
of removing, relabeling, and reselecting samples from the bin, if necessary. The bin-based incremental
learning proceeds with forward learning for sample reselection and rollback learning for removal and
relabeling. The new CNN model is also used in the next round of collaborative sampling. The process is
repeated until convergence.
The expected error reduction (EER) method, which is popular in pattern classification problems
[86][87][90][91], is deployed in EER-ASSL for improved performance. The main goal of the EER
method is to select samples that reduce the generalization error in the next step. Because we cannot
inspect the test dataset in advance, a portion of the validation dataset is used to calculate the future
error. The future errors are approximately calculated similarly using the expected log-loss over the
unlabeled data set, where p ≪ 𝑞. If a selected sample 𝑥 is labeled 𝑦 and added to 𝐿, the result is denoted
by 𝐿+ = 𝐿 ∪ (𝑥, 𝑦). Let 𝑔𝐿 denote the EER model trained on 𝐿 and 𝑔𝐿+ the model trained on 𝐿+. we
adopt the log loss, EER
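$$x^{*} = \arg\min_{x \in U} \sum_{y \in C} P(y \mid x; g_L) \times \left( -\sum_{u \in U} \sum_{y' \in C} P(y' \mid u; g_{L^{+}}) \log P(y' \mid u; g_{L^{+}}) \right) \qquad (3.5)$$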
where 𝐶 represents the person classes, the first term 𝑃(𝑦|𝑥; 𝑔𝐿) denotes the label information of the
current model, and the second term is the sum of the expected entropy over the unlabeled data 𝑈 under
the model 𝑔𝐿+. Eq. (3.5) represents the serial-mode learning process, in which the model is updated
immediately after each new data sample in 𝑈 is labeled. Considering the bin 𝐵𝑖, Eq. (3.5) is rewritten as
follows:
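$$x^{*}_{B_i} = \arg\min_{x \in B_i} \sum_{y \in C} P(y \mid x; g_L) \times \left( -\sum_{u \in B_i} \sum_{y' \in C} P(y' \mid u; g_{L^{+}}) \log P(y' \mid u; g_{L^{+}}) \right)$$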
Here, the first term represents the label information of the current model, and the second term
represents the sum of the expected entropy over the unlabeled data bin 𝐵𝑖 under the model 𝑔𝐿+.
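The selection rule can be illustrated with a small, self-contained sketch. It follows the description above: for each candidate, the current model's label probabilities weight the total entropy that a retrained model assigns to the unlabeled bin. Everything else is an assumption made for illustration: the nearest-centroid model stands in for the EER model 𝑔, samples are scalar features, and the class names and data are hypothetical.

```python
import math


def fit_centroids(labeled):
    """Toy stand-in for g_L: one centroid per class from (x, y) pairs."""
    sums, counts = {}, {}
    for x, y in labeled:
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}


def predict_proba(model, x):
    """P(y | x; g): softmax over negative squared distances to centroids."""
    logits = {y: -(x - c) ** 2 for y, c in model.items()}
    top = max(logits.values())
    exps = {y: math.exp(z - top) for y, z in logits.items()}
    total = sum(exps.values())
    return {y: e / total for y, e in exps.items()}


def entropy(probs):
    return -sum(p * math.log(p) for p in probs.values() if p > 0.0)


def eer_select(labeled, unlabeled):
    """Return the candidate with the lowest expected entropy after labeling."""
    current = fit_centroids(labeled)
    best_x, best_risk = None, float("inf")
    for x in unlabeled:
        risk = 0.0
        for y, p_y in predict_proba(current, x).items():
            retrained = fit_centroids(labeled + [(x, y)])  # plays the role of g_{L+}
            risk += p_y * sum(entropy(predict_proba(retrained, u)) for u in unlabeled)
        if risk < best_risk:
            best_x, best_risk = x, risk
    return best_x, best_risk


L = [(0.0, "walk"), (4.0, "run")]  # hypothetical labeled set
U = [0.5, 1.9, 2.1, 3.5]           # hypothetical unlabeled bin
query, expected_entropy = eer_select(L, U)
print(query, expected_entropy)
```

Retraining once per candidate and per label is what makes the serial form expensive, which is the motivation for restricting the search to one bin at a time.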
The forward learning process mainly takes care of the bins for reselection and retraining. The
reselected samples are then put into the current labeled dataset in order to retrain the CNN model for
the bin-based SSL.
The goal of the rollback learning process is to find the samples with uncertain labels that hamper the
current model's performance and to replace each such sample with a new one or relabel it. Rollback
learning can select a particular model based on the evaluation score, whereas feedback learning can
only acknowledge the outcome. Moreover, rollback can return to its previous state. We use EER
sampling to minimize the expected entropy over the pseudo-labeled data samples. To limit computation
time, we inspect only the most recent pseudo-labeled dataset instead of the entire dataset. Using the
rollback process, we remove or relabel pseudo-labeled samples. The rollback learning process verifies
the label of a rollback sample by relabeling it or reselecting from its neighborhood. The rollback
learning has two
3.5.1 Removal process
The main objective of EER rollback learning is to minimize the expected entropy. Rollback learning
can be divided into two steps: a) removing unnecessary pseudo-labels that hamper the performance of
the EER model, and b) relabeling the samples to update the model. The rollback removal process is
formulated as follows:
Here, 𝐿\(𝑥, 𝑦†) denotes the labeled dataset 𝐿 without the sample (𝑥, 𝑦†). As Eq. (3.7) requires heavy
computation and is not computable in practice, rollback samples are selected from the pseudo-labeled
data samples of the current step.
In the AL process, a batch of data samples is collected from an input data stream, processed by the
collaborative sampling algorithm to obtain informative samples with minimum redundancy, and then
partitioned into bins. EER is combined with bin-based SSL for rapid adaptive learning. Instead of a
large dataset, a limited number of labeled samples is used to create the CNN model. If the performance
of the CNN degrades during learning, the EER method is applied for rollback learning. The labeled
dataset, denoted 𝐿𝐷, is enlarged by adding the pseudo-labeled data samples. The enlarged 𝐿𝐷 is used by
the CNN and EER models. The process is repeated until convergence. Therefore, the EER rollback
model provides a rapid short-term adaptation, and a confident and the CNN detector model an
Let $D_{div}$ denote the samples mined from the current batch after collaborative sampling; a detailed
discussion can be found in [88]. The rollback bin-based SSL algorithm is explained below. Let
$D_{\Delta}$ denote the confident batch dataset for the bin-based SSL. When the cardinality of
$D_{\Delta}$ reaches the confidence parameter $\gamma$, the confident sample selection process stops.
$D_{\Delta}$ is initialized with the sample that satisfies $x_{top} = \arg\max_{x \in D_{div}} f(x)$,
$x_{top} \in D_{div}$. The confident sampling strategy then selects a sample from $D_{div}$ and adds it to
$D_{\Delta}$ according to the distance metric of the current deep feature space, using
$x_{top} = \arg\max_{x \in D_{div}} \{\max_{x_i, x_j \in D_{\Delta}} d(x_i, x_j)\}$, where $d(x_i, x_j)$ is
the Euclidean distance between two samples $x_i$ and $x_j$ in the deep feature space. The CNN is
retrained using the bin sequence from the confident samples in $D_{\Delta}$. The confident samples are
partitioned into bins and stored in a bin pool denoted by $\mathbf{B} = \{B_j\}_{j \in \mathbf{B}}$, i.e.
$\mathbf{B} = \{B_0, \dots, B_j, \dots, B_J\}$.
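The selection and partitioning steps can be sketched as follows. The sketch starts from the top-scoring sample and then repeatedly adds the remaining sample that lies farthest from those already selected, until 𝛾 samples are chosen. Reading the distance criterion as a farthest-point (max-min) rule is an assumption made here for illustration, as are the feature vectors, the detector scores, and the bin size.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def select_confident_samples(features, scores, gamma):
    """Greedy selection of up to gamma sample indices.

    Assumption: after the top-scoring seed, the next sample is the one whose
    distance to its nearest already-selected sample is largest.
    """
    selected = [max(range(len(features)), key=lambda i: scores[i])]
    while len(selected) < min(gamma, len(features)):
        remaining = [i for i in range(len(features)) if i not in selected]
        farthest = max(
            remaining,
            key=lambda i: min(euclidean(features[i], features[s]) for s in selected),
        )
        selected.append(farthest)
    return selected


def partition_into_bins(indices, bin_size):
    """Split the selected indices into consecutive bins B_0, B_1, ..."""
    return [indices[k:k + bin_size] for k in range(0, len(indices), bin_size)]


# Hypothetical deep features and detector scores f(x) for five samples.
features = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (0.0, 4.0), (5.0, 0.0)]
scores = [0.9, 0.8, 0.7, 0.6, 0.5]

d_delta = select_confident_samples(features, scores, gamma=3)
bins = partition_into_bins(d_delta, bin_size=2)
print(d_delta, bins)  # [0, 2, 4] [[0, 2], [4]]
```

The near-duplicate of the seed (index 1) is never chosen, which is the low-redundancy behaviour the sampling is meant to provide.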
In each rollback bin-based SSL step, confidence scores are assigned to the pseudo-labeled samples by
the current CNN detector. The labeled data $D_0$ is used to initialize the CNN detector model $f_0$ and
the EER model $g_0$. $Acc_0$ is calculated by $f_0$ using the validation data. For each bin, we build the
CNN models $\{f_0^{B_j}\}_{j=1}^{J}$ using $D_0 \cup B_j$, respectively. Let $Acc_1$ denote the maximum
accuracy among the scores of the bins calculated by $\{f_0^{B_j}\}_{j=1}^{J}$, i.e.,
$Acc_1 = \max_{B_j}\{Acc_0^{B_j}\}$. If the performance improves, i.e., $Acc_1 \ge Acc_0$, the next step
updates $D_1 = D_0 \cup B^{*}$ and $f_1 = f_0^{B^{*}}$. At time step $i$, for each bin, we build the CNN
models $\{f_i^{B_j}\}_{j=1}^{J}$ using $D_i \cup B_j$, respectively, and
$Acc_{i+1} = \max_{B_j}\{Acc_i^{B_j}\}$. Three cases are distinguished: Case 1) $Acc_{i+1} \ge Acc_i$;
Case 2) $Acc_i - \tau < Acc_{i+1} < Acc_i$; and Case 3) $Acc_{i+1} \le Acc_i - \tau$, where $\tau$ is a
tolerance threshold for an exploration potential.
∗
Case 1: we get the best bin for the next step and update 𝐷𝑖+1 = 𝐷𝑖 ∪ 𝐵∗ and 𝑓𝑖+1 = 𝑓𝑖𝐵 ; 𝐵𝑖 = 𝐵∗ ;
Case 2: we conduct the following sub-steps: 1) find the removal samples from 𝛥𝑖 using the rollback
learning process using Eq. (3.11), 2) find the relabeling samples, and assign them new labels from 𝛥𝑖
using the rollback learning process based on Eq. (3.9), and 3) update 𝛥𝑖 by reselection using the EER
The above forward-rollback learning processes are repeated until the condition $Acc_{i+1} \geq Acc_i$
or $Acc_{i+1} \leq Acc_i - \tau$ is met, or a time limit is reached. If the condition $Acc_{i+1} \geq Acc_i$ is satisfied, we update
$D_{i+1} = D_i \cup \Delta_j$, $f_{i+1} = f_i^{\Delta_j}$, $g_{i+1} = g_i^{\Delta_j}$, and $\mathbf{B} = \mathbf{B} \setminus B_i$.
Case 3: the oracle labels the incorrectly labeled data in $B^*$, and we update $f_{i+1} = f_i$, $g_{i+1} = g_i$, $D_{i+1}$.
The rollback process of Case 2 can significantly reduce the oracle labeling steps. $(D_i \cup \Delta_i)$ is used to
build a training data set $D_{i+1}$, which is used for training $f_{i+1}$ and $g_{i+1}$ at time t. The process is repeated
until convergence. Finally, the rollback bin-based SSL produces the two models f and g, and the enlarged
labeled dataset LD. The combination of the EER based rollback learning and the bin-based SSL allows
obtaining a rapid adaptive object detector, even from the noisy streaming samples under a dynamically
$B^* = \arg\max_{B_j}\{Acc_i^{B_j}\}$, $D_{i+1} = D_i \cup B^*$;
$f_{i+1} = f_i^{B^*}$; $B_i = B^*$, and $\mathbf{B} = \mathbf{B} \setminus B_i$.
4. Else if $Acc_i - \tau < Acc_{i+1} < Acc_i$,
   While $Acc_i - \tau < Acc_{i+1} < Acc_i$,
      4.1 Remove the samples from $\Delta_i$, i.e., the removing rollback process using Eq. (5).
      4.2 Relabel the samples in $\Delta_i$, i.e., the relabeling rollback process using Eq. (7).
      4.3 Reselect the samples from $B_i$ using the forward learning process using Eq. (3).
      4.4 If $Acc_{i+1} \geq Acc_i$, $D_{i+1} = D_i \cup \Delta_j$,
          $f_{i+1} = f_i^{\Delta_j}$, $g_{i+1} = g_i^{\Delta_j}$, and $\mathbf{B} = \mathbf{B} \setminus B_i$. i++
   Else if $Acc_{i+1} < Acc_i - \tau$ or time limit, oracle labels incorrectly labeled data in $B^*$.
      $f_{i+1} = f_i$, $g_{i+1} = g_i$, $D_{i+1}$, i++.
Return $\{f = f_{i+1}, g = g_{i+1}, LD = D_{i+1}\}$
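The three-case update rule of the algorithm can be illustrated with a small control-flow sketch. It is only a sketch under stated assumptions: `train_eval` and `refine` are hypothetical callbacks standing in for CNN training plus validation and for the remove/relabel/reselect rollback sub-steps, accuracies are given as integer percentages, and `max_rollbacks` plays the role of the time limit. The toy helpers at the bottom exist only to make the example executable.

```python
def classify_step(acc_prev, acc_new, tau):
    """Map an accuracy change to Case 1, Case 2, or Case 3."""
    if acc_new >= acc_prev:
        return "forward"   # Case 1: accept the bin
    if acc_new > acc_prev - tau:
        return "rollback"  # Case 2: small drop, try to repair the bin
    return "oracle"        # Case 3: large drop, ask the oracle

def rollback_bin_ssl(bins, acc0, train_eval, refine, tau, max_rollbacks=3):
    labeled, acc, log = [], acc0, []
    pool = dict(bins)
    while pool:
        # Score every remaining bin and take the best one (B*).
        scores = {name: train_eval(labeled, samples) for name, samples in pool.items()}
        best = max(scores, key=scores.get)
        delta, new_acc = pool.pop(best), scores[best]
        case, tries = classify_step(acc, new_acc, tau), 0
        while case == "rollback" and tries < max_rollbacks:
            delta = refine(delta)  # hypothetical remove/relabel/reselect step
            new_acc = train_eval(labeled, delta)
            case, tries = classify_step(acc, new_acc, tau), tries + 1
        if case == "forward":
            labeled, acc = labeled + delta, new_acc
        else:
            case = "oracle"  # model and labeled set stay unchanged in this sketch
        log.append((best, case, tries))
    return labeled, acc, log

# Toy stand-ins: positive numbers are clean samples, negative ones are noisy.
def toy_eval(labeled, samples):
    items = labeled + samples
    return 50 + 5 * (sum(x > 0 for x in items) - sum(x < 0 for x in items))

def drop_noisy(samples):
    return [x for x in samples if x > 0]

result = rollback_bin_ssl({"a": [1, 2], "b": [3, -1, -2]}, 50, toy_eval, drop_noisy, tau=10)
```

In the toy run, bin "a" is accepted directly, while bin "b" first causes a small accuracy drop and is accepted only after one rollback pass removes its noisy samples.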
3.7 Summary
In this chapter, we have presented the rollback based ASSL method in detail with expected error
reduction sampling. The EER-ASSL utilizes active learning together with forward and rollback learning. AL
allows us to select more informative samples by measuring uncertainty and diversity. We have described
how semi-supervised learning and active learning play a major role, while rollback addresses the
critical issue of performance improvement. In the next chapter, we describe our algorithm applied for
Chapter 4
In this chapter, we present in detail the techniques of general ASSL as applied to single person
action recognition, two person interaction, and EER-ASSL based action recognition. Human action
recognition here consists of recognizing actions from still images, where simple day to day activities are
considered, such as walking, sitting, taking a photo, using a computer, etc. These types of action may include
situations where the observation of people is necessary, such as a pedestrian crossing the street, a hospital
patient in need of care, or some restricted area that requires monitoring. We further explain two types of static
human action recognition, namely one person action recognition and two person interaction recognition,
with the rollback based ASSL framework. Finally, we give the chapter summary in section 4.4.
Rollback based ASSL classifies single person actions, as we can see in figure 4.1. Each
Figure 4.1: architecture for one person action recognition
In our method we demonstrate robustness by utilizing deep features incorporated with active semi-
supervised learning, as explained below. The normalization of intensity [31] and cropping play a major
role in reducing noise in our method. At the beginning of our experiment, we remove the
portions of the image that are not useful for human action, and only preserve the ROI of human
interaction. We resize the image to 224 × 224. The same human action may vary in the
feature vector; therefore, image intensity and its differences play a significant role in action recognition.
We minimize lighting interference by altering the range of pixel intensity values, which helps to reduce contrast.
Normalization [90] transforms a color image to a grayscale image, $I$, with the intensity value range
$(Min, Max)$:
into a new image where the intensity values change to the range (𝑛𝑒𝑤𝑀𝑖𝑛, 𝑛𝑒𝑤𝑀𝑎𝑥).
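A range change of this kind is commonly implemented as a linear rescaling, $I_N = (I - Min)\,\frac{newMax - newMin}{Max - Min} + newMin$. The sketch below assumes that standard linear form; the function name `normalize_intensity` is hypothetical, and the image is represented as a plain list of rows for brevity.

```python
def normalize_intensity(image, new_min=0.0, new_max=1.0):
    """Linearly map pixel values from (Min, Max) to (new_min, new_max)."""
    flat = [p for row in image for p in row]
    old_min, old_max = min(flat), max(flat)
    if old_max == old_min:
        # A constant image has no contrast to stretch.
        return [[new_min for _ in row] for row in image]
    span = old_max - old_min
    return [
        [(p - old_min) * (new_max - new_min) / span + new_min for p in row]
        for row in image
    ]

stretched = normalize_intensity([[50, 100], [150, 250]], 0.0, 255.0)
```

Here the darkest pixel (50) maps to 0 and the brightest (250) maps to 255, with the values in between spread proportionally.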
We eliminate the illumination interference via intensity normalization using the Gaussian weighted
average [32] of light intensity, which works as follows. First, the Gaussian-weighted average of the
neighboring pixels is subtracted from each pixel value. Second, each pixel is divided by the standard
deviation of its neighborhood. Equation (4.3) represents the pixel value calculation of intensity normalization:
$X' = \frac{X - \mu_{nhgx}}{\sigma_{nhgx}}$ (4.3)
where $X'$ represents the new pixel value, $X$ is the original pixel value, $\mu_{nhgx}$ is the Gaussian weighted
average of the neighboring pixels of $X$, and $\sigma_{nhgx}$ is the standard deviation of the neighbors of $X$.
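The two steps can be sketched directly from the definition. This is an illustrative, unoptimized version under several assumptions that the text does not fix: a square window of radius 1, a Gaussian with sigma 1.0, replicated image borders, a Gaussian-weighted standard deviation, and a small `eps` to avoid division by zero in flat regions. The function names are hypothetical.

```python
import math

def gaussian_kernel(radius, sigma):
    """Normalized 2D Gaussian weights over a (2*radius+1)^2 window."""
    weights = [
        [math.exp(-(dx * dx + dy * dy) / (2 * sigma * sigma))
         for dx in range(-radius, radius + 1)]
        for dy in range(-radius, radius + 1)
    ]
    total = sum(sum(row) for row in weights)
    return [[w / total for w in row] for row in weights]

def local_normalize(image, radius=1, sigma=1.0, eps=1e-6):
    """Subtract the local weighted mean, then divide by the local std."""
    h, w = len(image), len(image[0])
    kernel = gaussian_kernel(radius, sigma)
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            vals, wts = [], []
            for dy in range(-radius, radius + 1):
                for dx in range(-radius, radius + 1):
                    yy = min(max(y + dy, 0), h - 1)  # replicate borders
                    xx = min(max(x + dx, 0), w - 1)
                    vals.append(image[yy][xx])
                    wts.append(kernel[dy + radius][dx + radius])
            mean = sum(v * k for v, k in zip(vals, wts))
            var = sum(k * (v - mean) ** 2 for v, k in zip(vals, wts))
            out[y][x] = (image[y][x] - mean) / (math.sqrt(var) + eps)
    return out

spot = local_normalize([[0, 0, 0], [0, 9, 0], [0, 0, 0]])
```

A pixel brighter than its surroundings yields a positive value and a darker one a negative value, regardless of the overall illumination level.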
We propose an active semi-supervised learning (ASSL) framework for action recognition from video
and single-picture data, as shown in figure 4.1. We assume action-specific person detection, where
each image contains person information with bounding box data stored in an extensible
markup language (XML) file. We trained on the image dataset to build our initial model. If the initial
model did not predict the right action, as verified by the SSL process, we employed an active learning method
Our proposed algorithm for single person action recognition is given in the section below. This algorithm
performs well with local and benchmark datasets. The input images are unlabeled and ordered by batch
for sequential training. The VGG16 [28] and Inception [59] pre-trained models are used with the
RMSProp [31] and Adam [30] optimizers in order to train on the action dataset for the initial model. The
current batch of images is combined with the previous batch. We apply AL where the threshold value
is over 0.9. Finally, the Action Recognition (AR) classification model performance is observed, and if
the performance is not satisfactory we consider an object detection model. If we correctly detect the
object, then we apply ensemble learning in order to improve the classification of human action and to
reduce the likelihood of an unlucky selection of a poor model. Ensemble learning usually takes multiple
models as input and integrates them to build a predictive decision model [66]. In our algorithm,
ensemble learning combines the decisions of AL, object detection, and the training datasets, and then
improves prediction performance. We considered a rollback procedure to obtain better results and to
reduce the amount of inconclusive data. In the rollback procedure, we compare the current batch results
with the previous batch results. If the current batch results show improvement, we do nothing; but if
the current batch performs poorly, we roll back to the previous result. As a consequence, a new batch
4.2. Two person interaction recognition
The flow diagram of our framework is shown in figure 4.2, where a video frame is transformed into an
image sequence, noise is removed, and the action area Region of Interest (ROI) is selected to
Unlabelled Data as $UD_i$. The training dataset labels are known as Pretrained Model ($M_{i-1}$). The
VGG16 model is trained with the dataset, and we denote the initial model as $M_0$. $N$ refers to the
maximum number in the dataset. Each unlabeled dataset is passed through model $M_0$, and the result initializes
$LD_{temp}$. In the next step, the Current Dataset ($CD_0$) is combined with $LD_{temp}$ and
assigned to $CD_{temp}$. AL is applied in the second step if the threshold limit
exceeds 0.9, and the result is denoted by $LD_i$. Subsequently, $CD_{i-1}$ is joined with $LD_{temp}$ and used to train
model $M_{temp}$. Finally, model $M_{temp}$ is compared to the previous model; if the current model's
performance is higher than that of the previous model, training continues; otherwise, the dataset is rolled
back. The rollback procedure helps to obtain better performance. If the current batch of data
does not perform well, we can return to the previous, better performing model. One of the
important features of rollback learning is that it replaces the current batch data with noise-free data.
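The batch-level rollback described above amounts to keeping a checkpoint of the last accepted model and dataset. The following is a minimal sketch of that idea only: `train` and `evaluate` are hypothetical callbacks, and the AL and SSL labeling steps that would normally precede training are omitted.

```python
def incremental_training(batches, train, evaluate):
    """Keep a batch only if the retrained model scores higher than before."""
    data, model = [], None
    best_score = float("-inf")
    history = []
    for batch in batches:
        candidate_data = data + batch
        candidate_model = train(candidate_data)
        score = evaluate(candidate_model)
        if score > best_score:
            # Forward: accept the batch and move the checkpoint.
            data, model, best_score = candidate_data, candidate_model, score
            history.append("kept")
        else:
            # Rollback: discard the batch and keep the previous model.
            history.append("rolled back")
    return model, data, history

# Toy stand-ins so the sketch runs: the "model" is just the sum of its data.
final_model, final_data, history = incremental_training(
    [[1, 2], [-5], [3]], train=sum, evaluate=lambda m: m
)
```

In the toy run, the second batch lowers the score and is discarded, so the third batch is added on top of the first one.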
4.3.1 Person Detection
Many CNN based person detectors are designed for a static data distribution and are unable to handle
drift, fast motion, and occlusion during person detection in a real world environment.
Moreover, in a dynamic environment with a complex setting, person detection becomes challenging due
to aspect ratio differences and the vanishing problem. Our proposed EER-ASSL based method (discussed in
chapter 3) can overcome this challenge, where the prevailing person detector is applied in order to
get the desired detection result. When persons show similar appearances, a possible occlusion is
investigated. We adopted the state of the art object detector YOLOv2 [92] in our EER-ASSL method to
take advantage of its person detection. Most common detectors are trained with thousands
of images in order to improve detection performance. However, labeling images for detection is very
expensive [92].
Initially, person detection is applied to find out whether a human exists in the scene.
Labeling images for classification is expensive, but unlabeled data carry almost no information related to
human action recognition. Therefore, in order to improve the person detection accuracy, we require
an SSL algorithm. Among the large volume of samples, we need to find a way to gather informative
samples that can contribute to the EER-ASSL performance. For this reason, we rely on a sampling
strategy called collaborative sampling that can reduce the sampling bias. Sampling bias or wrongly
labeled samples can hamper the detection performance. Therefore, AL helps our proposed system to
recover from wrongly labeled samples by relabeling. Moreover, expected error reduction sampling
becomes handy when only a subsample of a few hundred training examples is used to prepare the
The selected samples are divided into a number of bins in order to process the data. In each bin we apply
AL, then train and evaluate. After that, bin 1 is carried forward to bin 2. This entire process continues
until bin N. If the model performance after adding the next bin is at least as good as that of the current
bin, we consider this forward learning; otherwise it is considered rollback learning, which leads to skipping
the poorly scoring bin. The entire process continues until convergence is achieved. The EER-ASSL incremental
learning includes forward learning (discussed in chapter 3, section 3.4) and backward learning (discussed in
chapter 3, section 3.5), supported by active semi-supervised learning, in order to fine tune and update
Human actions, such as sports activities or real world events, contain abrupt changes in the environment by
comparing the localized bounding boxes. If drift is detected, we estimate the smoothing for
validation and apply the EER-ASSL updated model (shown in figure 4.3) to overcome the drift problem.
In this way we select the best possible model for person action detection. In the next step, we extract
the person features from the input image and apply smoothing for improved human action classification
performance.
4.4 Summary
In this chapter, we have presented the methods of human action recognition in detail, using the rollback
based EER-ASSL method. For accurate human action recognition, image pre-processing,
normalization, a sampling strategy such as EER sampling, and deep CNN training play important roles.
We have described in detail how semi-supervised learning and active learning, used together with
rollback, address the critical issues of human action recognition and help with performance
improvement. The experimental results, which we show in detail in the next chapter, validate our
method.
Chapter 5
In this chapter we mainly discuss the different datasets we have used in our experiments, namely the ITLab
dataset and benchmark datasets, e.g., PASCAL VOC, Stanford 40, UT-Interaction, KTH, and HMDB. On the ITLab
dataset we show single person action recognition and two person interaction recognition in
different environments, simple and complex. Later we include EER-ASSL based action
recognition. We present how our method outperforms other state of the art methods using
the Effective Adaptive Deep Learning (EADL) algorithm and the Effective Hybrid Learning (EHL)
algorithm mentioned in Chapter 3. Later we show the evaluation results and compare them to
The datasets for human action recognition differ from each other due to dissimilar annotation
structures. At the same time, manual annotation with ground truth is time-consuming and tedious work.
Some of the available activity recognition datasets are listed below. We conduct our experiment on
three datasets: the PASCAL VOC Action [93], Stanford 40 Actions [94], and ITLab action datasets.
A total of 10 classes of human action are included in the PASCAL VOC action dataset:
Jumping, Phoning, Playing an Instrument, Reading, Riding a Bike, Riding a Horse, Running, Taking a
Photo, Using a Computer, and Walking. Among them, we selected the five actions that are similar to
those in the ITLab action dataset in order to allow a better evaluation. The training
and validation data have 11,530 images containing 27,450 ROI annotated objects. For this experiment we
Figure 5.1: Five examples of action categories, (a) to (e), respectively: jumping, phoning, taking a photo,
using a computer, and reading, from the PASCAL VOC 2012 action recognition dataset. These classes
There are 40 different human actions included in the Stanford dataset, with a total of 9532 images
and 180 to 300 images per action class. The actions are applauding, playing a violin, blowing
bubbles, pouring liquid, brushing teeth, pushing a cart, cleaning the floor, reading, climbing, riding a
bike, cooking, riding a horse, cutting trees, rowing a boat, cutting vegetables, running, drinking,
shooting an arrow, feeding a horse, smoking, fishing, taking photos, fixing a bike, texting a message,
fixing a car, throwing a Frisbee, gardening, using a computer, holding an umbrella, walking a dog,
jumping, washing dishes, looking through a microscope, watching TV, looking through a telescope, waving
hands, phoning, writing on a board, playing a guitar, and writing in a book. Among them, we select only
five actions, which are similar to those in the IT Lab dataset. In order to train the network we use 1160 images
(a) Jumping (b) Phoning (c) Reading (d) Taking photo
photo and using a computer from the Stanford 40 action dataset. Amongst 40 actions, five actions are
selected.
5.1.4 IT Lab Action dataset
The Intelligent Technology Lab (IT Lab) action dataset consists of five different human actions, namely
jumping, phoning, reading, taking a photo, and using a computer, in two environments. We
train on the IT Lab dataset, which consists of 1123 images where the actions are labeled with bounding
box information. We considered two environments for data gathering: simple, with a solid background,
and cluttered, with a noisy background. We made 20 video clips in which each action is represented. We
extract 1000 JPEG images from each video clip. Each batch has 50 images, which includes 10
Using Computer Jumping Reading Phoning Taking a photo
Figure 5.4: IT Lab action dataset with complex backgrounds. Here backgrounds are cluttered with
different objects.
architecture [95], which works extremely well in object classification in the ImageNet challenge [40].
We construct a VGG16 network [20], which showed a remarkable improvement over previous state-of-
the-art methods [96]. Our experiment uses the VGGNet network [20], which has five convolutional blocks and
three fully connected layers, and the Inception network [59]. We use the Faster RCNN [34] object detector
for person detection, and our framework is implemented on the deep learning library Caffe [97]. All
implementations ran on a single server with the CUDA deep neural network library (cuDNN) [98] and a single
NVIDIA GeForce GTX 970. We use Matlab 2017b with the MatConvNet [99] library on both the Windows
7 and Ubuntu 14.04 operating systems in our experiment. We train our initial model for 50k iterations
5.1.6 Evaluation
Action classification presumes knowledge of the ground truth location of the person
at test time. For action classification evaluation, we used the evaluation criteria defined by
the PASCAL VOC action task, which computes the Average Precision (AP) on the ground truth test
boxes. For the evaluation, we consider five kinds of human actions: Jumping, Phoning, Reading,
Taking Photo, and Using Computer, where the background is solid and noise-free, known as the simple
environment. We use a threshold value of 0.9 for AL. We evaluate the proposed
framework via human action recognition on the IT Lab action dataset, the Stanford 40 Actions
dataset, and PASCAL VOC 2012. The IT Lab dataset contains these five actions in two different
environments: one with simple backgrounds (free from noise) and the other with complex backgrounds
(cluttered environments).
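For reference, a per-class Average Precision can be computed from ranked classifier scores as sketched here. This is the plain, non-interpolated form (the mean of the precision values at the ranks of the true positives); the official PASCAL VOC evaluation uses an interpolated precision-recall curve, so values from this sketch may differ slightly from the devkit. The function name is hypothetical.

```python
def average_precision(scores, labels):
    """Non-interpolated AP for one action class.

    scores: classifier confidence per test box.
    labels: 1 if the box shows the action, else 0.
    """
    ranked = sorted(zip(scores, labels), key=lambda t: t[0], reverse=True)
    hits, precisions = 0, []
    for rank, (_, is_positive) in enumerate(ranked, start=1):
        if is_positive:
            hits += 1
            precisions.append(hits / rank)  # precision at this recall level
    return sum(precisions) / len(precisions) if precisions else 0.0

ap = average_precision([0.9, 0.8, 0.7, 0.6], [1, 0, 1, 0])
```

In the example, the two positives are found at ranks 1 and 3, giving precisions 1 and 2/3, so the AP is their mean, 5/6.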
Table 5.1 shows action-recognition image accuracies in the IT Lab environment. The performance
improves with a higher number of images. In our experiment, the correctly classified human actions,
such as jumping. “Taking a photo” had the worst detection rate, compared to the other actions due to
Table 5.1: Number of images used for training on the IT Lab Action dataset
Table 5.1 shows the number of images used for IT Lab dataset training on human action recognition, as
well as testing.
Figure 5.5: Human action recognition performance on the IT Lab dataset, which uses both semi-
supervised learning (SSL) and active learning (AL) steps. In our experiment, each step consists of two
sub-steps: AL and SSL. Figure 5.5 shows that SSL attained 56% accuracy, while AL attained 88%
accuracy. In step two, with a combined dataset and more iterations, SSL and AL reached 92% accuracy.
Fig. 5.6 shows the number of epochs with training and validation errors.
Figure 5.6: Shows the parameter performance which includes learning rate, number of epochs and
Here, performance on the second dataset reached around 92%. We generally use 50 iterations; here only 5
iterations are used, which reduces the batch dataset training time.
Table 5.2 shows a comparison with state-of-the-art results from the PASCAL VOC 2012 dataset for
action detection. Our approach shows a significant gain over the best results reported in the literature.
Table 5.2: Performance with five different actions from the PASCAL VOC 2012 dataset
Methods                     Phoning  Reading  Taking a photo  Using a computer  Jumping  mAP
RCNN [100]                  72.6     74.0     83.3            87.0              88.7     84.9
Action Mask [101]           72.1     69.9     73.3            92.3              85.5     82.2
R*CNN [62]                  79.9     82.2     85.3            94.0              88.9     87.9
Whole & Parts [102]         61.2     66.7     74.7            79.5              84.5     80.4
Part based Network [103]    86.1     87.4     86.5            92.4              88.2     88.8
Part Action Network [103]   86.9     88.5     87.5            92.4              89.6     90.0
Our approach                88.0     92.0     88.0            94.0              96.0     91.6
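As a quick check of the last row, the mAP of our approach is the arithmetic mean of its five per-class AP values. The snippet uses only the numbers from that row; the helper name `mean_ap` is illustrative.

```python
def mean_ap(per_class_ap):
    """Arithmetic mean of per-class Average Precision values."""
    return sum(per_class_ap.values()) / len(per_class_ap)

ours = {
    "phoning": 88.0,
    "reading": 92.0,
    "taking a photo": 88.0,
    "using a computer": 94.0,
    "jumping": 96.0,
}
print(round(mean_ap(ours), 1))  # 91.6
```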
Our method performs poorly on phoning and taking a photo due to human action similarity; however,
Figure 5.8: Human Action performance analysis comparing the proposed Effective Adaptive Deep
In figure 5.8, we show the performance difference between the baseline and the Effective Adaptive Deep
Learning method.
Each group of bars shows the performance of human action recognition on the PASCAL VOC dataset
for one of the five actions. Each group has two bars, corresponding to the baseline accuracy and the
accuracy of the proposed Effective Adaptive Deep Learning method. On top of the single CNN model, we include
ensemble learning to further enhance performance, as the single model failed to obtain satisfactory
performance. In our case, we kept separate models for objects and humans. In order to obtain
combined performance, we independently trained multiple differently initialized CNNs and output
their training responses. To train the ensemble weights, a loss was defined on the weighted ensemble
response, and the weights were optimized to minimize the loss, as shown in figure 4.6. For testing, the
Table 5.3: comparison between the IT Lab dataset and the Stanford 40 Actions dataset.
Approach Accuracy (%)
Test set (50,0) Test set (50,16) Test set (50, 33) Test set (50,49) Test set (50, 66) Test set (0,66)
In Table 5.3, Test set (P, Q) means the test data that combines P test images from the IT Lab action dataset
and Q test images from the Stanford 40 Actions dataset. Here, the IT Lab dataset has a
simple background, which means no cluttered background. The number of training images is 200, and the
number of test images is 50, with 10 images for each action. Similarly, we use the Stanford 40 Actions
dataset, where the training data total 1160 images for five actions and the test data total 66. Our training
parameters were as follows: batch size: 64; total number of epochs: 300; learning rate: 0.001.
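The mixed test sets can be assembled as sketched below. The (P, Q) pairs are the ones listed for Table 5.3; the sampling procedure, the fixed seed, and the name `build_test_set` are assumptions made for illustration, since the way individual images were chosen is not specified.

```python
import random

def build_test_set(itlab_images, stanford_images, p, q, seed=0):
    """Combine p IT Lab test images with q Stanford 40 test images."""
    rng = random.Random(seed)
    return rng.sample(itlab_images, p) + rng.sample(stanford_images, q)

# Placeholder identifiers standing in for the 50 and 66 test images.
itlab = [("itlab", i) for i in range(50)]
stanford = [("stanford40", i) for i in range(66)]

mixes = [(50, 0), (50, 16), (50, 33), (50, 49), (50, 66), (0, 66)]
test_sets = {mix: build_test_set(itlab, stanford, *mix) for mix in mixes}
```

Each entry of `test_sets` then holds P + Q images, from the pure IT Lab set (50, 0) to the pure Stanford 40 set (0, 66).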
The results in Table 5.4 show a comparison of different state-of-the-art approaches using the IT Lab
Action expression dataset and the Stanford 40 Actions dataset. Notice that our method outperforms the
other techniques using the mixed test data from the IT Lab and Stanford 40 Actions datasets. Our
method obtains the lowest accuracy for test set (0, 66) and the best accuracy for test set (50, 0),
Table 5.4: Recent different methods and performance results on the Stanford 40 dataset
Table 5.4 demonstrates the performance of our method on the PASCAL VOC 2012 dataset. Here, our
proposed method is trained with the Inception model and the Adam optimizer. We compare our results
against [96], which shows that the action recognition method with Effective Adaptive Deep Learning
based ensemble learning yields superior performance, 90.9 mean AP, on the PASCAL VOC 2012 dataset.
conditional data of various numbers. We check the performance improvement with additional data in
the training set. The two person interaction recognition experiment is carried out on two benchmarks, the
UT-Interaction dataset [104] and HMDB, a large human motion database [80], and on the ITLab action
dataset, to check our framework's general performance. We use these two benchmark datasets because
both have similar human actions, such as hugging and fighting.
There are six categories of human interactions in the UT-Interaction dataset: shake-hands, point,
hug, push, kick, and punch [104]. There are a total of 20 video clips of approximately 1 minute in length.
Each video clip includes a distinct background, scale, and illumination. Among these categories, hug and
punch are similar to the actions in our work. We convert the UT-Interaction dataset video files into image
sequences to train the network models and use 835 images from them.
categorized into five classes: 1) general facial actions, 2) facial actions with object manipulation, 3)
general body movements, 4) body movements with object interaction, and 5) body movements for human
interaction. Among them, for experimental evaluation, we consider hug and punch. We use 1133
Figure 5.10: Example of the HMDB dataset, where the action categories are punching and hugging.
arms, Talking, and Kidnapping. We use two environments for two person interaction recognition data
gathering: a relatively simple one with a fair background, and one with a cluttered background. We recorded 20 video
clips in which each action is included. We extracted 1000 JPEG images from each video clip.
5.2.4 Experiment Setup
We use the VGG [20] and Inception [59] networks pre-trained on ImageNet for a 1000-category
classification task. We also use adaptive gradient algorithms, namely RMSProp [31] and Adam [30], for
optimization. We use a hardware and library environment similar to the one used for single
person action recognition, such as cuDNN (Deep Neural Network library) [35], Matlab 2017a with the
5.2.5 Evaluation
For the evaluation of our framework, we consider five types of actions: Hugging, Fighting, Linking
arms, Talking, and Kidnapping. From the 20,000 video frames, we select noise-free images from the
unlabeled data and then incrementally update the model. We use a threshold value of 0.9 for
AL.
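One way to read the 0.9 threshold is as a confidence cut that separates predictions trusted as pseudo-labels from those passed on for manual labeling. The sketch below follows that reading, which is an assumption on our part rather than a statement of the exact rule used; `split_by_confidence` and the sample identifiers are hypothetical.

```python
def split_by_confidence(predictions, threshold=0.9):
    """Split (sample_id, label, confidence) triples at the threshold.

    Assumption: confidence above the threshold is accepted as a pseudo-label,
    everything else is queued for the annotator.
    """
    confident, uncertain = [], []
    for sample_id, label, confidence in predictions:
        target = confident if confidence > threshold else uncertain
        target.append((sample_id, label))
    return confident, uncertain

preds = [
    ("frame_001", "hugging", 0.97),
    ("frame_002", "fighting", 0.62),
    ("frame_003", "talking", 0.91),
    ("frame_004", "kidnapping", 0.90),
]
pseudo_labeled, to_annotate = split_by_confidence(preds)
```

With these toy predictions, two frames are accepted and two are queued; a confidence of exactly 0.90 does not exceed the threshold.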
Figure 5.11: Two-person action in simple environments of the ITLab action dataset.
We create four training data sets and one test data set for the simple environment. The training dataset
and the test dataset are labeled before training. Performance is shown in figure 5.11, where the initial
training set's performance is not good (around 44% accuracy) due to the noisy dataset. As active
learning is applied with an updated model, performance gradually increases. For the second dataset, AL
performance is merely 52%, and similarly, after the third dataset, accuracy reaches 78%. After the final
dataset training is complete, the performance reaches up to 96%. We find that the best case, with the
training and validation sets using a simple background, is 98%; however, on average, performance is
(a) Hugging (b) Fighting (c)Linking arms (d) Talking (e)Kidnapping
For the complex-background dataset, where the background is noisy, we consider four training sets and one
test set. The training and test datasets are labeled before training, as with the simple
environment. Performance results are shown in figure 5.13, where the initial training set's performance
is around 67% accuracy. After active learning is applied to the second set, accuracy increases to 75%,
and it gradually reaches 77% for the third set. The performance reaches up to 81% accuracy after the
Figure 5.14: The performance chart based dataset with complex environment.
We divided the entire simple-environment dataset into k equal-sized subsets. A single subset is retained
as the validation data for testing the model, and the remaining k-1 subsets are used as training data.
The cross validation is repeated k times, with each of the k subsets used once as the validation set, and
the results are averaged to produce a single estimate. We acquired the results shown in Table 5.5 after
k-fold cross validation,
Table 5.5: Cross validation (k-fold) based on a simple environment dataset setting.
Training subsets   Validation subset   Accuracy
6, 7, 8, 9         10                  92%
7, 8, 9, 10        6                   96.5%
6, 8, 9, 10        7                   94%
6, 7, 9, 10        8                   98%
6, 7, 8, 10        9                   98%
The corresponding average accuracy is shown in Table 5.5 for all four ITLab training datasets.
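The fold structure of Table 5.5 and the mean of its five accuracies can be reproduced as follows. The subset identifiers 6 to 10 and the accuracy values are taken from the table; the helper name `k_fold_splits` is illustrative.

```python
def k_fold_splits(subset_ids):
    """Return (training_subsets, validation_subset) for every fold."""
    return [
        ([s for s in subset_ids if s != held_out], held_out)
        for held_out in subset_ids
    ]

folds = k_fold_splits([6, 7, 8, 9, 10])

# Accuracy (%) keyed by the held-out validation subset, as listed in Table 5.5.
fold_accuracy = {10: 92.0, 6: 96.5, 7: 94.0, 8: 98.0, 9: 98.0}
mean_accuracy = sum(fold_accuracy.values()) / len(fold_accuracy)
print(round(mean_accuracy, 1))  # 95.7
```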
Figure 5.15 shows a confusion matrix for two-person interaction in a complex environment, where the
first row indicates hugging, with true positive classification at 80% and around 20% misclassified
as kidnapping. The second row corresponds to fighting, where 90% is true positive
classification and 10% is misclassified as linking arms. The third and fourth rows show linking arms
Table 5.6 shows the results of our proposed method, using the ASSL framework for both simple and
complex environments.
Table 5.7 shows the efficiency of the adaptive Adam optimizer [30] using the VGG model. There are six
action classes in the UT-Interaction dataset, in two different sets, set 1 and set 2. The precision,
recall, and F-measure demonstrate that our method outperforms other techniques [73].
Table 5.8 displays the efficiency of the adaptive gradient algorithm RMSProp [31] using the VGG
model. Hug and Point perform well among all the human actions. The precision, recall, and F-measure
show the effectiveness of our method compared to state of the art techniques [73].
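For clarity, the three measures reported in these tables follow the usual definitions: precision = TP / (TP + FP), recall = TP / (TP + FN), and the F-measure is their harmonic mean. A small sketch with hypothetical counts (not taken from our experiments):

```python
def precision_recall_f1(tp, fp, fn):
    """Per-class precision, recall, and F-measure from raw counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical example: 8 correct "hug" detections, 2 false alarms, 2 misses.
p, r, f = precision_recall_f1(tp=8, fp=2, fn=2)
```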
Table 5.9 shows the different techniques compared and their results on the UT-Interaction
We evaluated the performance on the HMDB51 dataset on a limited scale, using hug and fight, because
many of the actions involve a single person or are poorly visible. In Table 5.10, “Test set (X, Y)” means the
test data that combines X test images from the ITLab action dataset and Y test images from the HMDB51
dataset. The lowest accuracy, for Test set (0, 1050), is 69%, which is better than Varol et al. [63]. We
obtain the best performance on Test set (200, 0) with the VGG pre-trained model and the RMSProp
optimizer.
Extensive experiments are conducted using benchmark datasets such as the KTH action dataset and the
ITLab dataset. The performance is compared with recent popular methods such as multi-label
single NVIDIA TITAN X with cuDNN [105], Python 3.6, and TensorFlow [108]. For this experiment we
waving, jogging, running and walking. There are 25 subjects, recorded with a static camera at
25 fps, where the resolution is 160×120 pixels. All sequences were taken over homogeneous
backgrounds. We divide these images into two sets, training and test. Our action recognition model is
Hand waving Jogging Running
Using the local noisy dataset and the benchmark dataset, we compare EER-ASSL and SSL; EER-ASSL shows
much improved results. We used the Adam optimizer with a learning rate of 0.001; the results are shown in
figure 5.17.
Figure 5.17: The information flow of the rollback based EER-ASSL and SSL
Using the local dataset images, we compare EER-ASSL and SSL; EER-ASSL shows improved results on the
local and benchmark datasets. In the beginning, EER-ASSL shows lower performance, but it gradually
improves, specifically after the fourth bin. On the other hand, SSL remains very
5.3.2 Comparison with the state of the art methods on the KTH action dataset
EER-ASSL based human action recognition is compared with several other methods. We divide the
dataset in an 80:20 ratio, where 80% of the data is used for training and 20% for testing.
For each video clip, 450 images are extracted. For the test dataset, a similar number of images is
extracted.
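An 80:20 split of this kind can be written as below. The sketch splits whatever list it is given; whether the split should be made per video clip or per extracted frame is not specified here, so the clip identifiers, the seed, and the name `split_80_20` are placeholders.

```python
import random

def split_80_20(items, train_fraction=0.8, seed=0):
    """Shuffle once with a fixed seed, then cut at the training fraction."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

clips = [f"clip_{i:03d}" for i in range(100)]  # placeholder clip identifiers
train_clips, test_clips = split_80_20(clips)
```

Splitting at the clip level keeps frames of the same recording on one side of the split, which avoids near-duplicate frames appearing in both training and test data.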
Methods KTH action dataset
BoW(STIP) [109] 91.8%
ML-HDP(STIP)[107] 91.3%
ML-HDP (IDT)[107] 95.8%
ML-HDP(sTDD)[107] 94.1%
ML-HDP(IDT+sTDD) [107] 96.5%
Ours(EER-ASSL for Action recognition) 96.9%
Table 5.11 demonstrates the performance of our method on the KTH dataset. Here, our proposed method
is trained with the VGG model and the Adam optimizer. We compare our results against ML-HDP [107] and BoW
(STIP) [109], which shows that the EER-ASSL based adaptive learning captures better features and
5.4 Summary
In this chapter, we have presented experimental results of single person human action recognition, two
person interaction recognition and EER-ASSL based human action recognition which validate our
Chapter 6
We have seen great advancement in the area of computer vision in the last few years. In particular,
human action recognition performance has improved. Consider the
ImageNet Large Scale Visual Recognition Challenge, where the top-5 error was reduced to 3% in 2016
[37]. Similarly, classification accuracy in tasks such as human action recognition has improved greatly. In this
dissertation, we developed several methods, especially EER-ASSL, that improve human action
classification performance.
video data. Our key findings include: 1) adapting deep learning with an emphasis on active
learning, which allows us to select correct samples for better human action recognition; 2) applying the
rollback process combined with ensemble learning to deal with noisy data for efficient human action
recognition. This thesis introduces an Active Semi-Supervised Learning framework that improves the
learning task and reduces the error rate. Our experiments performed well on both single person action
6.2 Contributions
6.2.1 Single person Action recognition
Single person actions include using a computer, jumping, phoning, reading, and taking a photo in
simple and cluttered environments. We use deep learning to train on human actions from a large number
of benchmark dataset images to improve human action classification accuracy. Two major issues
when working with a deep learning framework are the number of labeled images and the variations in
pose. We propose an improved approach to ensemble learning, which combines an object detector and
incremental active learning (AL) to tackle human action recognition with limited training examples,
because AL alone is not sufficient to improve performance, owing to the variety of poses and
backgrounds. Our proposed ASSL framework, combined with ensemble learning, improves action
recognition performance and reduces the burden of human labeling. We validated our
experiments on benchmark datasets (PASCAL VOC 2012 and the Stanford 40 Actions dataset). On these
datasets, the proposed algorithm produces state-of-the-art results and outperforms other approaches. Our
range of human action recognition experiments shows that the ASSL-based technique can be applied in many
different domains, such as surveillance systems, patient monitoring, and sports video analysis.
6.2.2 Two-person interaction recognition
The experimental results in the previous chapter show that our proposed method outperforms many other
techniques for two-person interaction recognition. We are in the process of making our method
more effective, so that it not only combines SSL and AL to tackle the issue but also uses only partial training
examples. This framework decreases the human labeling time. We presented a distinctive dataset
with an experimental methodology to back our work, and we also compared it with benchmark
datasets. Our experiments show good performance on the benchmark action recognition datasets
UT-Interaction [104] and HMDB51 [80], but we plan to improve the accuracy of the proposed method with
6.2.3 EER-ASSL based human action recognition
We propose a rapid, adaptive, deep learning based EER-ASSL framework that can recognize human
activities such as sports events or motion-based activities in noisy environments. This is challenging
work because of the lack of labeled data, the drift problem, large feature vectors, and the large volume of raw video
files. Our experiments mainly covered one popular benchmark dataset, KTH action, and the ITLab dataset.
Finally, this thesis provides an approach to using incremental active learning with the aim of
improving performance. These hypotheses serve as foundational work for further research in human
action recognition in videos. Based on our method, a number of new problems can be targeted,
including event detection and security surveillance. Future work may include the
investigation of human action recognition using less labeled data. Looking at the results, we see that a
deep convolutional neural network is suitable for human action recognition on these benchmark
datasets, but the deep CNN still requires much more fine-tuning with a large-volume dataset in order to
Publications
Journal Publications
1. Minhaz Uddin Ahmed, Kim Jin Woo, Kim Yeong Hyeon, Md. Rezaul Bashar, Phill Kyu Rhee,
Wild Facial Expression Recognition Based on Incremental Active Learning. Cognitive Systems
Research, Elsevier, 2018
2. Minhaz Uddin Ahmed, Yeon Kim Yeongh, Jin Woo Kim, Md Rezaul Bashar, and Phill Kyu Rhee,
Two person Interaction Recognition Based on Effective Hybrid Learning, KSII Transactions on Internet
and Information Systems, February 28, 2019
3. Minhaz Uddin Ahmed, Kim Yeong Hyeon, Md. Rezaul Bashar, Phill Kyu Rhee, “Effective Human
Action Recognition Using Adaptive Deep Learning,” Multimedia Tools and Applications, Springer.
(Under review)
4. Dong Kyun Shin, Minhaz Uddin Ahmed and Phill Kyu Rhee “Incremental Deep Learning for
Robust Object Detection in Unknown Cluttered Environments” IEEE Access journal. 2018
5. Phill Kyu Rhee, Enkhbayar Erdenee, Shin Dong Kyun, Minhaz Uddin Ahmed, Songguo Jin,
Active and semi-supervised learning for object detection with imperfect data, Cognitive Systems
Research, Elsevier, 2017
6. Miyoung Nam, Minhaz Uddin Ahmed, Yan Shen, and Phill Kyu Rhee, “Mouth Tracking for
Hands-free Robot Control Systems” International Journal of Control, Automation, and Systems, vol.
12, no. 3, pp.628-636, 2014
7. Shin, Hak-Chul; Shen, Yan; Khim, Sarang; Sung, WonJun; Ahmed, Minhaz Uddin; Hong,
Yo-Hoon; Rhee, Phill-Kyu; “Performance Improvement of Eye Tracking System using Reinforcement
Learning” KOREA SCIENCE journal, 2013
Conference Publications:
Minhaz Uddin Ahmed, Kim Jin Woo, Miyoung Nam, Md Rezaul Bashar and Phill Kyu Rhee, “A
Deep Learning based approach for Human Action Recognition” 1st International Conference on
Machine Learning and Data Engineering (iCMLDE), 2017
Kim Jin Woo, Minhaz Uddin Ahmed, Md Rezaul Bashar and Phill Kyu Rhee, “High Efficiency
Facial Expression Recognition based Active Semi-Supervised Learning” 1st International Conference
on Machine Learning and Data Engineering (iCMLDE), 2017
Shibo Han, Minhaz Uddin Ahmed, Phill Kyu Rhee, “Monocular SLAM and Obstacle Removal for
Indoor Navigation” International Conference on Machine Learning and Data Engineering (iCMLDE),
2018
Appendix A: Active semi-supervised learning tool for human action recognition.
The common dataset merging tool that is used to combine two similar batch datasets, mostly for training
After each round of dataset training, performance is evaluated with the SSL prediction. Figure A.2
shows that the load model button loads the trained model and the load IMDB button loads the
batch dataset we want to evaluate. Here the evaluation score is 60, which is shown in the display by
asterisk marks. If the prediction result is less than the threshold value and contains false classifications, we
apply AL.
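The selection rule described above can be sketched as follows. This is a minimal illustration and not the tool's actual implementation: the function name `split_by_confidence`, the sample identifiers, and the scores are hypothetical, and the default threshold of 0.9 is borrowed from the value reported later in this appendix. The sketch routes a sample to human labeling on its prediction score alone; the additional check for false classifications is omitted.

```python
from typing import Dict, List, Tuple


def split_by_confidence(
    scores: Dict[str, float], threshold: float = 0.9
) -> Tuple[List[str], List[str]]:
    """Split samples into (confident, needs_labeling) by prediction score.

    Samples scoring below the threshold are sent to active learning (AL),
    i.e. queued for a human label; the rest keep their predicted label.
    """
    confident: List[str] = []
    needs_labeling: List[str] = []
    for sample_id, score in scores.items():
        if score < threshold:
            needs_labeling.append(sample_id)
        else:
            confident.append(sample_id)
    return confident, needs_labeling


# Hypothetical prediction scores for four images.
scores = {"img_001": 0.97, "img_002": 0.42, "img_003": 0.90, "img_004": 0.88}
confident, needs_labeling = split_by_confidence(scores)
print("keep prediction:", confident)
print("send to AL:", needs_labeling)
```

A score exactly equal to the threshold is kept here, matching the "less than threshold" wording; whether the tool treats the boundary the same way is an assumption.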
Figure A.3 Using the active learning tool for human action recognition. Here the action is using a
computer.
In Figure A.3, the grid shows the prediction score for each human action. If the prediction score is
less than the threshold value, we apply AL. Then we train on these datasets. The entire process continued for
We adopted the pre-trained VGG16 network model discussed in Chapter 2, Section 2.1.5.1, which
performs well on a range of classification tasks. We fine-tune our network and find that, during training, about
50,000 iterations produce optimal performance. We set the learning rate to 0.0001 and the batch size to
8. For AL human labeling, we use a threshold value of around 0.9 for optimal performance.
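For reference, the settings above can be collected in one place, as in the sketch below. The class name `FineTuneConfig`, its field names, and the dataset size in the usage lines are hypothetical and are not part of the MatConvNet-based tool. The derived figures assume that one iteration processes one mini-batch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FineTuneConfig:
    """Hyperparameters reported in this appendix (field names are illustrative)."""

    backbone: str = "VGG16"
    max_iterations: int = 50_000
    learning_rate: float = 1e-4
    batch_size: int = 8
    al_threshold: float = 0.9

    def images_seen(self) -> int:
        # Assumes one iteration consumes exactly one mini-batch.
        return self.max_iterations * self.batch_size

    def approx_epochs(self, dataset_size: int) -> float:
        # Number of passes over a dataset of the given (hypothetical) size.
        return self.images_seen() / dataset_size


cfg = FineTuneConfig()
print(cfg.images_seen())           # images processed over the full run
print(cfg.approx_epochs(10_000))   # passes over a hypothetical 10,000-image set
```

Under that assumption, 50,000 iterations at a batch size of 8 correspond to 400,000 image presentations.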
Figure A.4 Training the human action dataset using the ASSL tool based on the MatConvNet library [35].
Here the first dataset shows poor performance.
In this situation, AL plays a pivotal role in identifying noisy datasets. Eliminating noisy data and
reorganizing it for training is an important part of performance improvement. Apart from
the dataset, other parameters such as the number of iterations, the learning rate, and the pre-trained model
Figure A.5 shows the training performance on four sets of data: set 1 and set 2 produce the same
result and set 3 improves, but set 4 produces a poor result, so training rolls back to set 3.
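The rollback behavior in Figure A.5 can be illustrated with a small sketch. It is a simplified stand-in, not the thesis implementation: the function name `incremental_train_with_rollback`, the callbacks `train_step` and `evaluate`, and the toy numbers are all hypothetical. A batch is accepted only if the model trained on it scores at least as well as the best model so far; otherwise the candidate is discarded and training continues from the last accepted model.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

M = TypeVar("M")  # model type
B = TypeVar("B")  # batch (data set) type


def incremental_train_with_rollback(
    model: M,
    batches: Sequence[B],
    train_step: Callable[[M, B], M],
    evaluate: Callable[[M], float],
) -> Tuple[M, float, List[int]]:
    """Train on batches one at a time, rolling back when the score drops."""
    best_model = model
    best_score = evaluate(model)
    accepted: List[int] = []
    for index, batch in enumerate(batches):
        candidate = train_step(best_model, batch)
        score = evaluate(candidate)
        if score < best_score:
            # Rollback: drop the candidate and keep the last accepted model.
            continue
        best_model, best_score = candidate, score
        accepted.append(index)
    return best_model, best_score, accepted


# Toy stand-in: the "model" is its own accuracy in percent, and each batch
# is the change in accuracy that training on it would cause.
deltas = [0, 0, 5, -10]  # sets 1-2 unchanged, set 3 improves, set 4 degrades
final_model, final_score, accepted = incremental_train_with_rollback(
    60, deltas, train_step=lambda m, d: m + d, evaluate=lambda m: m
)
print(final_score, accepted)
```

In this toy run the fourth set lowers the score, so the model accepted after the third set is retained. In practice the rejected set would presumably be re-examined (for example, relabeled through AL) rather than simply dropped; that step is outside this sketch.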
Across the whole dataset, when the performance on the current dataset is not satisfactory, the rollback
Figure A.6 shows the training performance after incremental learning. Here the performance reaches
After incremental training, when the pre-trained model reaches saturation, more training has no effect
Bibliography
[1] O. Russakovsky et al., “ImageNet Large Scale Visual Recognition Challenge,” Int. J. Comput.
Vis., vol. 115, no. 3, pp. 211–252, 2015.
[2] T. Y. Lin et al., “Microsoft COCO: Common objects in context,” Lect. Notes Comput. Sci.
(including Subser. Lect. Notes Artif. Intell. Lect. Notes Bioinformatics), vol. 8693 LNCS, no.
PART 5, pp. 740–755, 2014.
[3] D. Murray and A. Basu, “Motion tracking with an active camera - Pattern Analysis and
Machine Intelligence, IEEE Transactions on,” vol. 16, no. 5, 1994.
[4] C.-M. Huang, Y.-R. Chen, and L.-C. Fu, “Real-time object detection and tracking on a
moving camera platform,” 2009 ICCAS-SICE, pp. 717–722, 2009.
[5] S.-R. Ke, H. Thuc, Y.-J. Lee, J.-N. Hwang, J.-H. Yoo, and K.-H. Choi, A Review on Video-
Based Human Activity Recognition, vol. 2, no. 2. 2013.
[6] L. Gorelick, M. Blank, E. Shechtman, M. Irani, and R. Basri, “Actions as Space-Time Shapes,”
IEEE Trans. Pattern Anal. Mach. Intell., vol. 29, no. 12, pp. 2247–2253, 2007.
[7] D. G. Lowe, “Distinctive image features from scale-invariant keypoints,” Int. J. Comput. Vis.,
vol. 60, no. 2, pp. 91–110, 2004.
[8] N. Dalal and B. Triggs, “Histograms of Oriented Gradients for Human Detection.”
[9] S. Sedai, M. Bennamoun, and D. Huynh, “Context-based appearance descriptor for 3D human
pose estimation from monocular images,” DICTA 2009 - Digit. Image Comput. Tech. Appl., no.
January, pp. 484–491, 2009.
[10] A. Veeraraghavan, A. K. Roy-Chowdhury, and R. Chellappa, “Matching shape sequences in
video with applications in human movement analysis,” IEEE Trans. Pattern Anal. Mach. Intell.,
vol. 27, no. 12, 2005.
[11] T. V. Duong, H. H. Bui, D. Q. Phung, and S. Venkatesh, “Activity recognition and abnormality
detection with the switching hidden semi-Markov model,” Proc. - 2005 IEEE Comput. Soc.
Conf. Comput. Vis. Pattern Recognition, CVPR 2005, vol. I, pp. 838–845, 2005.
[12] Y. Du, F. Chen, and W. Xu, “Human interaction representation and recognition through motion
decomposition,” IEEE Signal Process. Lett., vol. 14, no. 12, pp. 952–955, 2007.
[13] C. Schuldt, I. Laptev, and B. Caputo, “Recognizing Human Actions: A Local SVM
Approach,” Proc. 17th Int. Conf. Pattern Recognition (ICPR 2004), vol. 3, pp. 32–36, 2004.
[14] S. Prakash Sahoo and S. Ari, “On an algorithm for Human Action Recognition,” Expert Syst.
Appl., 2018.
[15] I. Laptev, “On space-time interest points,” in International Journal of Computer Vision, 2005.
[16] C. Feichtenhofer, A. Pinz, and A. Zisserman, “Convolutional Two-Stream Network Fusion for
Video Action Recognition,” CVPR, pp. 1933–1941, 2016.
[17] M. Hasan and A. K. Roy-Chowdhury, “A Continuous Learning Framework for Activity
Recognition Using Deep Hybrid Feature Models,” IEEE Trans. Multimed., vol. 17, no. 11, pp. 1909–1922,
2015.
[18] T. Wang, Y. Chen, M. Zhang, J. I. E. Chen, and H. Snoussi, “Internal Transfer Learning for
Improving Performance in Human Action Recognition for Small Datasets,” IEEE Access, vol.
5, pp. 17627–17633, 2017.
[19] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet Classification with Deep
Convolutional Neural Networks,” Adv. Neural Inf. Process. Syst., pp. 1–9, 2012.
[20] K. Simonyan, A. Vedaldi, and A. Zisserman, “Deep Inside Convolutional Networks:
Visualising Image Classification Models and Saliency Maps,” ICLR, 2014.
[21] P. F. Felzenszwalb, R. B. Girshick, D. Mcallester, and D. Ramanan, “Object Detection with
Discriminatively Trained Part-Based Models.”
[22] F. Baradel, C. Wolf, J. Mille, and G. W. Taylor, “Glimpse Clouds: Human Activity Recognition
from Unstructured Feature Points,” 2018.
[23] J. Choi et al., “Context-aware Deep Feature Compression for High-speed Visual Tracking,” pp.
479–488, 2018.
[24] Y. Zhou, X. Sun, Z.-J. Zha, and W. Zeng, “MiCT: Mixed 3D/2D Convolutional Tube for
Human Action Recognition,” CVPR, pp. 449–458, 2018.
[25] E. Marinoiu, M. Zanfir, and V. Olaru, “3D Human Sensing, Action and Emotion Recognition
in Robot Assisted Therapy of Children with Autism,” CVPR, pp. 2158–2167, 2018.
[26] P. Vicol, M. Tapaswi, L. Castrejon, and S. Fidler, “MovieGraphs: Towards Understanding
Human-Centric Situations from Videos,” pp. 8581–8590, 2017.
[27] P. Wei, Y. Liu, T. Shu, N. Zheng, and S.-C. Zhu, “Where and Why Are They Looking? Jointly
Inferring Human Attention and Intentions in Complex Tasks,” Proc. IEEE Conf. Comput. Vis.
Pattern Recognit. (CVPR 18), pp. 6801–6809, 2018.
[28] K. Simonyan and A. Zisserman, “Very Deep Convolutional Networks for Large-Scale Image
Recognition,” pp. 1–14, 2014.
[29] C. Szegedy et al., “Going deeper with convolutions,” Proc. IEEE Comput. Soc. Conf. Comput.
Vis. Pattern Recognit., vol. 07–12–June, pp. 1–9, 2015.
[30] D. P. Kingma and J. Ba, “Adam: A Method for Stochastic Optimization,” pp. 1–15, 2014.
[31] T. Tieleman and G. Hinton, “Lecture 6.5-rmsprop: Divide the gradient by a running average of its
recent magnitude,” COURSERA: Neural Networks for Machine Learning, 2012.
[32] M. Stikic, K. Van Laerhoven, and B. Schiele, “Exploring semi-supervised and active learning
for activity recognition,” Wearable Comput. 2008. ISWC 2008. 12th IEEE Int. Symp., pp. 81–
88, 2008.
[33] B. Settles, Active Learning, vol. 6, no. 1. 2012.
[34] S. Ren, K. He, R. Girshick, and J. Sun, “Faster R-CNN: Towards Real-Time Object Detection
with Region Proposal Networks,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 39, no. 6, pp.
1137–1149, 2017.
[35] A. Vedaldi and K. Lenc, “MatConvNet Convolutional Neural Networks for MATLAB,” 2016.
[36] P. K. Rhee, E. Erdenee, S. D. Kyun, M. U. Ahmed, and S. Jin, “Active and semi-supervised
learning for object detection with imperfect data,” Cogn. Syst. Res., vol. 45, pp. 109–123, 2017.
[37] “Andrej Karpathy August 2016,” no. August, 2016.
[38] L. Fei-Fei, R. Fergus, and P. Perona, “Learning Generative Visual Models From Few Training
Examples: An Incremental Bayesian Approach Tested on 101 Object Categories,” IEEE CVPR
Work. Gener. Model Based Vis., 2004.
[39] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman, “The pascal visual
object classes (VOC) challenge,” Int. J. Comput. Vis., vol. 88, no. 2, pp. 303–338, 2010.
[40] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei, “Large-scale
Video Classification with Convolutional Neural Networks.”
[41] O. Russakovsky et al., “ImageNet Large Scale Visual Recognition Challenge,” Int. J. Comput.
Vis., vol. 115, no. 3, pp. 211–252, 2015.
[42] O. Chapelle, B. Schölkopf, and A. Zien, Semi-Supervised Learning. 2010.
[43] D. Yarowsky, “Unsupervised word sense disambiguation rivaling supervised methods,” Proc.
33rd Annu. Meet. Assoc. Comput. Linguist. -, pp. 189–196, 1995.
[44] A. B. Goldberg, “Multi-Manifold Semi-Supervised Learning,” pp. 169–176, 2009.
[45] S. Jones and L. Shao, “A Multigraph Representation for Improved Unsupervised/Semi-supervised
Learning of Human Actions,” CVPR, 2014.
[46] T. Zhang, S. Liu, C. Xu, and H. Lu, “Boosted multi-class semi-supervised learning for human
action recognition,” Pattern Recognit., vol. 44, no. 10–11, pp. 2334–2342, 2011.
[47] C. Charles and A. James, Document resume ed 336 049. 1991.
[48] M. Li and I. K. Sethi, “Confidence-Based Active Learning,” vol. 28, no. 8, pp. 1251–1261,
2006.
[49] J. Sourati, M. Akcakaya, D. Erdogmus, T. K. Leen, and J. G. Dy, “A Probabilistic Active
Learning Algorithm Based on Fisher Information Ratio,” IEEE Trans. Pattern Anal. Mach.
Intell., vol. 40, no. 8, pp. 2023–2029, 2018.
[50] J. Bernard, M. Hutter, M. Zeppelzauer, D. Fellner, and M. Sedlmair, “Comparing Visual-
Interactive Labeling with Active Learning: An Experimental Study,” IEEE Trans. Vis. Comput.
Graph., vol. 24, no. 1, pp. 298–308, 2018.
[51] S. Hao, J. Lu, P. Zhao, C. Zhang, S. C. H. Hoi, and C. Miao, “Second-Order Online Active
Learning and Its Applications,” IEEE Trans. Knowl. Data Eng., vol. 30, no. 7, pp. 1338–1351,
2018.
[52] A. Roederer, “Active learning for classification of medical signals,” no. November, 2012.
[53] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT Press, 2016.
[54] S. J. Pan and Q. Yang, “A Survey on Transfer Learning,” vol. 1, no. 10, pp. 1–15, 2010.
[55] G. P. C. Fung, J. X. Yu, H. Lu, and P. S. Yu, “Text classification without negative examples
revisit,” IEEE Trans. Knowl. Data Eng., vol. 18, no. 1, pp. 6–20, 2006.
[56] P. Wu and T. G. Dietterich, “Improving SVM accuracy by training on auxiliary data sources,”
Proc. Int. Conf. Mach. Learn., pp. 110–118, 2004.
[57] C. Republic, “Acl 200 7,” Comput. Linguist., 2007.
[58] C. Szegedy et al., “Going Deeper with Convolutions,” pp. 1–9, 2014.
[59] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, “Rethinking the Inception
Architecture for Computer Vision,” 2015.
[60] J. Duchi, E. Hazan, and Y. Singer, “Adaptive Subgradient Methods for Online Learning and
Stochastic Optimization,” J. Mach. Learn. Res., vol. 12, pp. 2121–2159, 2011.
[61] G. E. Hinton, N. Srivastava, and K. Swersky, “Lecture 6a- overview of mini-batch gradient
descent,” COURSERA Neural Networks Mach. Learn., p. 31, 2012.
[62] G. Gkioxari, R. Girshick, and J. Malik, “Contextual Action Recognition
with R*CNN,” ICCV, 2015.
[63] G. Varol, I. Laptev, and C. Schmid, “Long-term Temporal Convolutions for Action Recognition,”
IEEE Trans. Pattern Anal. Mach. Intell., vol. 40, no. 6, pp. 1510–1517, 2018.
[64] I. Laptev and P. Pérez, “Retrieving actions in movies,” Proc. IEEE Int. Conf. Comput. Vis.,
2007.
[65] Y. Iwashita, A. Takamine, R. Kurazume, and M. S. Ryoo, “First-person animal activity
recognition from egocentric videos,” Proc. - Int. Conf. Pattern Recognit., no. i, pp. 4310–4315,
2014.
[66] S. Ma, J. Zhang, S. Sclaroff, N. Ikizler-Cinbis, and L. Sigal, “Space-Time Tree Ensemble for
Action Recognition and Localization,” Int. J. Comput. Vis., vol. 126, no. 2–4, pp. 314–332,
2018.
[67] V. Delaitre, I. Laptev, and J. Sivic, “Recognizing human actions in still images: a study of bag-
of-features and part-based representations,” 2010.
[68] G. Gkioxari, B. Hariharan, R. Girshick, and J. Malik, “R-CNNs for Pose Estimation and Action
Detection,” arXiv Prepr. arXiv1406.5212, pp. 1–8, 2014.
[69] G. Ge, K. Yun, D. Samaras, and G. J. Zelinsky, “Action classification in still images using
human eye movements,” 2015 IEEE Conf. Comput. Vis. Pattern Recognit. Work., pp. 16–23,
2015.
[70] F. S. Khan, J. Xu, J. Van De Weijer, A. D. Bagdanov, R. M. Anwer, and A. M. Lopez,
“Recognizing Actions Through Action-Specific Person Detection,” IEEE Trans. Image
Process., vol. 24, no. 11, pp. 4422–4432, 2015.
[71] H. Wang, A. Kläser, C. Schmid, and C. L. Liu, “Action recognition by dense trajectories,”
Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit., pp. 3169–3176, 2011.
[72] X. Peng, C. Zou, Y. Qiao, and Q. Peng, “Action Recognition with Stacked Fisher Vectors,”
ECCV, pp. 581–595, 2014.
[73] K. N. E. H. Slimani, Y. Benezeth, and F. Souami, “Human interaction recognition based on the
co-occurrence of visual words,” IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit.
Work., pp. 461–466, 2014.
[74] Y. LeCun, K. Kavukcuoglu, and C. Farabet, “Convolutional networks and applications in
vision,” ISCAS 2010 - 2010 IEEE Int. Symp. Circuits Syst. Nano-Bio Circuit Fabr. Syst., pp.
253–256, 2010.
[75] K. He, X. Zhang, S. Ren, and J. Sun, “Deep Residual Learning for Image Recognition,” 2016
IEEE Conf. Comput. Vis. Pattern Recognit., pp. 770–778, 2016.
[76] L. Sun, K. Jia, K. Chen, D. Y. Yeung, B. E. Shi, and S. Savarese, “Lattice Long Short-Term
Memory for Human Action Recognition,” Proc. IEEE Int. Conf. Comput. Vis., vol. 2017–
Octob, pp. 2166–2175, 2017.
[77] Z. Cao, T. Simon, S. E. Wei, and Y. Sheikh, “Realtime multi-person 2D pose estimation using
part affinity fields,” Proc. - 30th IEEE Conf. Comput. Vis. Pattern Recognition, CVPR 2017,
vol. 2017–Janua, no. Xxx, pp. 1302–1310, 2017.
[78] Y. Han, S. L. Chung, S. F. Chen, and S. F. Su, “Two-Stream LSTM for Action Recognition
with RGB-D-Based Hand-Crafted Features and Feature Combination,” Proc. - 2018 IEEE Int.
Conf. Syst. Man, Cybern. SMC 2018, pp. 3547–3552, 2019.
[79] M. S. Ryoo and J. K. Aggarwal, “Spatio-temporal relationship match: Video structure
comparison for recognition of complex human activities,” Proc. IEEE Int. Conf. Comput. Vis.,
no. Iccv, pp. 1593–1600, 2009.
[80] H. Kuehne, H. Jhuang, E. Garrote, T. Poggio, and T. Serre, “HMDB: A large video database for
human motion recognition,” Proc. IEEE Int. Conf. Comput. Vis., no. November 2011, pp.
2556–2563, 2011.
[81] S. Hanneke, “Rates of convergence in active learning,” Ann. Stat., vol. 39, no. 1, pp. 333–361,
2011.
[82] D. Cohn, L. Atlas, and R. Ladner, “Improving Generalization with Active Learning,” Mach.
Learn., vol. 15, no. 2, pp. 201–221, 1994.
[83] S. C. H. Hoi, R. Jin, J. Zhu, and M. R. Lyu, “Batch mode active learning and its application to
medical image classification,” Proc. 23rd Int. Conf. Mach. Learn. - ICML ’06, pp. 417–424,
2006.
[84] S. Dasgupta, “Two Faces of Active Learning,” pp. 1–20.
[85] K. Chatfield, K. Simonyan, A. Vedaldi, and A. Zisserman, “Return of the Devil in the Details:
Delving Deep into Convolutional Nets,” pp. 1–11, 2014.
[86] X. Y. Zhang, S. Wang, and X. Yun, “Bidirectional active learning: A two-way exploration into
unlabeled and labeled data set,” IEEE Trans. Neural Networks Learn. Syst., vol. 26, no. 12, pp.
3034–3044, 2015.
[87] Y. Yang and M. Loog, “Active learning using uncertainty information,” Proc. - Int. Conf.
Pattern Recognit., pp. 2646–2651, 2017.
[88] D. K. Shin, M. U. Ahmed, and P. K. Rhee, “Incremental Deep Learning for Robust Object
Detection in Unknown Cluttered Environments,” IEEE Access, vol. XX, pp. 2169–3536, 2018.
[89] J. Kwon and K. M. Lee, “Tracking of a non-rigid object via patch-based dynamic appearance
modeling and adaptive basin hopping monte carlo sampling,” 2009 IEEE Comput. Soc. Conf.
Comput. Vis. Pattern Recognit. Work. CVPR Work. 2009, vol. 2009 IEEE, pp. 1208–1215,
2009.
[90] R. Gonzalez and R. Woods, Digital image processing. 2002.
[91] A. T. Lopes, E. de Aguiar, A. F. De Souza, and T. Oliveira-Santos, “Facial expression
recognition with Convolutional Neural Networks: Coping with few data and the training sample
order,” Pattern Recognit., vol. 61, pp. 610–628, 2017.
[92] J. Redmon and A. Farhadi, “YOLO9000: Better, Faster, Stronger,” 2016.
[93] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman, “The pascal visual
object classes (VOC) challenge,” Int. J. Comput. Vis., vol. 88, no. 2, pp. 303–338, 2010.
[94] B. Yao, X. Jiang, A. Khosla, A. L. Lin, L. Guibas, and L. Fei-Fei, “Human action recognition
by learning bases of action attributes and parts,” Proc. IEEE Int. Conf. Comput. Vis., pp. 1331–
1338, 2011.
[95] R. Girshick, “Fast R-CNN,” 2015.
[96] S. Yan, J. S. Smith, W. Lu, and B. Zhang, “Multi-branch Attention Networks for Action
Recognition in Still Images,” IEEE Trans. Cogn. Dev. Syst., vol. 14, no. 8, 2017.
[97] Y. Jia et al., “Caffe: Convolutional Architecture for Fast Feature Embedding.”
[98] S. Chetlur and C. Woolley, “cuDNN: Efficient Primitives for Deep Learning,” arXiv Prepr.
arXiv …, pp. 1–9, 2014.
[99] A. Vedaldi and K. Lenc, “MatConvNet - Convolutional Neural Networks for MATLAB,”
Arxiv, pp. 1–15, 2014.
[100] R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Rich feature hierarchies for accurate object detection and semantic segmentation,” CVPR, pp. 580–587, 2014.
[101] Z. Yu, C. Li, J. Wu, J. Cai, M. N. Do, and J. Lu, “Action Recognition in Still Images with
Minimum Annotation Efforts,” IEEE Trans. Image Process., vol. 25, no. 11, pp. 5479–5490,
2016.
[102] G. Gkioxari, R. Girshick, and J. Malik, “Actions and Attributes from
Wholes and Parts.”
[103] Z. Zhao, H. Ma, and S. You, “Single Image Action Recognition using Semantic Body Part
Actions,” ICCV, pp. 1–9, 2017.
[104] M. S. Ryoo and J. K. Aggarwal, “UT-Interaction Dataset, ICPR contest on Semantic Description of
Human Activities (SDHA),” 2010.
[105] S. Chetlur et al., “cuDNN: Efficient Primitives for Deep Learning.”
[106] W. Brendel and S. Todorovic, “Learning spatiotemporal graphs of human activities,” Proc.
IEEE Int. Conf. Comput. Vis., no. Iccv, pp. 778–785, 2011.
[107] N. A. Tu, T. Huynh-The, K. U. Khan, and Y. K. Lee, “ML-HDP: A Hierarchical Bayesian
Nonparametric Model for Recognizing Human Actions in Video,” IEEE Trans. Circuits Syst.
Video Technol., vol. 29, no. 3, pp. 800–814, 2019.
[108] M. Abadi et al., “TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed
Systems,” 2016.
[109] H. Wang, M. M. Ullah, A. Klaser, I. Laptev, and C. Schmid, “Evaluation of local spatio-temporal
features for action recognition,” 2011.