
Doctoral Thesis

Rollback-based Active Semi-supervised Learning for Human Action Recognition

August 2019

Graduate School of Inha University

Department of Computer Engineering

MINHAZ UDDIN AHMED


Rollback-based Active Semi-supervised Learning for Human Action Recognition

August 2019

Academic Advisor: Phill Kyu Rhee

This dissertation is submitted in candidacy for the degree of Doctor of Philosophy.

Doctoral Thesis

Rollback-based Active Semi-supervised Learning for Human Action Recognition

August 2019

Academic Advisor: Phill Kyu Rhee

A Dissertation
Submitted to the Department of Computer Engineering and the Graduate School
of Inha University in partial fulfillment of the requirements for the degree of
Doctor of Philosophy

Graduate School of Inha University

Department of Computer Engineering

MINHAZ UDDIN AHMED

This certifies that the dissertation of
MINHAZ UDDIN AHMED is approved

August 2019

Dissertation Committee

Professor Yoo Sang Bong __________________

Professor Phill Kyu Rhee __________________


(Academic Advisor)

Professor Young Bin Kwon_________________

Professor Lee Seung Gol___________________

Professor Kwon Jang Woo__________________

Abstract

This thesis studies the problem of recognizing human actions from video and still images. Human action recognition from video covers two scenarios: single-person actions and interactions involving multiple people. Moreover, a human action can be a physical body movement, such as walking or reading, or an interaction with the environment or with an object for a specific purpose, such as using a computer or taking photographs. In addition, recognizing a human action specifies the present state of a particular person and location. We are motivated by the many emerging real-world applications of this research.

To recognize human actions, we present a rollback-based active semi-supervised learning algorithm that focuses on single-person action classification and two-person interaction recognition. Labeling human actions in a large volume of video frames is expensive and time consuming. Active learning combined with semi-supervised learning can substantially reduce the labeling time when classifying human actions in videos and images. Human action recognition is challenging because of complex backgrounds, object scale, variations in pose, and the misinterpretation of actions. To overcome these challenges, we prepare our dataset in two categories: tentatively labeled data, which exploits a number of pre-trained network models, and batch-labeled data for training and testing. Our proposed human action recognition method uses an ensemble of deep convolutional neural network (CNN) models to train on these data, as discussed in detail in Chapters 3 and 4. The integrated ensemble learning, together with the active semi-supervised learning method, improves the overall human action recognition accuracy.

We study a limited number of actions for both single-person action recognition and two-person interaction recognition, such as using a computer, jumping, reading, phoning, taking a photograph, hugging, fighting, linking arms, talking, and kidnapping, in two environments that range from simple to complex. From the Intelligent Technology Laboratory (ITLab), Inha University, we gather two types of data: simple-environment data (clutter-free backgrounds) and complex-environment data (cluttered backgrounds). We achieve 95.6% accuracy on the simple-environment data and 81% accuracy on the complex-environment data. We also conduct extensive experiments on human action recognition benchmark datasets and obtain better performance than other state-of-the-art approaches. We show that our approach is a state-of-the-art method that jointly exploits several learning algorithms for successful human action recognition. Detailed experimental results appear in Chapter 5.

Acknowledgements

It has been my great privilege to know and work with my advisor, Professor Phill Kyu Rhee, in various capacities over the past seven years. I thank him for his great guidance and timely support, and I am grateful to have had him as an advisor.

I am grateful to my dissertation committee members, particularly Professor Young Bin Kwon, Professor Yoo Sang Bong, Professor Lee Seung Gol, and Professor Kwon Jang Woo, for their valuable comments, which improved my dissertation.

I would like to thank Dr. Md Rezaul Bashar, who gave me excellent guidance throughout my Ph.D. studies. I would also like to thank Dr. Musarrat Hasan, Dr. S. M. Riazul Islam, Dr. Abol Ghasem, and Chad Pozsgay for their encouragement and support during my time of study.

I am grateful to Kim Jin Woo, Kim Yeong Hyeon, Han Shibo, and the other Intelligent Technology Lab members in the Department of Computer Engineering, Inha University, for supporting me on many occasions. I would like to express my appreciation to all of my friends and lab members for their support.

I am indebted to my parents for their sacrifices. The unconditional love and support of my younger brothers and sister have helped me move forward during my Ph.D. studies in Korea.

I also thank many people, especially the Incheon Talk House members and the Juan Rotary Club members, for their love and kindness towards me; they have made my experience at Inha so pleasant.

Table of Contents

Abstract…………………………………………………………………… i

Acknowledgement………………………………………………………… v

Table of contents………………………………………………………….. vi

List of Tables……………………………………………………………… x

List of Figures…………………………………………………………….. xi

1 Introduction

1.1 Human action recognition ………..……………….……………………… 15

1.2 Previous Approaches…………………………………………………….. 16

1.3 Motivations……………………………………………………………… 20

1.4 Proposed Rollback-based Active Semi-supervised Learning ……………. 21

1.5 Thesis Organization ……………………………………………………… 24

2 Related Work

2.1 Image Classification……………………………………………………… 25

2.2 Supervised Learning …………………………………………………....... 26

2.3 Semi-supervised learning…………………………………………………. 27

2.4 Active Learning…………………………………………………………… 28

2.5 Active Semi Supervised Learning………………………………………… 29

2.6 Common Sampling Strategies…..…………………………………………. 29

2.7 Convolutional Neural Network………………………………………….. 31

2.8 Transfer Learning……………….……………………………………….. 32

2.8.1 VGG ……………………………………………………………………… 33

2.8.2 Inception………………………………………………………………… 34

2.9 Optimizer………………………………………………………………….. 34

2.9.1 Adam……………………………………………………………………… 34

2.9.2 RMSprop………………………………………………………………….. 35

2.10 Related work of Human Action Recognition ……………………………. 36

2.10.1 Single person action recognition………………………………….………. 36

2.10.2 Two person interaction recognition …………………………………..….. 37

2.11 Summary…………………………………………………………………. 38

3 Overview of the Proposed Method

3.1 Necessity of Active Learning…………………………………………….. 40

3.2 Pre-trained Neural Network……………………………………………… 43


3.3 EER-ASSL Learning………………………………………………………. 45

3.4 EER forward learning process…………………………………………… 47

3.5 EER rollback learning process…………………………………………… 48

3.5.1 Removal process…………………………………………………………. 50

3.6 EER-ASSL Algorithm……………………………………………………. 50

3.7 Summary…………………………………………………………………. 53

4 Human Action Recognition

4.1 One person Action Recognition………………………………………….. 53

4.1.1 One person Action Recognition Analysis………………………………… 55

4.2 Two person interaction recognition……………………………. 57

4.2.1 Two person interaction recognition Analysis…………………………… 58

4.3 EER-ASSL based Human Action Recognition……………………………. 58

4.3.1 Person Detection………………………………………………………… 59

4.3.2 Human action smoothing and classification…………………………….. 59


4.4 Summary…………………………………………………………………. 61

5 Experiments on Action recognition

5.1 Single person action recognition………………………………………….. 62

5.1.1 Dataset Overview…………………………………………………………. 62

5.1.2 PASCAL VOC Action dataset……………………………………………. 62

5.1.3 Stanford 40 dataset………………………………………………………... 64

5.1.4 IT Lab Action dataset……………………………………………………... 66

5.1.5 Experiment Setup…………………………………………………………. 67

5.1.6 Evaluation………………………………………………………………… 68

5.1.7 Benchmark Dataset Comparison………………………………………….. 71

5.2 Two person Interaction recognition ……………………………………… 73

5.2.1 UT-Interaction dataset…………………………………………………….. 74

5.2.2 HMDB: a large human motion database …………………………………. 74

5.2.3 ITLab Interaction dataset ………………………………………………… 75

5.2.4 Experiment Setup ……………………………………………………….. 76

5.2.5 Evaluation………………………………………………………………… 76

5.2.6 Benchmark dataset comparison…………………………………………… 80

5.3 EER-ASSL based human action recognition……………………………... 82

5.3.1 Benchmark dataset……………………………………………………….. 82

5.3.2 Comparison with state-of-the-art methods on the KTH action dataset…… 84

5.4 Summary………………………………………………………………… 85

6 Conclusion and Future Research

6.1 Thesis summary ………………………………………………………….. 86

6.2 Contributions …………………………………………………………….. 86

6.2.1 Single person Action recognition ………………………………………… 86

6.2.2 Two person interaction recognition…………………………………….… 87

6.2.3 EER-ASSL based Human action ………………………………………… 88

6.3 Future Research………………………………………………………….. 88

Appendix A: Active semi-supervised learning tool for human action recognition……… 90

Bibliography…………………………………………………………………………… 95

List of Tables

5.1 Number of images used for training on the IT Lab Action dataset………………….. 68

5.2 Performance with five different actions from the PASCAL VOC 2012 dataset ……. 72

5.3 Comparison between the IT Lab dataset and the Stanford 40 Actions dataset............ 73

5.4 Recent different methods and performance results on the Stanford 40 dataset............ 73

5.5 K-fold cross validation based on a simple environment dataset……………….…….. 79

5.6 Performance of the ASSL framework for two environments……………………….. 80

5.7 Comparative techniques for action classification with the Adam optimizer ……….…... 80

5.8 Comparative techniques for action classification with the RMSprop optimizer.………... 81

5.9 Several comparative techniques for action classification ………………………......... 81

5.10 Comparison of action classification on HMDB51 dataset………………………….... 83

5.11 Performance comparison of recent methods and EER-ASSL method………………... 84

List of Figures

Figure 2.1 A sample human action classification, where a person is using a computer…….. 26

Figure 2.2 A convolutional neural network. The network takes a single input image and computes N outputs. Each output is linked with a different loss function……… 33

Figure 2.3 VGG16 model for ImageNet…………………………………………………. 34

Figure 3.1 Expected Error Reduction -Active Semi-Supervised Learning architecture…. 45

Figure 4.1 Architecture for one person action recognition……………………………… 54

Figure 4.2 Block diagram for two-person action recognition…………………………… 57

Figure 4.3 Flow diagram of EER-ASSL based Human action recognition……………… 58

Figure 5.1 Five example action categories, (a) to (e): jumping, phoning, taking a photo, using a computer, and reading, from the PASCAL VOC 2012 action recognition dataset. These classes illustrate human actions strongly characterized by pose……………… 64

Figure 5.2 Five example action categories, (a) to (e): jumping, phoning, reading, taking a photo, and using a computer, from the Stanford 40 Actions dataset. Five actions are selected from among the 40……………… 65

Figure 5.3 IT Lab action dataset with simple backgrounds………………………………. 66

Figure 5.4 IT Lab action dataset with complex backgrounds. ……………………………. 67

Figure 5.5 Human action recognition performance on the IT Lab dataset, using both semi-supervised learning (SSL) and active learning (AL) steps…............. 69

Figure 5.6 Training and validation errors versus the number of epochs……………... 70

Figure 5.7 Performance with a smaller number of iterations………………………... 70

Figure 5.8 Human action performance analysis comparing the proposed Effective Adaptive Deep Learning with the baseline results on the PASCAL VOC action dataset…………. 71

Figure 5.9 Examples of UT-Interaction dataset………………………………..………….. 74

Figure 5.10 Examples from the HMDB dataset, where the action categories are punching and hugging… 75

Figure 5.11 Two-person action in simple environments of the ITLab action dataset…….. 76

Figure 5.12 The performance graph based on a simple-environment dataset……………… 79

Figure 5.13 Two-person action in an ITLab dataset with a cluttered environment ………….... 79

Figure 5.14 Performance chart based on a complex-environment dataset………………… 80

Figure 5.15 Confusion matrix for two-person action in a complex environment………….. 81

Figure 5.16 KTH action dataset with six human actions……………………………………. 83

Figure 5.17 The information flow of the rollback-based EER-ASSL and SSL……………. 84

Figure A.1 Common dataset merging tool……………………………………. 90

Figure A.2 A common SSL tool for evaluation……….……………………... 90

Figure A.3 Using the active learning tool for human action recognition; here the action is using a computer……………..………………………………………. 91

Figure A.4 Training the human action dataset with the MatConvNet library [38] based ASSL tool; here the first dataset shows poor performance……………… 92

Figure A.5 Training performance on four sets of data, where set 1 and set 2 produce the same result and set 4 produces a poor result, causing a rollback to set 3………... 93

Figure A.6 Training performance after incremental learning; here the performance reaches saturation………………………………….……… 94

Chapter 1

Introduction

1.1 Human action recognition

Human action recognition is an important problem because it can identify real-life personal interactions, personal movement, and particular tasks carried out by a person, which facilitates better decision making. Still pictures and video contain a significant amount of information about human actions, and interpreting and understanding these actions helps to solve real-life problems such as surveillance video analysis for safety, elderly patient monitoring, sports video analysis, detection of abnormal activities or irregular behavior, and human–computer interaction. As humans, we can easily understand a person's activity simply by observing it with our own visual system. However, it is very expensive to use human labor to monitor the various human actions in real-world scenarios over long periods, for example in visual surveillance. Therefore, the main objective of many researchers is to develop a machine that can recognize human actions and serve us.

Human actions can be classified into several types: simple body movements, such as running, reading, walking, or jumping; interactions with the environment or with an object for a specific purpose, such as using a computer, taking a photograph, phoning, or playing an instrument; and complex body movements involving multiple people. Human action recognition is a very challenging research area due to the number of factors involved, such as pose variation, complex backgrounds, object scale, occlusion, camera motion, misinterpretation of actions, and the large amount of unlabeled action data. Currently, human action recognition is a very popular topic due to its numerous real-world applications, such as security surveillance (at airports, bus stations, and railway stations), content-based browsing (for instance, fast-forwarding to find an important scene), monitoring elderly people, especially in hospitals, and video recycling, such as eliminating harmful video content for children.

1.2 Previous Approaches

Image classification problems such as human action recognition have gained substantial attention from researchers over the past two decades. During this period, classical machine learning approaches were proposed to solve these problems, such as probabilistic modeling (naïve Bayes), early neural networks, kernel methods (the support vector machine), and decision trees. From the beginning, these algorithms exhibited state-of-the-art performance on simple classification tasks [1]. Before the ImageNet challenge [1] emerged, such handcrafted methods were sufficient to handle early classification problems; however, they have proved hard to scale up to large datasets. Within a few years of the ImageNet challenges [1], deep learning methods replaced many of these algorithms, such as the SVM and the decision tree, across a wide range of applications. Many image classification challenges, such as MS COCO [2] and ImageNet, provide a large amount of training, testing, and validation data for solving large-scale image classification problems.

Recognizing human actions from video in real time has made it necessary to develop faster strategies, mainly for security surveillance, the sports industry, and healthcare. Accurate video data analysis has accordingly emerged as an important research area with a number of real-time applications. Human action recognition from video data consists of two tasks: first, human detection and tracking; second, classifying the action of the detected person. Once the first task is done and the person is successfully detected, other information, such as pose and motion, determines the particular action that person is taking or is about to take. Thus, the combination of multiple pieces of information helps us determine the particular action.

Previously, human activity recognition was mostly based on a fixed camera position at a specific angle. Most of the time the viewpoint of the person and the background were fixed, so the background subtraction method worked well due to its simplicity: the video frame is subtracted from a background image to obtain the foreground information. However, the major problem of background subtraction is its sensitivity to illumination changes. Moreover, with a moving camera, object segmentation is more challenging because both the target object and the camera are in motion; since the background changes frequently, a background model is not appropriate for segmentation. One popular method for moving-camera scenes is computing the temporal difference between image frames [3]. It is also necessary to calculate the pixel-level motion between two images; optical flow, which approximates this pixel-level motion, can be used to recover motion under a moving camera [4].

To segment a person from video data, a number of features are considered, such as shape, color, and motion, space–time information, frequency transformations, local descriptors, and body modeling [5]. One major problem of space–time information is that it is only available for non-periodic human activity recognition. Moreover, space–time volumes usually consider the entire image and are therefore limited under viewpoint changes and occlusion [6]. On the other hand, the scale-invariant feature transform (SIFT) [7] and the histogram of oriented gradients (HOG) [8] are, in certain cases, largely invariant to background clutter, appearance changes, occlusion, rotation, and scale, but they do not fully capture whole-body actions. As a result, many researchers proposed human modeling methods in which 2D and 3D body models are measured. For body modeling, conventional optical systems commonly use markers to capture human motion, with users wearing optical markers; this helps to locate the positions of the body parts to which the markers are attached. To remove the effects of occlusion, a number of additional cameras are installed at different locations to ensure full coverage of the human body, and the total number of cameras can reach several hundred, which is expensive [9].

One common temporal classification method is dynamic time warping (DTW), which measures the similarity between two temporal sequences; it is popular for its simplicity but not suitable for a large number of human action classes with many variations [10]. There are probability-based methods, such as the hidden Markov model (HMM) [11] and the dynamic Bayesian network (DBN) [12], and discriminative models such as the support vector machine (SVM), whose performance depends heavily on an extensive training dataset [13]. A number of methods have been applied to recognize human actions. However, most of them, such as the histogram of oriented gradients (HOG) [8], bag of features (BOF) [14], and space–time interest points (STIP) [15], depend heavily on handcrafted feature extraction, which is time-consuming and inefficient. In recent years, the deep convolutional neural network (CNN) has been used extensively for computer vision applications, especially human action detection [16], [17], because of its higher accuracy and applicability to real-life problems. A deep CNN learns features automatically, and better than hand-crafted ones. Building a deep CNN architecture from scratch is a challenging task; however, the transfer learning technique [18] enables information transfer, which reduces the burden of rebuilding the architecture and the training time. Producing a positive transfer between appropriately related tasks, while avoiding negative transfer between less related tasks, remains an important issue. To overcome this challenge, researchers select accurate state-of-the-art models from the ImageNet [1] challenges, such as AlexNet [19] and VGGNet [20], which are already trained on thousands of object categories with high accuracy and low error rates. These pre-trained models improve object recognition performance over previous approaches, such as the deformable part model [21].

Human action recognition is a very challenging task: activities are complex and diverse, and the number of people involved in a particular event can be one or many [22]. Many activities are highly correlated with motion, so the precise point of visual interest plays an important role. Another major challenge is recognizing human actions in real time using a deep learning architecture, because this requires continuous fine-tuning of the network to learn the changing appearance of the target [23]. In deep learning approaches, spatial and temporal signals often become coupled through 3D convolution, and the network consequently becomes hard to optimize with dozens of 3D convolution layers; moreover, the memory cost of 3D convolution can be unaffordably high when building a 3D CNN [24]. Further challenges in human action recognition are large datasets with long videos, highly variable actions, partially visible objects, varying ages, and unpredictable behavior. A common human action differs from person to person, and such actions can involve confusion, fear, or basic misunderstandings of emotions and affect; recognizing these verbal and non-verbal interactions is a challenging task [25]. Another important challenge is that a model trained on a local dataset does not adapt well to the richness of real-world images [25]. Socially intelligent robotics is a growing area in which a robot reads people's emotions, motivations, and other factors that affect behavior. People usually know how to talk to seniors, react to an angry person or a naughty child, or cheer up a friend. For social robots to adapt to people's emotions, classifying human actions is necessary, since a particular person may be thinking about something, e.g., searching for an object or checking an object's condition [26], [27].

1.3 Motivations

The objective of this dissertation is to build an effective system that exploits active and semi-supervised learning to achieve accurate identification of a single person's actions or two-person interactions. Understanding people's actions is valuable for numerous applications, such as surveillance, human–computer interaction, health care, sports and entertainment, filtering out immoral video content, and scene understanding; all of these require efficient action recognition from video. Complex environments (crowded places, such as markets and bus stations) make it difficult to recognize human actions. Other real-time factors include pose dissimilarity (for instance, the difference between running and walking), person-to-person interactions, person-to-object interactions (e.g., using a phone or a camera), low image quality, occlusion, noisy backgrounds, and poor lighting conditions. As discussed in Section 1.2, most previous approaches, both generative and discriminative, underperformed, which motivates us to apply state-of-the-art deep learning methods. The deep learning based approach together with active learning (AL) has little in common with previous methods. There are many occasions where few labeled training points are available but a large amount of unlabeled data is given; in that scenario, semi-supervised learning (SSL) can play a big role. In this regard, we investigate how efficiently we can use pre-trained models such as VGG [28] and Inception [29], optimizers such as Adam [30] and RMSprop [31], the sample selection procedure, and parameter choices such as the number of training epochs, the learning rate, and threshold values. Another goal of our work is to implement a competent system that works well in a variety of environments, from simple to complex. Finally, we expect an improvement in the recognition rate for single-person action recognition and two-person interaction recognition; we discuss our method in the next section.

1.4 Proposed Rollback-based Active Semi-supervised Learning

In this dissertation we develop a rollback-based active semi-supervised learning method that can select the best model among a set of models by forward learning or rollback learning, in order to achieve improved performance in human action recognition. A large, properly annotated training dataset strongly affects the quality of human action recognition, but labeled action recognition data are hard to obtain because labeling is costly and very time-consuming. We address this problem with semi-supervised learning (SSL), which utilizes both labeled data (as in supervised learning) and unlabeled data (as in unsupervised learning). In our work we separate the dataset into two categories: a tentatively labeled dataset and batch datasets.

We then apply an active learning (AL) method to each batch of the training dataset. Because our proposed method combines active learning and semi-supervised learning, we name the framework active semi-supervised learning (ASSL); it recognizes single-person actions and two-person interactions from single images and videos. A convolutional neural network architecture cannot fully exploit a limited number of image sets, so we propose a more refined approach that combines SSL [32] and AL [32], [33] to tackle this problem. ASSL takes advantage of both SSL and AL, and the training process repeats until the predictor reaches saturation. The ASSL method allows us to separate the classification outcome from misleading unlabeled data and to improve the existing model by combining previous data incrementally. Labeling a huge number of images is very time-consuming and computationally expensive; hence, our method selects the best subset of the data and then trains the model gradually.

We train our model in a batch manner, where the batch datasets are trained sequentially and each result carries forward to the next batch. The outcome on one batch helps us anticipate the next batch's result through comparison. If one batch performs poorly compared to the previous batch, the training dataset is likely to contain erroneous data. After careful observation of the erroneous data, our approach reveals the misclassified samples, and the wrongly posed or noisy images that increase the possibility of poor performance are eliminated. With the aid of active learning, we get rid of noisy images by correctly relabeling them or simply replacing them with good ones; consequently, the batch training performance improves. Moreover, incremental learning with a suitable threshold value also shortens the labeling time through semi-supervised learning. The combination of active learning and semi-supervised learning can be applied in several areas, such as object tracking and object recognition. We observe that rollback-based active semi-supervised learning yields a lower training error rate and works well with a small number of samples, which is significant for incremental learning; as the model improves, it also reduces the overfitting problem. However, rollback has little effect once the model reaches saturation or if the dataset is biased. We discuss the details of our technique in Chapter 3 and present the performance results of various experiments in Chapter 4. We apply our rollback-based active semi-supervised learning to single-person action recognition and two-person interaction recognition, briefly described in the next section.
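To make the control flow concrete, the following is a minimal sketch of the batch-wise forward/rollback loop, assuming generic train, evaluate, and clean_batch helper routines (all illustrative names; the actual implementation in this thesis is MatConvNet-based and is detailed in Chapter 3):

    # Hedged sketch of rollback-based batch training. `train`, `evaluate`,
    # and `clean_batch` are assumed helpers, not the thesis's actual API.
    def rollback_assl(model, batches, val_set, train, evaluate, clean_batch):
        best_model, best_score = model, evaluate(model, val_set)
        for batch in batches:
            candidate = train(best_model, batch)            # forward learning step
            score = evaluate(candidate, val_set)
            if score < best_score:
                # Rollback: discard the degraded model, let active learning
                # relabel or replace noisy samples, then retrain once.
                batch = clean_batch(batch, best_model)
                candidate = train(best_model, batch)
                score = evaluate(candidate, val_set)
            if score >= best_score:
                best_model, best_score = candidate, score   # carry forward
        return best_model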

For single-person action recognition, we include an ensemble learning based method that exploits the deep VGG16 network (from the Visual Geometry Group at the University of Oxford) within an incremental active learning framework that outperforms state-of-the-art technologies for recognizing human actions. We deploy the VGG16 network in our research due to its promising results. We consider five actions for single-person recognition, namely using a computer, jumping, reading, phoning, and taking a picture, in both simple and complex environments. We apply our method to two benchmark datasets: the PASCAL VOC 2012 dataset and the Stanford 40 Actions dataset.

For two-person interaction recognition, we use a Faster R-CNN [34] based action recognition framework built on the MatConvNet [35] library, with a trained CNN for each action. Two-person interaction recognition is challenging due to the small training dataset, and some of the actions used in our experiments, such as kidnapping and linking arms, are not common in benchmark datasets. We found that two benchmark datasets, UT-Interaction and HMDB51, have actions similar to those in the ITLab two-person interaction dataset, whose actions are hugging, fighting, linking arms, talking, and kidnapping. We use these two benchmark datasets in the experiments detailed in Chapter 4.

1.5 Thesis Organization and Contributions

In this dissertation we develop a rollback-based active semi-supervised learning method. In particular, we develop a convolutional neural network based approach that uses active semi-supervised learning to train a number of datasets for human action classification.

In Chapter 2 we provide an overview of the relevant background, covering supervised learning, semi-supervised learning, active learning [36], convolutional neural networks, and transfer learning, and we review common approaches to human action classification.

In Chapter 3 we explain our proposed method in detail and its application to human action recognition. The architecture is based on expected error reduction based active semi-supervised learning.

In Chapter 4 we explain the detailed architecture of our rollback-based ASSL system and its application to classifying human actions from images. The architecture is based on a convolutional neural network combined with active learning and semi-supervised learning.

In Chapter 5 we present experimental results validating our proposed system for recognizing human actions from image sequences and videos, including two-person interaction recognition. We also describe several benchmark datasets related to our work.

In Chapter 6 we conclude the thesis with our contributions and discuss the remaining challenges for future research.

Chapter 2
Related work

Human activity recognition in images and videos has received enormous attention in recent years. This chapter provides a literature survey with the necessary background on human action recognition: supervised learning, semi-supervised learning, active learning, convolutional neural networks, transfer learning, and optimizers, together with related work on single-person action recognition and two-person interaction recognition, followed by a summary.

2.1 Image Classification

Image classification is a simple task for a person; the same task done by a computer, however, is not straightforward and can be very challenging. When using a computer, we label a number of datasets with some predefined set of categories, for example the set of furniture, where an object can be a table, a chair, a desk, and so on. When we select a specific recognition domain we are restricted, because other object domains could contain things such as models of vehicles. The image classification problem can be categorized as binary classification, where only two classes are involved, or multiclass classification, where several classes are involved. An example of binary classification is deciding whether an image contains a person or not; a multiclass classification problem, on the other hand, might determine the specific action performed by a person among hundreds of actions. This second type of problem is also denoted the image categorization task. Selecting the right set of image categories is important for practical applications; when classifying objects, we should consider a few things, such as the type of object, non-objects, and inanimate things [37].

Fig. 2.1: A sample human action classification, where a person is using a computer.

Research indicates that human-level performance has a five percent error rate on scene understanding tasks and requires about 50 milliseconds to classify a scene. At the beginning of the classification challenges, shallow methods were used on datasets such as the Caltech 101 benchmark, where most categories had about 50 images, collected in September 2003 [38]. Later, the PASCAL Visual Object Classes Challenge offered more realistic high-resolution images, which became the standard [39]. Nowadays the state-of-the-art database for classification is ImageNet [40], which is organized according to the WordNet hierarchy. In WordNet, a concept described by multiple words or word phrases is known as a "synset"; there are more than 100,000 synsets in the WordNet hierarchy, and ImageNet provides 1000 images for every known category [41].

2.2 Supervised Learning

We prepare our batch datasets with the help of supervised labeling, since our objective is to improve the performance of a pre-trained model. The goal is to learn a mapping from 𝑥 to 𝑦 given training pairs (𝑥𝑖, 𝑦𝑖); these pairs are sampled from a distribution, and the mapping is evaluated through its predictive performance. However, labeling is costly, so we keep the labeled dataset small, on average 10 images per action, aiming to achieve the desired performance with less labeled data; each batch dataset contains 50 images covering five actions. Of the two families of supervised learning algorithms, we lean on the generative one due to its predictive nature [42]. In our framework we follow the generative procedure, which models the class-conditional density 𝑃(𝑥|𝑦) via an unsupervised learning method. We discuss the necessity of semi-supervised learning in detail in the next section.

2.3 Semi-supervised learning

There are many occasions where few labeled training points are available but a large amount of unlabeled data is given. In this scenario, SSL can play a big role. In mathematical terms, the knowledge of 𝑃(𝑥) that one gains through the unlabeled data has to carry information that is useful for the inference of 𝑝(𝑦|𝑥); if this condition is not fulfilled, SSL offers no improvement over supervised learning. A number of assumptions are useful for SSL, such as the smoothness assumption, the cluster assumption, low-density separation, and the manifold assumption. For example, if two points 𝑥1 and 𝑥2 in a high-density region are close, then the corresponding outputs 𝑦1 and 𝑦2 should also be close [43]. SSL seeks to enhance a set of 𝑙 labeled images 𝑥1, …, 𝑥𝑙 ∈ 𝑋 with corresponding labels 𝑦1, …, 𝑦𝑙 ∈ 𝑌 using a large amount of unlabeled sample information: given 𝑢 unlabeled images 𝑥𝑙+1, …, 𝑥𝑙+𝑢 ∈ 𝑋, SSL uses the labeled pairs together with the unlabeled data to obtain better classifications [22]. In other words, semi-supervised learning also tries to train the model from unlabeled samples [23]. Jones et al. developed a method called the Feature Grouped Spectral Multigraph (FGSM), which is produced by clustering [45]. Zhang et al. [46] proposed a boosting-based multiclass SSL that formulates action recognition with insufficient labeled data.

In our approach we use self-training, also known as self-learning, a type of semi-supervised learning. SSL training reduces the effort required to prepare the training set by training the model with two types of data: a small number of fully labeled examples and an additional set of tentatively labeled or unlabeled examples. A model trained in this way can perform better than one trained in the traditional manner. In tentatively labeled training data, the label of each image region can take the form of a probability distribution over labels, which makes it possible to capture a variety of information about the training examples, for instance that a specific image has a high likelihood of containing a given human action type.
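As a concrete illustration, the following is a minimal self-training sketch in Python, assuming a generic scikit-learn classifier and a confidence threshold (both illustrative choices; the thesis itself trains deep CNN models rather than a logistic regression):

    # Minimal self-training (self-learning) sketch: fit on labeled data,
    # tentatively label unlabeled samples, and absorb only confident ones.
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def self_train(X_l, y_l, X_u, threshold=0.95, max_rounds=10):
        model = LogisticRegression(max_iter=1000)
        for _ in range(max_rounds):
            model.fit(X_l, y_l)
            if len(X_u) == 0:
                break
            probs = model.predict_proba(X_u)       # tentative label distribution
            conf = probs.max(axis=1)
            keep = conf >= threshold               # accept confident samples only
            if not keep.any():
                break                              # nothing confident left
            X_l = np.vstack([X_l, X_u[keep]])
            y_l = np.concatenate([y_l, model.classes_[probs[keep].argmax(axis=1)]])
            X_u = X_u[~keep]
        return model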

2.4 Active Learning

The specialty of active learning is that the algorithm can select the data it learns from; therefore it can perform better with less training data [33]. One theoretically inspired query strategy is known as query-by-committee (QBC), and our framework is motivated by this strategy in order to achieve better performance with less labeled data. Generally, AL achieves high performance using only a limited number of labels, minimizing the labeling cost, which is why many machine learning systems use it. The labeling required for supervised learning tasks is expensive, laborious, and difficult to obtain, for example in speech recognition, document classification, and information extraction [47]. Within active learning, batch-mode active learning [33] performs much better than single-instance mode when handling large amounts of data, because it is less costly [48]. It is a balanced strategy in which AL minimizes the labeling effort by posing queries [49]. Bernard et al. conduct a three-part labeling study: they identify different visual interactive labeling (VIL) approaches, investigate the strengths and weaknesses of these labeling strategies, and determine the impact of using different encodings to direct the visual interactive labeling process [50]. The VIL method needs fifty labeling iterations and makes a cold start to tackle the AL problem, which can be intense; in the future, the empirical performance of VIL may rely on a user-based information selection method. Huo et al. [51] present a framework for the binary classification problem known as Second Order Active Learning (SOAL), which uses both first- and second-order information in an efficient query strategy to predict incoming unlabeled data and build the learner's confidence.
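A hedged sketch of a query-by-committee selection step follows; the three committee members and the vote-entropy disagreement measure are illustrative choices, not the exact committee used in our framework:

    # Query-by-committee: train several classifiers and query the unlabeled
    # samples on which the committee disagrees the most (vote entropy).
    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.svm import SVC

    def qbc_select(X_l, y_l, X_u, n_queries=10):
        committee = [LogisticRegression(max_iter=1000),
                     RandomForestClassifier(n_estimators=50),
                     SVC()]
        for m in committee:
            m.fit(X_l, y_l)
        votes = np.stack([m.predict(X_u) for m in committee])  # shape (C, U)
        entropy = np.zeros(votes.shape[1])
        for c in np.unique(y_l):
            p = (votes == c).mean(axis=0)                      # vote fraction
            entropy -= p * np.log(np.clip(p, 1e-12, None))
        return np.argsort(-entropy)[:n_queries]                # indices for the oracle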

2.5 Active Semi Supervised Learning

We present an ASSL framework that combines the deep features mentioned in Sections 2.3 and 2.4 with a batch-mode algorithm, similar to [36]. We use batch-mode active learning, which suits our image sets [36], [33]: very little training data cannot produce satisfactory performance, whereas 10 images per action show a better result in a single batch of the dataset. Therefore, we adjust the number of images for each human action in order to get better performance. ASSL initially trains a deep convolutional neural network from a pre-trained network and then repeatedly retrains the model by adding new batch data; merging the batch data with the deep feature model and incrementally training continues until the model reaches saturation. One more circumstance is that fewer iterations yield poor performance, so we use a minimum of 50,000 iterations with the Intelligent Technology Laboratory (ITLab) dataset. Naturally, a large number of iterations requires a lengthy training time to update the weights. We discuss ASSL and EER-ASSL in more detail in Chapter 3.

2.6 Common Sampling Strategies

Most previous active learning work considers three main types of sampling methods for unlabeled instances [52]. All of them assume that queries take the form of unlabeled instances that are later labeled by the oracle. In query synthesis, the learner may query any unlabeled instance; typically it synthesizes queries from scratch with an algorithm that samples instances from a different distribution. This query selection process is optimized for a number of advantages, such as improved efficiency. However, the model that produced the data is sometimes not available, so it is not possible to check ahead of time whether a synthesized query will make sense to the oracle.

2.6.1 Stream-based sampling

Stream-based sampling, also known as selective sampling, draws unlabeled instances one at a time from the data source. Usually the data are sampled from the real world, where in many cases the input distribution is uniform or unknown. Stream-based selective sampling is suitable for real-world human action recognition, since video samples of a particular person's movement are gathered at regular intervals.

2.6.2 Pool-based sampling

In many cases, a large amount of data can be gathered at one time, where pool-based learning becomes handy. The oracle selects data from the pool to query in a greedy manner: pool-based learning evaluates and ranks the whole dataset before selecting the best query. Stream-based sampling, in contrast, scans the data sequentially and makes a query decision for each point separately. Pool-based sampling is very popular and has been studied extensively in the literature.

2.6.3 Collaborative sampling


In our work, we combine uncertainty, diversity, and confidence sampling into a collaborative sampling procedure because of their complementary advantages. This procedure selects queries by allowing the algorithm to query the instances about which it is least certain. In least-confident sampling, the algorithm simply looks at the posterior probabilities and chooses the instance whose top probability is closest to 0.5 (in the binary case); with several class labels, it selects the instance whose prediction is least confident.
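A minimal least-confidence selection step can be sketched as follows, assuming any classifier that exposes predict_proba (an illustrative assumption):

    # Least-confidence sampling: query the samples whose top posterior
    # probability is smallest, i.e., the predictions the model trusts least.
    import numpy as np

    def least_confident(model, X_u, n_queries=10):
        probs = model.predict_proba(X_u)     # (U, N) posterior estimates
        confidence = probs.max(axis=1)       # top-class probability per sample
        return np.argsort(confidence)[:n_queries]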

2.7 Convolutional Neural Network

A convolutional neural network (CNN) is a special kind of neural network for processing large amounts of data, especially in computer vision tasks such as human action recognition, and it has played an important role in practical deep learning applications. It consists of an input and an output layer with multiple hidden layers in between, which contain convolutional layers, pooling layers, and fully connected layers, and it operates mainly on the pixel intensities of the images. Generally, an input example is a multi-dimensional array of height × width × three color channels (red, green, blue), e.g., a 256×256×3 color image. A convolution layer takes the images from the previous layer and applies a specified number of filters to create output feature maps; the number of output feature maps equals the number of filters. Convolution is an operation on two functions of a real-valued argument and can be viewed as matrix multiplication [53]. Consider noisy position measurements 𝑥(𝑡) of a spaceship and a weighting function 𝑤(𝑎), where 𝑎 is the age of a measurement. Applying this weighted-average operation at every moment, we get a new function 𝑠(𝑡) that provides a smoothed estimate of the position of the spaceship:

𝑠(𝑡) = ∫ 𝑥(𝑎)𝑤(𝑡 − 𝑎)𝑑𝑎 (2.1)

This operation is known as convolution and is denoted with an asterisk:

𝑠(𝑡) = (𝑥 ∗ 𝑤)(𝑡) (2.2)

Here, 𝑤 should be a valid probability density function; otherwise the output will not be a weighted average. Generally, the convolutional layers apply the convolution operation and then send the result to the next layer, and the connected network is used to learn features and classify data. In convolutional network terminology, the first argument is considered the input (e.g., the function 𝑥) and the second argument the kernel (e.g., the function 𝑤); the output is sometimes known as the feature map.

Figure 2.2: A convolutional neural network. The network takes a single input image and computes N outputs. Each output is linked with a different loss function.

A typical CNN stage consists of three parts. In the first, convolution operations produce a set of linear activations. In the second, each linear activation is run through a nonlinear activation function, such as the rectified linear unit (ReLU). In the third, a pooling function is applied, for example max pooling, which keeps the maximum value from each cluster of neurons in the prior layer.
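These three stages map directly onto standard deep learning building blocks. The following PyTorch module is a minimal sketch; the channel sizes, the 224×224 input, and the five-class output are illustrative assumptions, not the exact architecture used in this thesis:

    # One convolution -> ReLU -> max-pooling stage followed by a classifier.
    import torch
    import torch.nn as nn

    class TinyActionCNN(nn.Module):
        def __init__(self, num_classes=5):
            super().__init__()
            self.features = nn.Sequential(
                nn.Conv2d(3, 16, kernel_size=3, padding=1),  # stage 1: linear activations
                nn.ReLU(),                                   # stage 2: nonlinearity
                nn.MaxPool2d(2),                             # stage 3: pooling
            )
            self.classifier = nn.Linear(16 * 112 * 112, num_classes)

        def forward(self, x):                    # x: (batch, 3, 224, 224)
            h = self.features(x)
            return self.classifier(h.flatten(1))

    logits = TinyActionCNN()(torch.randn(1, 3, 224, 224))    # shape (1, 5)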

2.8 Transfer Learning

Transfer learning [26] is a popular approach in deep learning that applies knowledge from one domain to another. In this setting, transfer learning performs significantly well without much effort spent on data labeling. When the data distribution changes, most machine learning models must be rebuilt from scratch using newly gathered training data, and collecting new training data and preparing a new model is expensive. Transfer learning has been applied successfully in many domains, such as web document classification [27]. Wu and Dietterich [28] applied transfer learning to an image classification problem that requires a large amount of labeled information, which is very costly; transfer learning reduces this labeling effort [30].

2.8.1 VGG

VGG is a deep convolutional neural network architecture developed by the Oxford Visual Geometry Group for large-scale image classification, in which the convolution filters are very small (3×3) and the depth is 16-19 weight layers [28]. VGG took first and second place in the ImageNet localization and classification tasks of 2014. During training, the input size of the ConvNet is 224×224, and input images are randomly cropped and rescaled to prepare them. The network has three fully connected (FC) layers: the first two have 4096 channels each, and the third contains 1000 channels, one for each ImageNet [1] class. The final layer is the soft-max layer.

Figure 2.3 portrays the VGG16 model for ImageNet [1].

Figure 2.3: VGG16 model for ImageNet
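In practice, VGG16 is typically used through transfer learning (Section 2.8): the pre-trained convolutional features are kept and only the final classification layer is replaced. The following sketch uses torchvision for illustration only; our own implementation is MatConvNet-based:

    # Hedged transfer-learning sketch: load ImageNet-pretrained VGG16,
    # freeze the convolutional features, and re-head it for five actions.
    import torch.nn as nn
    from torchvision import models

    vgg16 = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1)
    for p in vgg16.features.parameters():
        p.requires_grad = False                 # keep pre-trained features fixed
    vgg16.classifier[6] = nn.Linear(4096, 5)    # replace the 1000-way FC layer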

2.8.2 Inception

Inception is a deep convolutional neural network architecture developed for the large-scale image classification challenge that utilizes the computing resources inside the network efficiently [58]. In this architecture, the depth and width of the network are increased while the computational budget is kept the same. The connections between activations are mostly sparse, relating output channels to input channels. The Inception module has several versions, such as the naïve version and the dimension-reduction version. In the naïve version, one problem is that even a modest number of 5×5 convolutions can be expensive with a large number of filters, and the problem grows when a pooling layer is added; this design might cover the optimal sparse structure, but it is very inefficient. This motivates the dimension-reduction version, where 1×1 convolutions are computed before the 3×3 and 5×5 convolutions, together with rectified linear activations. For memory efficiency, it is helpful to use Inception modules only at higher layers and keep the lower layers conventional. This yields a significant quality gain at a small increase in computational requirements [59].

2.9 Optimizer

2.9.1 Adam

Adam is a simple, computationally efficient, first-order gradient-based optimization algorithm [30]. It optimizes stochastic objective functions, requires only first-order gradients, and has a small memory footprint. Adam combines the advantages of AdaGrad [60], which works well with sparse gradients, and RMSprop [61], which works well in non-stationary settings. One advantage of Adam is that it does not require a stationary objective; its efficiency can also be improved by changing the order of computation. An important feature of Adam is that its step sizes are selected carefully: updates are calculated using running averages of the first and second moments of the gradient. Adam often outperforms other methods on multilayer neural network models with non-convex objective functions, and combined with stochastic regularization methods such as dropout, an effective way to prevent overfitting, it is robust and suitable for a wide variety of non-convex optimization problems arising in practical deep learning applications. We therefore include Adam in our rollback-based ASSL approach for better performance.
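For reference, the standard Adam update from Kingma and Ba [30], for parameters θ with gradient g_t at step t, is:

    m_t = β₁ m_{t−1} + (1 − β₁) g_t              (first-moment running average)
    v_t = β₂ v_{t−1} + (1 − β₂) g_t²             (second-moment running average)
    m̂_t = m_t / (1 − β₁ᵗ),  v̂_t = v_t / (1 − β₂ᵗ)   (bias correction)
    θ_t = θ_{t−1} − α m̂_t / (√v̂_t + ε)

with typical defaults β₁ = 0.9, β₂ = 0.999, and ε = 10⁻⁸.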

2.9.2 RMSprop

A common problem in large networks is that if we start with a large learning rate, the weights of the hidden units also become large, both positive and negative; the error derivatives for the hidden units then vanish, and the error stops decreasing. There are several ways to speed up mini-batch learning. One is RMSprop, an optimization algorithm similar to Adam [30]. RMSprop with momentum generates its parameter update using momentum on the rescaled gradient, but it lacks bias-correction terms, which matters when 𝛽₂ is close to 1; for each weight, RMSprop adapts the step size separately [61]. Sutskever et al. [61] combined RMSprop with Nesterov momentum, which performs well. We adopt the RMSprop optimizer when training the two-person interaction dataset and obtain the outstanding performance shown in Chapter 4.

2.10 Related work of Human Action Recognition

2.10.1 Single person action recognition

Video-based human action recognition has received significant attention from computer vision researchers over the last two decades [15], [62], and a number of algorithms have been developed to recognize human actions from videos efficiently. Laptev et al. [63] used the Harris and Förstner interest point operators and detected local structures that have substantial local variations in both space and time; each detected event is classified with a jet descriptor, and the detected points correspond to meaningful events.

Laptev and Pérez [64] applied boosted space–time window classifiers to human motion and shape within an action, and they worked on realistic scenarios, such as movie files with variations in subject appearance, motion, surrounding scenes, viewing angles, and spatio-temporal extent. Iwashita et al. [65] used global motion descriptors that first combine dense optical flows and then local binary patterns (LBPs). Optical flows are separated into categories, and the number of flows in each category is counted; each scene is divided into an 𝑆 × 𝑆 grid, where 𝑆 can be 3, with eight motion directions, producing a histogram of optical flow (HOF) with 𝑆 × 𝑆 × 8 bins. The descriptor in each grid cell is constructed from optical flow over a short time interval of 0.2 seconds. Global motion descriptors of the optical flow can easily identify body shake and ball play. The LBP-based features encode the relations between the pixel values in a neighborhood of a reference pixel, after which a dimensionality reduction method is applied. Cuboid and STIP feature detectors were also used; for cuboids, they proposed normalized pixel values for the STIPs [66] using HOG and HOF, and they applied a dimensionality reduction method to compute local motion descriptors of 100 dimensions.

Considerable advances have been made in person detection in images, through STIP, HOG, bag of words (BOW) [67], the dense trajectory based approach, and the HMM-based approach. Most recently, the CNN-based approach [20], [16] has gained popularity due to its outstanding performance in the PASCAL VOC and ImageNet challenges [1]; these two benchmark datasets have contributed greatly to improving object recognition in image and video files. Additionally, the regional CNN (RCNN) [68], long-term temporal convolutions (LTC-CNN) [63], and the convolutional two-stream network [16] have played important roles, whereas our work focuses on semi-supervised learning with an active learning approach. Most relevant to our method are poses in human–object interaction activities [69] and recognizing actions through action-specific person detection [70]; the continuous learning framework uses a dataset similar to the one used in our approach. Storing each image file with its object bounding-box information in a separate XML file has become widespread due to its flexibility, as in object detection competitions such as PASCAL VOC, ImageNet, and Microsoft Common Objects in Context (COCO). The bounding box keeps the region-of-interest information of a particular object in an image.

2.10.2 Two person interaction recognition

Human action recognition is a wide research area in which many algorithms exist. Previous work relied on manually engineered features, for instance the histogram of oriented gradients (HOG) [8], space–time interest points (STIP) [15], dense trajectories [71], Fisher vectors [72], the deformable part model (DPM) [21], and bag of words (BOW) [73], which are popular for human action recognition. With the rise of deep models that learn multiple layers and generate high-level classifications, such as long-term temporal convolutions (LTC-CNN) [63], [74], ConvNets [68], [34] gained a good reputation, and CNNs [75], [58] have developed rapidly in recent years. We apply AL [33] on top of SSL, supported by a number of sampling strategies. The substantial difference in our method is its iterative forward learning, backward learning, and sample selection, combining both SSL and AL.

Much research has found that action patterns can be learned by convolutional and recurrent neural networks [76]. Both CNNs and RNNs have advantages and disadvantages: CNN-based methods are good at learning appearance but not long-term motion dynamics, whereas RNNs can learn temporal motion dynamics. Lin et al. [76] proposed Lattice-LSTM, which extends LSTM with independent hidden-state transitions of memory cells; they enhance the model and provide a multi-modal training process. Cao et al. [77] proposed a popular method that detects multiple people in videos and images in real time, applying a nonparametric representation referred to as part affinity fields (PAFs) to learn the body parts; they also present body and foot keypoint detectors that work for hand and facial keypoints as well [77]. Han et al. proposed a handcrafted-cue LSTM model for human action recognition that works on 25 skeleton joints in 3D coordinates; their work is mainly based on human posture and body movement, adopting a view-invariant skeleton transformation, and recognition performance is improved by using two handcrafted cues [78].

2.11 Summary

In this chapter we presented some of the state-of-the-art techniques used in our framework and a typical workflow for how the framework takes input data for training, involving supervised learning, semi-supervised learning, active learning, transfer learning, pre-trained networks such as VGG and Inception, optimizers such as Adam and RMSprop, and related work on classifying human actions. Optimization is often beneficial when training on our datasets, such as UT-Interaction [79] and HMDB [80], and we apply both the Adam and RMSprop optimizers in our experiments. We discuss the details of our algorithms and methodologies for classifying different human actions in Chapters 3 and 4.

Chapter 3

Overview of the proposed method

In this chapter, we present an overview of the proposed method in detail. We explain the necessity of the rollback based ASSL framework and analyze forward and rollback based active semi-supervised learning, where collaborative sampling and expected error reduction influence model performance. We show, step by step, the process of the Expected Error Reduction based Active Semi-supervised Learning (EER-ASSL) method for human action classification. Finally, we summarize the chapter in Section 3.7.

3.1 Necessity of Active Learning in our method

Active learning is a powerful learning methodology that can produce an efficient classifier from a small amount of labeled data [81]. In our work, we adopt active learning in a pool-based (batch) manner, which selects more informative training samples with less redundancy. The large portion of our unlabeled data gathered in the ITLab environment, as well as the benchmarks, is then exposed to an AL framework similar to Query By Committee (QBC) in order to minimize the version space [82]. The main objective is to improve learning performance with a minimum number of queries. Much previous active learning research focuses on selecting a single unlabeled sample per iteration, which is not efficient. Batch-mode active learning addresses this problem by training a subset of images together while building the classification model [83]. In our rollback based ASSL framework, we prepare both the training and test datasets in batch mode. On average, each batch contains 25 images from five action categories; details are given with the experimental results in Chapter-5.

We consider a provisionally labeled training dataset $DB_{provisional} = \{x_i\}_{i=1}^{l}$ and a human action model (softmax classifier) $S$, and we compute the prediction score over $N$ classes for each bounding box as given below, which is similar to Hasan et al. [17] and Rhee et al. [36]. The current classifier is applied to each image in $DB_{provisional}$, and the probability that an image $x_i$ belongs to class $q$ is defined as:

$$S(x_i) = p(y_i = q \mid x_i; WV) = \frac{\exp(WV_q^T x_i)}{\sum_{m=1}^{N} \exp(WV_m^T x_i)} \qquad (3.1)$$

where $q \in \{1, \dots, N\}$ is the set of class labels, $WV_q^T$ is the corresponding weight vector of class $q$, and the superscript $T$ denotes the transpose operation.
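As a concrete illustration, the following minimal NumPy sketch computes the prediction score of Eq. (3.1) for a batch of feature vectors; the weight matrix and the feature dimension are hypothetical placeholders, not values from our experiments.

import numpy as np

def softmax_scores(X, WV):
    """Compute p(y = q | x; WV) of Eq. (3.1) for each row of X.

    X  : (num_samples, feat_dim) feature vectors x_i
    WV : (num_classes, feat_dim) class weight vectors WV_q
    """
    logits = X @ WV.T                            # WV_q^T x_i for every class q
    logits -= logits.max(axis=1, keepdims=True)  # stabilize exp numerically
    exp = np.exp(logits)
    return exp / exp.sum(axis=1, keepdims=True)  # normalize over the N classes

# Toy usage with hypothetical sizes: 4 samples, 5 action classes, 10-d features
rng = np.random.default_rng(0)
S = softmax_scores(rng.normal(size=(4, 10)), rng.normal(size=(5, 10)))
print(S.sum(axis=1))  # each row sums to 1.0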

For each object class, a set of samples is sequentially selected and scored with $S(x_i)$ using the current classifier. The region of interest (ROI) is preserved for both the provisional dataset $DB_{provisional}$ and the batch dataset $DB_{batch}$; to increase productivity, the rest of the image is removed as part of our pre-processing so that each sample is noise-free.

We explore a basic AL technique proposed by Cohn et al. for nontrivial improvements [82] and follow a similar technique when eliminating noisy images from the batch dataset. Under this assumption, let $M$ be a classifier model trained with the provisionally labeled dataset. The algorithm requests an image from each batch dataset $BD$ and, based on the evaluation score against $M$, eliminates the noisy image $E_i$. Here, $TH$ is the threshold and $CL$ is the correctly labeled batch data. A formal definition of this algorithm is given below:

Algorithm 0: Eliminating noisy images using AL

Input: Unlabeled batch data BD; pre-trained model M trained with the provisionally labeled dataset
Output: Labeled batch data CL

Begin
    M ← model trained on the provisionally labeled data; t ← 0
    Foreach i = 1 to N
        If SSE(BD_i) > TH
            CL ← CL ∪ {BD_i}
            t ← t + 1
        Else
            BD ← BD − E_i
        If t = n, stop
End
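A minimal Python sketch of this elimination loop is given below; the scoring callable stands in for the evaluation of a batch image against the model M, and the threshold value is an illustrative placeholder rather than the one tuned in our experiments.

def eliminate_noisy_images(batch, model_score, threshold=0.9):
    """Keep only batch images whose evaluation score against model M
    exceeds the threshold TH; the rest are treated as noisy and dropped.

    batch       : list of images (any representation the scorer accepts)
    model_score : callable image -> float, evaluation score under model M
    """
    correctly_labeled = []   # CL in Algorithm 0
    for image in batch:      # BD in Algorithm 0
        if model_score(image) > threshold:
            correctly_labeled.append(image)
        # else: the image is the noisy sample E_i and is eliminated
    return correctly_labeled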
The algorithm describes the process of using AL to eliminate noisy images. It is part of the main algorithm and removes unwanted images that are wrongly labeled and have a low evaluation score. One of the challenges of working with AL is the very small size of the training data and the reliance on feedback-driven queries. In addition, when selecting multiple examples in batch mode, it may select similar images. Moreover, Dasgupta et al. [84] showed that an exponential improvement is not always achievable in AL.

In order to address the above issues, we combine AL and SSL. While active learning focuses on exploring the unknown aspects of the data, semi-supervised learning exploits the unlabeled data directly. ASSL leverages the advantages of both. Our rollback based ASSL iteratively exploits the unlabeled data via both active learning and semi-supervised learning. One of the main objectives of using ASSL is that a smaller amount of labeled data can achieve prediction performance similar to that of a much larger labeled dataset. Our experimental results show that ASSL achieves better performance than conventional learning techniques; because semi-supervised learning and active learning exploit the unlabeled data in different ways, their combination achieves better performance.

Finally, we evaluate each set of batch data $DB_{batch}$ against the trained model $M_0$ and compare the evaluation score with the threshold $TH$. Based on the evaluation score, we decide whether to retrain on or eliminate the $DB_{batch}$ image samples. We keep retraining on $DB_{batch}$ until it reaches convergence. In this way we obtain the optimally labeled images $DB_{optimal}$ and the optimized model $M_{optimal}$. Moreover, depending on the $DB_{optimal}$ evaluation score, we decide when to utilize the rollback option and which $DB_{optimal}$ dataset to avoid.

3.2 Pre-trained Neural Network

In this section we present the network architecture used in our experiments. The primary benefit of using a deep convolutional network in our method is that, instead of training the network from scratch, we can exploit a network model pre-trained on a large dataset. We use the VGG16 pre-trained model from the Oxford Visual Geometry Group [20][33] and the Inception model [59] in our experiments. The sample input images are carried through the CNN, and our network architecture follows principles similar to those of Simonyan & Zisserman [28]. First, each pre-processed image of size $(H \times W)$ is taken as the input of the network. The rescaled image is passed through a number of convolutional layers with small $3 \times 3$ filters, with 1 pixel of padding for every $3 \times 3$ convolutional layer. Pooling is carried out by five max-pooling layers. Let $y$ be obtained from $x$ by max pooling, where $H'$ is the length of the filter [35]:

$$y_{i''j''d} = \max_{1 \le i' \le H',\, 1 \le j' \le W'} x_{i''+i'-1,\, j''+j'-1,\, d} \qquad (3.2)$$

For max pooling, the stride is 2 and the window is $2 \times 2$ pixels. The first two fully connected layers have 4096 channels each, and the last fully connected layer has 1000 channels for classification [1]. The final layer is known as the softmax layer. The softmax operator [35] can be calculated as follows:

$$y_{ijk} = \frac{e^{x_{ijk}}}{\sum_{t=1}^{D} e^{x_{ijt}}} \qquad (3.3)$$

All hidden layers are equipped with the Rectified Linear Unit (ReLU) activation function. The ReLU operator can be stated in matrix notation as

$$y_{ijd} = \max\{0, x_{ijd}\} \qquad (3.4)$$
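To make the transfer-learning setup concrete, the sketch below builds a VGG16-based classifier along the lines described above using the Keras API; the five-class head and the training hyper-parameters are illustrative assumptions, not our exact configuration.

import tensorflow as tf

NUM_ACTIONS = 5  # assumption: five action classes, as in our datasets

# Pre-trained VGG16 convolutional base (ImageNet weights), 224x224 input
base = tf.keras.applications.VGG16(weights="imagenet", include_top=False,
                                   input_shape=(224, 224, 3))
base.trainable = False  # reuse learned features instead of training from scratch

model = tf.keras.Sequential([
    base,
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(4096, activation="relu"),   # fully connected, ReLU
    tf.keras.layers.Dense(4096, activation="relu"),
    tf.keras.layers.Dense(NUM_ACTIONS, activation="softmax"),  # softmax layer
])
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
              loss="categorical_crossentropy", metrics=["accuracy"])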

3.3 EER-ASSL Learning

Figure 3.1: Expected Error Reduction-based Active Semi-Supervised Learning (EER-ASSL) architecture.

The proposed method utilizes collaborative sampling in EER-ASSL. EER-ASSL incorporates the rapid modeling capability of the EER method [86][87] and the incremental learning capability of a CNN [88]. The method takes advantage of the AL algorithm and the bin-based SSL algorithm with rollback functionality. It minimizes training time while maintaining a high-quality labeled dataset and a highly accurate person detector in an adaptive learning process [89]. The collaborative sampling method can select more informative and reliable samples with low redundancy. Since the uncertainty criterion can trigger the selection of noisy or redundant samples, a diversity criterion is implemented in the clustering based redundant sample removal algorithm [88].

This is a brief sketch of the EER-ASSL method utilized for rapid adaptive object detection in a dynamically changing environment. A batch of samples is selected from an image stream based on uncertainty and diversity sampling, instead of a single image at a time. After collaborative sampling is finished, the samples are divided into bins for the rollback SSL algorithm. Each bin consists of unlabeled samples, which are pseudo-labeled by the bin-based SSL using both the CNN model and the EER-based rollback learning method. In many cases the pseudo-labeled dataset contains incorrectly labeled samples or biased labels. Such noisy samples should be excluded from the confident samples, since they have a harmful effect on building a better person detector. The proposed EER-based learning method applies rapid forward and rollback learning, which leads to more informative sampling and learning through reselection and relabeling.

We use a very limited amount of labeled data for training the EER model and the fine-tuned CNN model; both are built using these limited labeled samples. The ensemble network consists of the CNN detector and the EER model and is used to conduct person detection. Incremental ASSL is incorporated, where a batch of data samples is collected from an input data stream and selectively sampled by the collaborative sampling algorithm [88]. We partition the selected samples into a number of bins. The initially labeled dataset and the pseudo-labeled training dataset produced by the CNN and EER ensemble are used for training new CNN and EER models, which in turn update the models for the next bin cycle. The new EER model is involved in the rollback learning process, which consists of the removal, relabeling, and reselection of samples from the bin, if necessary. The bin-based incremental learning proceeds with forward learning for sample reselection and rollback learning for removal and relabeling. The new CNN model is also used in the next round of collaborative sampling. The process is repeated until convergence.

3.4 Forward learning process

The expected error reduction (EER) method, which is popular in pattern classification problems [86][87][90][91], is deployed in EER-ASSL for improved performance. The main goal of the EER method is to select samples that can reduce the generalization error in the next step. Because we cannot inspect the testing dataset in advance, a portion of the validation dataset is used to estimate the future error. The future errors are approximated using the expected log-loss over the unlabeled data. Let $L = \{(x_i, y_i)\}_{i=1}^{m}$ represent the labeled training dataset and $U = \{x_i\}_{i=m+1}^{n}$ an unlabeled dataset, where $m \ll n$. If a selected sample $x$ is labeled $y$ and added to $L$, this is denoted by $L^+ = L \cup (x, y)$. Let $g_L$ denote the EER model trained from $L$ and $g_{L^+}$ the model trained from $L^+$. Adopting the log loss, EER selects the sample that satisfies the following equation:

$$x^* = \arg\min_{x \in U} \sum_{y \in C} P(y \mid x; g_L) \times \left( - \sum_{x' \in U,\, y' \in C} P(y' \mid x'; g_{L^+}) \log(y', x', g_{L^+}) \right) \qquad (3.5)$$

where $C$ represents the person classes, the first term $P(y \mid x; g_L)$ denotes the label information of the current model, and the second term is the sum of the expected entropy over the unlabeled data $U$ under the model $g_{L^+}$. Eq. (3.5) represents the serial-mode learning process, updated immediately after the labeling of each new data sample in $U$. Eq. (3.5) is rewritten for bin $B_i$ as follows:

$$x^*_{B_i} = \arg\min_{x \in B_i} \sum_{y \in C} P(y \mid x; g_L) \times \left( - \sum_{x' \in B_i,\, y' \in C} P(y' \mid x'; g_{L^+}) \log(y', x', g_{L^+}) \right) \qquad (3.6)$$

Here, the first term represents the label information of the current model, and the second term represents the sum of the expected entropy over the unlabeled data bin $B_i$ under the model $g_{L^+}$.

The forward learning process mainly takes care of the bins for reselection and retraining. The reselected samples are then added to the current labeled dataset in order to retrain the CNN model for the bin-based SSL.
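The sketch below illustrates the bin-wise EER selection of Eq. (3.6) in simplified form: for each candidate sample in a bin, each possible label is tried, a hypothetical updated model is queried, and the sample minimizing the expected entropy over the remaining bin data is chosen. The retrain and predict_proba callables are assumed stand-ins for refitting and querying the EER model, not a specific library API.

import numpy as np

def eer_select(bin_X, labeled_X, labeled_y, g_L, predict_proba, retrain, classes):
    """Pick the bin sample x* minimizing the expected entropy of Eq. (3.6).

    bin_X         : (n_bin, d) candidate samples in bin B_i
    g_L           : current EER model
    predict_proba : (model, X) -> (n, |C|) class probabilities
    retrain       : (X, y) -> model, refits the EER model on L+ = L U (x, y)
    """
    probs = predict_proba(g_L, bin_X)                 # P(y | x; g_L)
    best_idx, best_score = None, np.inf
    for i in range(len(bin_X)):
        expected = 0.0
        for c_idx, c in enumerate(classes):
            # hypothetically label bin_X[i] as class c and refit g_{L+}
            g_plus = retrain(np.vstack([labeled_X, bin_X[i][None]]),
                             np.append(labeled_y, c))
            p = predict_proba(g_plus, bin_X)          # P(y' | x'; g_{L+})
            entropy = -(p * np.log(p + 1e-12)).sum()  # expected entropy on B_i
            expected += probs[i, c_idx] * entropy     # weighted by P(y|x; g_L)
        if expected < best_score:
            best_idx, best_score = i, expected
    return best_idx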

3.5 Rollback learning process

The goal of the rollback learning process is to find uncertainly labeled samples that hamper the current model's performance and to replace or relabel them. Rollback learning has the ability to select a particular model based on the evaluation score, whereas feedback learning can only acknowledge the outcome; moreover, rollback can return to its previous state. We use EER sampling to minimize the expected entropy over the pseudo-labeled data samples. Considering the computation time, we only inspect the most recent pseudo-labeled dataset instead of the entire dataset. Using the rollback process, we remove or relabel pseudo-labeled samples. The rollback learning process certifies the labels of a rollback sample by relabeling it or reselecting from its neighborhood. Rollback learning has two obligations: 1) the unnecessary-label removal process and 2) the relabeling process.

3.5.1 Removal process

The main objective of EER rollback learning is to minimize the expected entropy. Rollback learning can be divided into two steps: a) removing unnecessary pseudo labels that hamper the EER model's performance, and b) relabeling the samples to update the model. The rollback removal process is formulated as follows:

$$x^{\dagger} = \arg\min_{x \in L} \sum_{y \in C} P\left(y \mid x; g_{L \setminus (x, y^{\dagger})}\right) \times \left( - \sum_{x' \in U,\, y' \in C} P\left(y' \mid x'; g_{L \setminus (x, y^{\dagger})}\right) \log\left(y', x', g_{L \setminus (x, y^{\dagger})}\right) \right) \qquad (3.7)$$

Here, $L \setminus (x, y^{\dagger})$ denotes the labeled dataset with the sample $(x, y^{\dagger})$ removed. As Eq. (3.7) requires heavy computation and is not computable in practice over the full dataset, rollback samples are selected from the pseudo-labeled data samples of the current step.
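In practice the removal step can be approximated by leave-one-out retraining over only the most recent pseudo-labeled samples, as the following sketch shows; retrain and predict_proba are the same assumed helpers as in the forward-learning sketch.

import numpy as np

def rollback_removal(pseudo_X, pseudo_y, unlabeled_X, predict_proba, retrain):
    """Find the pseudo-labeled sample whose removal minimizes the expected
    entropy over the unlabeled data, an approximation of Eq. (3.7)."""
    best_idx, best_entropy = None, np.inf
    for i in range(len(pseudo_X)):
        keep = np.arange(len(pseudo_X)) != i          # L \ (x, y^dagger)
        g = retrain(pseudo_X[keep], pseudo_y[keep])   # refit without sample i
        p = predict_proba(g, unlabeled_X)
        entropy = -(p * np.log(p + 1e-12)).sum()      # expected entropy on U
        if entropy < best_entropy:
            best_idx, best_entropy = i, entropy
    return best_idx  # candidate sample to remove or relabel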

3.6 EER-ASSL Algorithm

In the AL process, a batch of data samples is collected from an input data stream, processed by the collaborative sampling algorithm to obtain informative samples with minimum redundancy, and then partitioned into bins. EER is combined with bin-based SSL for rapid adaptive learning. Instead of using a large dataset, a limited number of labeled samples are used to create the CNN model. If the performance of the CNN is hampered during learning, the EER method is applied for rollback learning. The labeled dataset, denoted $LD$, is enlarged by adding the pseudo-labeled data samples, and the enlarged $LD$ is used by the CNN and EER models. The process is repeated until convergence. Therefore, the EER rollback model provides rapid short-term adaptation, while the CNN detector model provides confident, incremental long-term performance improvement.

Let $D_{div}$ denote the samples mined from the current batch after collaborative sampling; a detailed discussion can be found in [88]. The rollback bin-based SSL algorithm is explained below. We let $D_{\Delta}$ denote the confident batch dataset for the bin-based SSL. When the cardinality of $D_{\Delta}$ reaches the confidence parameter $\gamma$, the confident sample selection process is stopped. $D_{\Delta}$ is initialized with the sample that satisfies $x_{top} = \arg\max_{x \in D_{div}} f(x)$, $x_{top} \in D_{div}$. The confident sampling strategy then selects a sample from $D_{div}$ and adds it to $D_{\Delta}$ according to the distance metric of the current deep feature space, using $x_{top} = \arg\max_{x \in D_{div}} \{ \max_{x_i, x_j \in D_{\Delta}} d(x_i, x_j) \}$, where $d(x_i, x_j)$ is the Euclidean distance between two samples $x_i$ and $x_j$ in the deep feature space. The CNN is retrained using the bin sequence built from the confident samples in $D_{\Delta}$. The confident samples are partitioned into bins and stored in a bin pool denoted by $\boldsymbol{B} = \{B_j\}_{j \in \boldsymbol{B}} = \{B_0, \dots, B_j, \dots, B_J\}$.
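A minimal sketch of this confident, diversity-driven selection is given below; it initializes the selected set with the top-scoring sample and then, in one common reading of the criterion implemented here as farthest-point sampling, greedily adds the candidate farthest (in Euclidean distance) from the samples already selected, stopping at the confidence parameter. The deep features are assumed to be precomputed.

import numpy as np

def confident_diverse_selection(features, scores, gamma):
    """Greedy selection of D_Delta from D_div.

    features : (n, d) deep feature vectors of the mined samples D_div
    scores   : (n,) detector confidences f(x)
    gamma    : number of samples to select (confidence parameter)
    """
    selected = [int(np.argmax(scores))]        # x_top = argmax f(x)
    while len(selected) < min(gamma, len(features)):
        # distance of every candidate to its nearest already-selected sample
        dists = np.linalg.norm(
            features[:, None, :] - features[selected][None, :, :], axis=2)
        nearest = dists.min(axis=1)
        nearest[selected] = -np.inf            # never re-pick a selected sample
        selected.append(int(np.argmax(nearest)))  # most diverse candidate
    return selected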

In each rollback bin-based SSL step, confidence scores are assigned to the pseudo samples by the current CNN detector. The labeled data $D_0$ is used to initialize the CNN detector model $f_0$ and the EER model $g_0$ at the beginning. $Acc_0$ is calculated by $f_0$ using the validation data. For each bin, we build the CNN models $\{f_0^{B_j}\}_{j=1}^{J}$ using $D_0 \cup B_j$, respectively. Let $Acc_1$ indicate the maximum accuracy among the scores of the bins calculated by $\{f_0^{B_j}\}_{j=1}^{J}$, i.e., $Acc_1 = \max_{B_j}\{Acc_0^{B_j}\}$. If the performance improves, i.e., $Acc_1 \ge Acc_0$, we proceed to the next step by updating $D_1 = D_0 \cup B^*$ and $f_1 = f_0^{B^*}$. At time step $i$, for each bin, we build the CNN models $\{f_i^{B_j}\}_{j=1}^{J}$ using $D_i \cup B_j$, respectively, and $Acc_{i+1} = \max_{B_j}\{Acc_i^{B_j}\}$. The cases are divided into three: Case 1) $Acc_{i+1} \ge Acc_i$; Case 2) $Acc_i - \tau < Acc_{i+1} < Acc_i$; and Case 3) $Acc_{i+1} \le Acc_i - \tau$, where $\tau$ is a tolerance threshold for exploration potential.


Case 1: we take the best bin for the next step and update $D_{i+1} = D_i \cup B^*$ and $f_{i+1} = f_i^{B^*}$; $B_i = B^*$; $\boldsymbol{B} = \boldsymbol{B} \setminus B_i$. The bin pool $\boldsymbol{B}$ is reduced by removing the selected bin.

Case 2: we conduct the following sub-steps: 1) find the removal samples from $\Delta_i$ using the rollback removal process of Eq. (3.7), 2) find the relabeling samples and assign them new labels from $\Delta_i$ using the relabeling rollback process, and 3) update $\Delta_i$ by reselection using the EER forward learning process based on Eq. (3.6).

The above forward-rollback learning processes are repeated until the condition $Acc_{i+1} \ge Acc_i$ or $Acc_{i+1} \le Acc_i - \tau$ holds, or a time limit is reached. If the condition $Acc_{i+1} \ge Acc_i$ is satisfied, we update $D_{i+1} = D_i \cup \Delta_j$, $f_{i+1} = f_i^{\Delta_j}$, $g_{i+1} = g_i^{\Delta_j}$, and $\boldsymbol{B} = \boldsymbol{B} \setminus B_i$.

Case 3: the oracle labels the incorrectly labeled data in $B^*$, and we update $f_{i+1} = f_i$, $g_{i+1} = g_i$, and $D_{i+1}$.

The rollback process of Case 2 can significantly reduce the number of oracle labeling steps. $(D_i \cup \Delta_i)$ is used to build the training dataset $D_{i+1}$, which is used for training $f_{i+1}$ and $g_{i+1}$ at the next time step. The process is repeated until convergence. Finally, the rollback bin-based SSL produces the two models $f$ and $g$ and an enlarged labeled dataset $LD$. The combination of EER based rollback learning and bin-based SSL yields a rapidly adaptive object detector, even from noisy streaming samples in a dynamically changing environment. The rollback bin-based SSL algorithm is summarized in Algorithm 1.

Algorithm 1. Rollback bin-based SSL

Input: bin pool B
Output: CNN model f, EER model g, and labeled dataset LD

Repeat until B = ∅:
    1. For each bin B_i ∈ B, build f_{i+1} using D_i ∪ B_i and calculate Acc_i^{B_i}.
    2. Acc_{i+1} = max_{B_j} {Acc_i^{B_j}}.
    3. If Acc_{i+1} ≥ Acc_i:
           B* = argmax_{B_j} {Acc_i^{B_j}}; D_{i+1} = D_i ∪ B*;
           f_{i+1} = f_i^{B*}; B_i = B*; B = B \ B_i.
    4. Else if Acc_i − τ < Acc_{i+1} < Acc_i:
           While Acc_i − τ < Acc_{i+1} < Acc_i:
               4.1 Remove samples from Δ_i (the removal rollback process, Eq. (3.7)).
               4.2 Relabel the samples in Δ_i (the relabeling rollback process).
               4.3 Reselect samples from B_i using the forward learning process (Eq. (3.6)).
               4.4 If Acc_{i+1} ≥ Acc_i: D_{i+1} = D_i ∪ Δ_j,
                   f_{i+1} = f_i^{Δ_j}, g_{i+1} = g_i^{Δ_j}, and B = B \ B_i; i++.
           Else if Acc_{i+1} < Acc_i − τ or the time limit is reached: the oracle
               labels the incorrectly labeled data in B*;
               f_{i+1} = f_i, g_{i+1} = g_i, update D_{i+1}; i++.
Return {f = f_{i+1}, g = g_{i+1}, LD = D_{i+1}}
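The control flow of Algorithm 1 can be summarized in the following simplified Python sketch; train_cnn, evaluate, and refine_bin are assumed helpers for building a model on a dataset, scoring it on validation data, and applying the remove/relabel/reselect rollback sub-steps, respectively.

def rollback_bin_ssl(bins, D, acc, train_cnn, evaluate, refine_bin, tau):
    """Simplified control flow of Algorithm 1 (rollback bin-based SSL).

    bins       : list of bins, each a list of pseudo-labeled samples
    D          : current labeled dataset (list)
    train_cnn  : dataset -> model             (assumed helper)
    evaluate   : model -> validation accuracy (assumed helper)
    refine_bin : bin -> bin after the remove/relabel/reselect rollback steps
    """
    f = train_cnn(D)
    while bins:
        scored = [(evaluate(train_cnn(D + b)), b) for b in bins]
        acc_next, b_best = max(scored, key=lambda t: t[0])
        if acc_next >= acc:                       # Case 1: accept the best bin
            D, acc = D + b_best, acc_next
            f = train_cnn(D)
        elif acc - tau < acc_next:                # Case 2: rollback learning
            b_ref = refine_bin(b_best)            # Eqs. (3.6)/(3.7) sub-steps
            acc_ref = evaluate(train_cnn(D + b_ref))
            if acc_ref >= acc:
                D, acc = D + b_ref, acc_ref
                f = train_cnn(D)
        else:                                     # Case 3: oracle relabels the
            pass                                  # incorrectly labeled data
        bins.remove(b_best)
    return f, D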

3.7 Summary

In this chapter, we have presented the rollback based ASSL method in detail, together with expected error reduction sampling. EER-ASSL utilizes active learning along with forward and rollback learning. AL allows us to select more informative samples by measuring uncertainty and diversity. We have described how semi-supervised learning and active learning play major roles, while rollback addresses the critical issue of performance improvement. In the next chapter, we describe our algorithm applied to single-person action recognition and two-person interaction recognition.

Chapter 4

Human Action Recognition

In this chapter, we present in detail the general ASSL techniques applied to single-person action recognition, two-person interaction recognition, and EER-ASSL based action recognition. Human action recognition here includes recognizing actions from still images, where simple day-to-day activities are considered, such as walking, sitting, taking a photo, and using a computer. These types of action may arise in situations where the observation of people is necessary, such as pedestrians crossing the street, hospital patients in need of care, or restricted areas that require monitoring. We further explain two types of static human action recognition, one-person action recognition and two-person interaction recognition, within the rollback based ASSL framework. Finally, we summarize the chapter in Section 4.3.

4.1 One Person Action Recognition

Rollback based ASSL for single-person action recognition is shown in figure 4.1. Each step relates to the architecture given below.

Figure 4.1: Architecture for one-person action recognition.

In our method, we demonstrate robustness by utilizing deep features incorporated with active semi-supervised learning, as explained below. Intensity normalization [31] and cropping play a major role in reducing noise in our method. At the beginning of our experiment, we remove unnecessary portions of the image that are not useful for recognizing the human action and preserve only the ROI of the human interaction. We normalize each image to size 224 × 224. The same human action may vary in its feature vector; therefore, image intensity and its differences play a significant role in action recognition. Consequently, we apply intensity normalization to tackle this problem.

We minimize lighting interference by altering the range of pixel intensities, which helps to reduce contrast. Normalization transforms [90] a color image into a grayscale image $I$ with the intensity value range $(Min, Max)$:

$$I : \{X \subseteq \mathbb{R}^n\} \rightarrow \{Min, \dots, Max\} \qquad (4.1)$$

into a new image whose intensity values lie in the range $(newMin, newMax)$:

$$I_n : \{X \subseteq \mathbb{R}^n\} \rightarrow \{newMin, \dots, newMax\} \qquad (4.2)$$

We eliminate illumination interference via intensity normalization using the Gaussian-weighted average [32] of light intensity, as explained below. First, the Gaussian-weighted average of a pixel's neighborhood is subtracted from the pixel value. Second, each pixel is divided by the standard deviation of its neighborhood. Equation (4.3) represents the pixel value calculation of intensity normalization:

$$X' = \frac{X - \mu_{nhg_X}}{\sigma_{nhg_X}} \qquad (4.3)$$

where $X'$ represents the new pixel value, $X$ is the original pixel value, $\mu_{nhg_X}$ is the Gaussian-weighted average of the neighborhood of $X$, and $\sigma_{nhg_X}$ is the standard deviation of the neighbors of $X$.
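The local normalization of Eq. (4.3) can be realized with a Gaussian filter, as in the OpenCV-based sketch below; the kernel size and sigma are illustrative assumptions rather than our tuned values.

import cv2
import numpy as np

def gaussian_intensity_normalize(gray, ksize=(31, 31), sigma=10.0, eps=1e-6):
    """Apply Eq. (4.3): subtract the Gaussian-weighted neighborhood mean
    and divide by the neighborhood standard deviation, per pixel."""
    img = gray.astype(np.float32)
    mu = cv2.GaussianBlur(img, ksize, sigma)                 # mu_nhg(X)
    mu_sq = cv2.GaussianBlur(img * img, ksize, sigma)
    sigma_local = np.sqrt(np.maximum(mu_sq - mu * mu, 0.0))  # sigma_nhg(X)
    return (img - mu) / (sigma_local + eps)                  # X' of Eq. (4.3)

# Usage: normalized = gaussian_intensity_normalize(
#            cv2.imread("frame.jpg", cv2.IMREAD_GRAYSCALE))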

4.1.1 One Person Action recognition analysis

We propose an active semi-supervised learning (ASSL) framework for action recognition from video and single-image data, as shown in figure 4.1. We rely on action-specific person detection, where each image is paired with bounding-box information stored in an extensible markup language (XML) file. We train on the image dataset and build our initial model. If the initial model does not predict the right action, as verified by the SSL process, we employ an active learning method to increase recognition performance.

Our proposed algorithm for single-person action recognition is given in the section below. The algorithm performs well on both local and benchmark datasets. The input images are unlabeled and ordered by batch size for sequential training. The VGG16 [28] and Inception [59] pre-trained models are used with the RMSProp [31] and Adam [30] optimizers to train the action dataset for the initial model. The current batch of images is combined with the previous batch. We apply AL where the confidence value is over a threshold of 0.9. Finally, the Action Recognition (AR) classification model's performance is observed, and if the performance is not satisfactory we bring in an object detection model. If we correctly detect the object, we apply ensemble learning to improve the classification of human actions and to reduce the likelihood of an unlucky selection of a poor model. Ensemble learning usually takes multiple models as input and integrates them to build a predictive decision model [66]. In our algorithm, ensemble learning combines the decisions of AL, object detection, and the training datasets, thereby improving prediction performance. We include a rollback procedure to obtain better results and to reduce the amount of inconclusive data. In the rollback procedure, we compare the current batch results with the previous batch results. If the current batch results show improvement, we do nothing; but if the current batch performs poorly, we roll back to the previous result. As a consequence, a new batch of images replaces the current batch.
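The batch-level rollback procedure can be sketched as the following loop; train and evaluate are assumed helpers, and keeping the previous model object stands in for whatever checkpointing mechanism the training framework provides.

def batch_rollback_training(batches, model, train, evaluate):
    """Train batch by batch, rolling back when a batch hurts performance.

    train    : (model, batch) -> new model   (assumed helper)
    evaluate : model -> validation accuracy  (assumed helper)
    """
    best_model, best_acc = model, evaluate(model)
    for batch in batches:
        candidate = train(best_model, batch)
        acc = evaluate(candidate)
        if acc >= best_acc:                # improvement: keep the new model
            best_model, best_acc = candidate, acc
        # else: roll back, i.e. discard this batch and keep the previous
        # model; the next batch replaces the current one
    return best_model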

4.2. Two person interaction recognition

The flow diagram of our framework is shown in figure 4.2, where a video frame is transformed into an image sequence, noise is removed, and the action-area Region of Interest (ROI) is selected to form the training and test datasets.

Figure 4.2: Block diagram for two-person action recognition

4.2.1 Two person interaction recognition Analysis

The proposed ASSL method is presented for interaction classification. We denote the unlabeled data as $UD_i$, and the model pre-trained on the labeled training datasets as $M_{i-1}$. The VGG16 model is trained with the dataset, and we denote the initial model as $M_0$. $N$ refers to the maximum number of datasets. Each unlabeled dataset is evaluated with model $M_0$, and the result initializes $LD_{temp}$. In the next step, the current dataset ($CD_0$) is combined with $LD_{temp}$ and assigned to $CD_{temp}$. Note that AL is applied in the second step when the confidence exceeds the threshold of 0.9, and the result is denoted $LD_i$. Subsequently, $CD_{i-1}$ is joined with $LD_{temp}$ and trained to obtain model $M_{temp}$. Finally, model $M_{temp}$ is compared to the previous model; if the current model's performance is higher than the previous model's, training continues, and otherwise the dataset is rolled back. The rollback procedure helps to obtain better performance: if the current batch of data does not perform well, we can return to the previously better performing model. One of the important features of rollback learning is that it replaces the current batch data with noise-free data.
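The AL selection at the 0.9 threshold can be sketched as follows: samples whose top softmax confidence under the current model exceeds the threshold are accepted as pseudo-labels and merged into the current dataset, a minimal illustration of the step that produces $LD_i$.

import numpy as np

def select_confident_pseudo_labels(model_probs, threshold=0.9):
    """Keep unlabeled samples whose maximum class probability exceeds
    the AL threshold; return their indices and pseudo-labels.

    model_probs : (n, num_classes) softmax outputs of the current model
    """
    confidence = model_probs.max(axis=1)
    labels = model_probs.argmax(axis=1)
    keep = confidence > threshold
    return np.flatnonzero(keep), labels[keep]   # indices into UD_i, LD_temp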

4.3 EER-ASSL based Human Action Recognition

Figure 4.3: Flow diagram of EER-ASSL based Human action recognition

4.3.1 Person Detection
Many CNN based person detectors are designed for a static data distribution and are unable to handle drift, fast motion, and occlusion during person detection in real-world environments. Moreover, in a dynamic environment with a complex setting, person detection becomes challenging due to aspect-ratio differences and the vanishing problem. Our proposed EER-ASSL based method (discussed in Chapter-3) can overcome these challenges, applying the prevailing person detector to obtain the desired detection result. When persons show similar appearances, a possible occlusion is investigated. We adopt the state-of-the-art object detector YOLOv2 [92] in our EER-ASSL method to take advantage of its person detection. Most common detectors are trained with thousands of images to improve detection performance, but labeling images for detection is very expensive [92].

4.3.2 Human action smoothing and classification

Necessity of SSL and AL in our method:

Initially, person detection is applied to find out whether a human exists in the scene. Labeling images for classification is expensive, yet unlabeled data on its own carries almost no information relevant to human action recognition; therefore, in order to improve person detection accuracy we require an SSL algorithm. Among the large volume of samples, we need a way to gather informative samples that contribute to EER-ASSL performance. For this reason we rely on a sampling strategy called collaborative sampling, which can reduce sampling bias. Sampling bias or wrongly labeled samples can hamper detection performance, so AL helps our proposed system overcome wrongly labeled samples by relabeling. Moreover, expected error reduction sampling becomes handy when only a subsample of a few hundred training examples is used to prepare the EER model and reduce the generalization error.

The selected samples are divided into a number of bins in order to process the data. In each bin we apply AL, then train and evaluate. After that, bin 1 is carried forward to bin 2, and the entire process continues until bin N. If the next bin's model performance is better, we consider this forward learning; otherwise rollback learning is triggered, which leads to skipping the poorly scored bin. The entire process continues until convergence is achieved. EER-ASSL incremental learning includes forward learning (discussed in Chapter-3, Section 3.4) and backward learning (discussed in Chapter-3, Section 3.5), supported by active semi-supervised learning, in order to fine-tune and update the model efficiently.

Human actions, such as sports activities or real-world events, involve abrupt changes in the environment, which we detect by comparing the localized bounding boxes. If drift is detected, we estimate the smoothing for validation and apply the updated EER-ASSL model (shown in figure 4.3) to overcome the drift problem. In this way we select the best possible model for detecting a person's action. In the next step we extract the person features from the input image and apply smoothing for improved human action classification performance.

4.4 Summary

In this chapter, we have presented the methods of human action recognition in detail, using the rollback based EER-ASSL method. For accurate human action recognition, image pre-processing, normalization, sampling strategies such as EER sampling, and deep CNN training all play important roles. We have described in detail how semi-supervised learning and active learning, used together with rollback, address the critical issues of human action recognition and help improve performance. The experimental results that validate our method are shown in detail in the next chapter.

Chapter 5

Experiments on Action recognition

In this chapter we mainly discuss the different datasets used in our experiments, namely ITLab and the benchmark datasets, e.g., PASCAL VOC, Stanford 40, UT-Interaction, KTH, and HMDB. On the ITLab dataset we demonstrate single-person action recognition and two-person interaction recognition in both simple and complex environments, and we then include EER-ASSL based action recognition. We present how our method outperforms other state-of-the-art methods using the Effective Adaptive Deep Learning (EADL) algorithm and Effective Hybrid Learning (EHL) algorithm mentioned in Chapter-3. Finally, we show the evaluation results and compare them to other state-of-the-art methods.

5.1 Single person action recognition


5.1.1 Dataset Overview

Datasets for human action recognition differ from each other due to dissimilar annotation structures. At the same time, manual annotation with ground truth is time-consuming and tedious work. Some of the available activity recognition datasets are listed below. We conduct our experiments on three datasets: the PASCAL VOC Action dataset [93], the Stanford 40 Actions dataset [94], and the ITLab action dataset.

5.1.2 PASCAL VOC Action dataset

A total of 10 classes of human action are included in the PASCAL VOC action dataset: Jumping, Phoning, Playing an Instrument, Reading, Riding a Bike, Riding a Horse, Running, Taking a Photo, Using a Computer, and Walking. Among them, we selected the five actions that overlap with the ITLab action dataset for our experiment, in order to enable a better evaluation. The training and validation data comprise 11,530 images containing 27,450 ROI-annotated objects. For this experiment we selected only a limited number of action images.

Figure 5.1: Five example action categories, (a) to (e): jumping, phoning, taking a photo, using a computer, and reading, from the PASCAL VOC 2012 action recognition dataset. These classes illustrate human actions strongly characterized by pose.

5.1.3 Stanford 40 Actions dataset

There are 40 different human actions included in the Stanford dataset, with a total of 9532 images and 180 to 300 images per action class. These are: applauding, playing a violin, blowing bubbles, pouring liquid, brushing teeth, pushing a cart, cleaning the floor, reading, climbing, riding a bike, cooking, riding a horse, cutting trees, rowing a boat, cutting vegetables, running, drinking, shooting an arrow, feeding a horse, smoking, fishing, taking photos, fixing a bike, texting a message, fixing a car, throwing a Frisbee, gardening, using a computer, holding an umbrella, walking a dog, jumping, washing dishes, looking through a microscope, watching TV, looking through a telescope, waving hands, phoning, writing on a board, playing a guitar, and writing in a book. Among them we select only the five actions that are similar to the ITLab dataset. We use 1160 images to train the network and 66 images for testing.

Figure 5.2: Five example action categories, (a) to (e): jumping, phoning, reading, taking a photo, and using a computer, from the Stanford 40 Actions dataset. Among the 40 actions, these five are selected.

5.1.4 IT Lab Action dataset
The Intelligent Technology Lab action dataset consists of five different human actions, jumping, phoning, reading, taking a photo, and using a computer, recorded in two environments. We train on the IT Lab dataset, which consists of 1123 images where the actions are labeled with bounding-box information. We considered two environments for data gathering: a simple one with a solid background and a cluttered one with a noisy background. We made 20 video clips in which each action is represented and extract 1000 JPEG images from each video clip. Each batch has 50 images, which includes 10 images for each action class.

Figure 5.3: IT Lab action dataset with simple backgrounds (using a computer, jumping, reading, phoning, taking a photo).

Figure 5.4: IT Lab action dataset with complex backgrounds (using a computer, jumping, reading, phoning, taking a photo). Here the backgrounds are cluttered with different objects.

5.1.5 Experiment Setup


In this section, we describe the details of our experiments on action recognition. We use a CNN architecture [95], which works extremely well for object classification in the ImageNet challenge [40]. We construct a VGG16 network [20], which showed a remarkable improvement over previous state-of-the-art methods [96]. Our experiments use the VGGNet network [20], which has five convolutional blocks and three fully connected layers, and the Inception network [59]. We use the Faster R-CNN [34] object detector for person detection, and our framework is implemented on the deep learning library Caffe [97]. All implementations ran on a single server with the CUDA deep neural network library (cuDNN) [98] and a single NVIDIA GeForce GTX 970. We use Matlab 2017b with the matconvnet [99] library on both the Windows 7 and Ubuntu 14.04 operating systems. We train our initial model for 50k iterations with batch size 8 and a learning rate of 0.001.

5.1.6 Evaluation
The action classification task presumes knowledge of the ground-truth location of the person at test time. For action classification evaluation, we used the evaluation criteria defined by the PASCAL VOC action task, which computes the Average Precision (AP) on the ground-truth test boxes. For the evaluation, we consider five kinds of human actions: Jumping, Phoning, Reading, Taking a Photo, and Using a Computer, where the background is solid and noise-free, known as the simple environment. We consider a threshold value of 0.9 when doing AL. We evaluate the proposed framework via human action recognition on the IT Lab action dataset, the Stanford 40 Actions dataset, and PASCAL VOC 2012. The IT Lab dataset contains these five actions in two different environments: one with simple backgrounds (free from noise) and the other with complex backgrounds (cluttered environments).

Table 5.1 shows action-recognition accuracies in the IT Lab environment. The performance improves with a higher number of images. In our experiment, actions such as jumping were classified correctly most often, while “taking a photo” had the worst detection rate compared to the other actions due to its similarity with phoning.

Table 5.1: Number of images used for training on the IT Lab Action dataset

                   Total    Average Accuracy    Total    Average Accuracy
Training Images    502      94%                 621      96%
Test Images        50                           50

Table 5.1 shows the number of images used for training and testing on the IT Lab dataset for human action recognition.

Figure 5.5: Human action recognition performance on the IT Lab dataset, using both semi-supervised learning (SSL) and active learning (AL) steps. In our experiment, each step consists of two sub-steps: AL and SSL. Figure 5.5 shows that SSL attained 56% accuracy, while AL attained 88% accuracy. In step two, with a combined dataset and more iterations, SSL and AL reached 92% accuracy. Finally, in step three, both SSL and AL achieved 96% accuracy.

Fig. 5.6 shows the number of epochs with the training and validation errors.

Figure 5.6: Parameter performance, including the learning rate, number of epochs, and ranges when training the network.

Figure 5.7: Performance with a smaller number of iterations.

Here, performance on the second dataset came to around 92%. We generally use 50 iterations; here only 5 iterations were used, which reduces the batch dataset training time.

5.1.7 Benchmark Dataset Comparison

Table 5.2 shows a comparison with state-of-the-art results from the PASCAL VOC 2012 dataset for

action detection. Our approach shows a significant gain over the best results reported in the literature.

Table 5.2: Performance with five different actions from the PASCAL VOC 2012 dataset

Methods                      Phoning    Reading    Taking a photo    Using a computer    Jumping    mAP
RCNN [100]                   72.6       74.0       83.3              87.0                88.7       84.9
Action Mask [101]            72.1       69.9       73.3              92.3                85.5       82.2
R*CNN [62]                   79.9       82.2       85.3              94.0                88.9       87.9
Whole & Parts [102]          61.2       66.7       74.7              79.5                84.5       80.4
Part-based Network [103]     86.1       87.4       86.5              92.4                88.2       88.8
Part Action Network [103]    86.9       88.5       87.5              92.4                89.6       90.0
Our approach                 88.0       92.0       88.0              94.0                96.0       91.6

Our method performs relatively poorly on phoning and taking a photo due to the similarity of these actions; however, it gains better performance on using a computer, reading, and jumping.

Figure 5.8: Human action performance analysis comparing the proposed Effective Adaptive Deep Learning method with the baseline results on the PASCAL VOC action dataset.

In figure 5.8, we show the performance difference between the baseline and the Effective Adaptive Deep Learning method. Each group of bars shows the performance of human action recognition on the PASCAL VOC dataset for five actions; each group has two bars corresponding to the baseline accuracy and the accuracy of the proposed Effective Adaptive Deep Learning method. On top of the single CNN model, we include ensemble learning to further enhance performance, as the single model failed to obtain satisfactory performance. In our case, we kept separate models for objects and humans. In order to obtain a combined performance, we independently trained multiple differently initialized CNNs and recorded their training responses. To train the ensemble weights, a loss was defined over the weighted ensemble response, and the weights were optimized to minimize this loss, as shown in figure 4.6. For testing, the learned weights are also used to compute the ensemble test response.

Table 5.3: Comparison between the IT Lab dataset and the Stanford 40 Actions dataset.

Approach          Accuracy (%)
                  Test set (50,0)  Test set (50,16)  Test set (50,33)  Test set (50,49)  Test set (50,66)  Test set (0,66)
VGG [ADAM]        0.96             0.803             0.723             0.58              0.552             0.212
VGG [RMS]         0.96             0.803             0.723             0.636             0.586             0.27
Inception [RMS]   0.96             0.818             0.675             0.626             0.552             0.212

In Table 5.3, Test set (P, Q) means test data combining P images from the IT Lab action dataset and Q test images from the Stanford 40 Actions dataset. Here, the IT Lab dataset is considered the simple-background case, meaning no cluttered backgrounds. The number of training images is 200 and the number of test images is 50, with 10 images for each action. Similarly, we use the Stanford 40 Actions dataset, where the training data total 1160 images for five actions and the test data total 66. Our training parameters were as follows: batch size 64, total number of epochs 300, and learning rate 0.001.

The results in Table 5.4 show a comparison of different state-of-the-art approaches using the IT Lab action dataset and the Stanford 40 Actions dataset. Notice that our method outperforms the other techniques on the mixed test data from the IT Lab and Stanford 40 Actions datasets. Our method gains the lowest accuracy for test set (0, 66) and obtains the best accuracy for test set (50, 0), which is significantly higher than previous work [103].

Table 5.4: Recent methods and performance results on the Stanford 40 dataset

Method                                  Mean AP (%)
Image Classification (VGG15 model)      81.4
Zhang et al. [101]                      82.6
Yan et al. [96]                         85.2
Ours with Inception (Adam)              90.9

Table 5.4 demonstrates the performance of our method on the Stanford 40 dataset. Here, our proposed method is trained with the Inception model and the Adam optimizer. We compare our results against [96], which shows that the action recognition method with Effective Adaptive Deep Learning based ensemble learning yields a superior performance of 90.9 mean AP.

5.2 Two person Interaction recognition


The main objective of our experiments is to prove the efficiency of the framework with data of various amounts and conditions. We check the performance improvement as additional data is added to the training set. The two-person interaction recognition experiments are carried out on two benchmarks, the UT-Interaction dataset [104] and HMDB, a large human motion database [80], as well as the ITLab action dataset, to check our framework's general performance. We use these two benchmark datasets because both contain similar human actions, such as hugging and fighting.

5.2.1 UT-Interaction dataset

There are six categories of human interaction in the UT-Interaction dataset: shake-hands, point, hug, push, kick, and punch [104]. There are a total of 20 video clips of approximately 1 minute in length, and each video clip includes a distinct background, scale, and illumination. Among these actions, hug and punch are similar to those in our work. We convert the UT-Interaction video files into image sequences to train the network models and use 835 images from them.

Figure 5.9: Examples from the UT-Interaction dataset: (a) punch, (b) shake-hands, (c) hug, (d) kick, (e) point, (f) push.

5.2.2 HMDB: a large human motion database


There are 51 action categories and 6766 video clips in this dataset [80]. The dataset is categorized into five classes: 1) general facial actions, 2) facial actions with object manipulation, 3) general body movements, 4) body movements with object interaction, and 5) body movements for human interaction. Among them, we consider hug and punch for experimental evaluation. We use 1133 images from the HMDB dataset for training and testing.

Figure 5.10: Examples from the HMDB dataset where the action categories are punching (a)-(c) and hugging (d)-(f).

5.2.3 ITLab Interaction dataset


The ITLab interaction dataset consists of five different human actions: hugging, fighting, linking arms, talking, and kidnapping. We use two environments for gathering two-person interaction data: a relatively simple one with a plain background and one with a cluttered background. We recorded 20 video clips in which each action is included and extracted 1000 JPEG images from each video clip.

5.2.4 Experiment Setup
We use the VGG [20] and Inception [59] pre-trained networks, which were trained on ImageNet for a 1000-category classification task. We also use adaptive gradient algorithms such as RMSprop [31] and Adam [30] for optimization. We use a hardware and library environment similar to that of the single-person action recognition experiments, including cuDNN (the CUDA Deep Neural Network library) [35] and Matlab 2017a with the matconvnet [35] library.

5.2.5 Evaluation
For the evaluation of our framework, we consider five types of actions: hugging, fighting, linking arms, talking, and kidnapping. From the 20,000 video frames, we select noise-free images from the unlabeled data and then incrementally update the model. We use a threshold value of 0.9 for AL.

Figure 5.11: Two-person actions in simple environments of the ITLab action dataset: (a) hugging, (b) fighting, (c) linking arms, (d) talking, (e) kidnapping.

We create four training datasets and one test dataset for the simple environment. The training and test datasets are labeled before training. Performance is shown in figure 5.12, where the initial training set's performance is not good (around 44% accuracy) due to the noisy dataset. As active learning is applied with an updated model, performance gradually increases. For the second dataset, AL performance is merely 52%; similarly, after a third dataset, accuracy reaches 78%. After training on the final dataset is complete, performance attains up to 96%. We find that the best case, with training and validation sets using a simple background, is 98%; however, on average, performance is 96% accurate, where our model reaches saturation.

Figure 5.12: Performance chart for the simple-environment dataset.

Figure 5.13: Two-person actions in the ITLab dataset with a cluttered environment: (a) hugging, (b) fighting, (c) linking arms, (d) talking, (e) kidnapping.

For the complex-background dataset, where the background is noisy, we consider four training sets and one test set. The training and test datasets are labeled before training, as with the simple environment. Performance results are shown in figure 5.14, where the initial training set's performance is around 67% accuracy. After active learning is applied to the second set, accuracy increases to 75%, and gradually reaches 77% for the third set. Performance reaches up to 81% accuracy after training on the final dataset is finished.

Figure 5.14: Performance chart for the complex-environment dataset.

We divided the entire simple-environment dataset into k equal-sized subsets. A single subset is retained as the validation data for testing the model, and the remaining k-1 subsets are used as training data. The cross validation proceeds k times, with each of the k subsets used once as the validation set, and the results are averaged to produce a single estimate. We obtained the results shown in Table 5.5 after k-fold cross validation, where the average accuracy is about 95.7%.

Table 5.5: Cross validation (k-fold) based on the simple-environment dataset setting.

Training set number    Test set    Average accuracy
6,7,8,9                10          92%
7,8,9,10               6           96.5%
6,8,9,10               7           94%
6,7,9,10               8           98%
6,7,8,10               9           98%

Table 5.5 shows the corresponding average accuracy for each split of the ITLab training datasets.
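For reference, the k-fold protocol used above corresponds to the following scikit-learn sketch; the build_and_eval helper is a placeholder for training our classifier on one split and reporting its validation accuracy.

from sklearn.model_selection import KFold
import numpy as np

def kfold_accuracy(X, y, build_and_eval, k=5):
    """Average accuracy over k folds: each subset serves once as the
    validation set while the remaining k-1 subsets are used for training.

    build_and_eval : (X_tr, y_tr, X_va, y_va) -> accuracy (assumed helper)
    """
    scores = []
    for train_idx, val_idx in KFold(n_splits=k, shuffle=True,
                                    random_state=0).split(X):
        scores.append(build_and_eval(X[train_idx], y[train_idx],
                                     X[val_idx], y[val_idx]))
    return float(np.mean(scores))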

Figure 5.15: Confusion matrix for two-person interaction in a complex environment

Figure 5.15 shows a confusion matrix for two-person interaction in a complex environment. The first row indicates hugging, with 80% true-positive classification and around 20% misclassified as kidnapping. The second row is the action of fighting, where 90% is classified correctly and 10% is misclassified as linking arms. The third and fourth rows show linking arms and talking, where 100% are true positives.

Table 5.6: Performance of the ASSL framework for the two environments.

Training set number    Test set    Environment            Performance
4                      1           Simple environment     96%
4                      1           Complex environment    81%

Table 5.6 shows the results of our proposed method, using the ASSL framework, for both simple and complex environments.

5.2.6 Benchmark dataset comparison


Table 5.7: Action classification on the UT-Interaction dataset with the Adam optimizer.

Method        UT-Interaction action    Precision    Recall    F-measure
VGG [ADAM]    Fight                    1.00         0.95      0.97
              Handshake                0.97         1.00      0.99
              Hug                      1.00         1.00      1.00
              Kick                     1.00         0.96      0.98
              Point                    1.00         0.97      0.99
              Push                     0.94         1.00      0.97

Table 5.7 shows the efficiency of the adaptive Adam optimizer [30] using the VGG model. There are six action classes in the UT-Interaction dataset, in two different sets (set 1 and set 2). The precision, recall, and F-measure demonstrate that our method outperforms other techniques [73].

Table 5.8: Action classification on the UT-Interaction dataset with the RMSProp optimizer.

Method       UT-Interaction action    Precision    Recall    F-measure
VGG [RMS]    Fight                    0.94         0.85      0.89
             Handshake                0.94         1.00      0.97
             Hug                      1.00         1.00      1.00
             Kick                     0.96         0.92      0.94
             Point                    1.00         0.94      0.97
             Push                     0.92         1.00      0.96

Table 5.8 displays the efficiency of the adaptive gradient algorithm RMSProp [31] using the VGG model. Hug and point perform well among all the human actions. The precision, recall, and F-measure show the effectiveness of our method compared to state-of-the-art techniques [73].

Table 5.9: Several comparative techniques for action classification.

Method                           Accuracy (UT-Interaction dataset)
Ryoo et al. 2009 [79]            70.8%
Branden et al. 2011 [106]        78.9%
RPT+HV, Yu et al. 2015 [106]     85.4%
LMDI, Sahoo et al. 2018 [14]     87.5%
Ours                             92.1%

Table 5.9 shows the different comparison techniques and their results on the UT-Interaction dataset. Our proposed framework outperforms the other techniques.

Table 5.10: Comparison of action classification on the HMDB51 dataset.

Approach          Average accuracy
                  Test set (200,0)  Test set (200,50)  Test set (200,300)  Test set (200,550)  Test set (200,800)  Test set (0,1050)
VGG [RMS]         98                76                 63                  60                  61                  60
VGG [ADAM]        94                77                 71                  72                  70                  69
Inception [RMS]   84                76                 65                  61                  62                  61
Ours              96                76                 66                  61                  64                  63

We evaluated performance on the HMDB51 dataset on a limited scale, namely hug and fight, because many of the actions involve a single person and are poorly visible. In Table 5.10, “Test set (X, Y)” means test data combining Y images from the HMDB51 dataset and X test images from the ITLab action dataset. The lowest accuracy, for Test set (0,1050), is 69%, which is still better than Varol et al. [63]. We obtain the best performance on Test set (200,0) with the VGG pre-trained model using the RMSProp optimizer.

5.3 EER-ASSL based human action recognition

5.3.1 Experiment Setup

Extensive experiments are conducted using benchmark datasets, namely the KTH action dataset and the ITLab dataset. The performance is compared with recent popular methods such as the multi-label hierarchical Dirichlet process (ML-HDP) [107]. The experiments are implemented on a single NVIDIA TITAN X with cuDNN [105], Python 3.6, and TensorFlow [108]. For this experiment we use YOLOv2 as the base person detector.

5.3.2 Benchmark dataset


We use the KTH action dataset. It has six types of action: boxing, hand clapping, hand waving, jogging, running, and walking. There are 25 subjects, recorded with a static camera at 25 fps with a resolution of 160×120 pixels. All sequences were taken over homogeneous backgrounds. We divide the images into training and test sets, and our action recognition model is trained on the training set.

Figure 5.16: KTH action dataset with six human actions (hand waving, jogging, running, walking, hand clapping, boxing).

Using the local noisy dataset and the benchmark dataset, we compare EER-ASSL and SSL, which shows a clear improvement in results. We use the Adam optimizer with a learning rate of 0.001; the results are shown in figure 5.17.

Figure 5.17: The information flow of the rollback based EER-ASSL and SSL.

Using the local dataset images, we compare EER-ASSL and SSL, showing improved results on the local and benchmark datasets. In the beginning, EER-ASSL shows lower performance but gradually improves, especially after the fourth bin. On the other hand, SSL remains very inconsistent from the beginning, as shown in figure 5.17.

5.3.3 Comparison with state-of-the-art methods on the KTH action dataset
EER-ASSL based human action recognition is compared with several other methods. We divide the dataset in an 80/20 ratio, where 80% of the data is used for training and 20% for testing. For each video clip, 450 images are extracted; a similar number of images is extracted for the test dataset.

Table 5.11: Performance comparison of recent methods and the EER-ASSL method

Methods                                   KTH action dataset
BoW (STIP) [109]                          91.8%
ML-HDP (STIP) [107]                       91.3%
ML-HDP (IDT) [107]                        95.8%
ML-HDP (sTDD) [107]                       94.1%
ML-HDP (IDT+sTDD) [107]                   96.5%
Ours (EER-ASSL for action recognition)    96.9%

Table 5.11 demonstrates the performance of our method on the KTH dataset. Here, our proposed method is trained with the VGG model and the Adam optimizer. We compare our results against ML-HDP [107] and BoW (STIP) [109], which shows that EER-ASSL based adaptive learning captures better features and achieves better performance on the KTH dataset.

5.4 Summary
In this chapter, we have presented experimental results for single-person action recognition, two-person interaction recognition, and EER-ASSL based human action recognition, which validate our rollback based ASSL method.

Chapter 6

Conclusion and Future Research

We have seen great advancement in the area of computer vision in the last few years; in particular, human action recognition performance has improved. Consider the ImageNet Large-Scale Visual Recognition Challenge, where the top-5 error was reduced to 3% in 2016 [37]; classification accuracy in tasks such as human action recognition has improved similarly. In this dissertation, we developed several methods, especially EER-ASSL, that improved human action classification performance.

6.1 Thesis summary


This thesis investigates the problems involved in recognizing the action of a person in image and video data. Our key contributions include: 1) adapting deep learning with an emphasis on active learning, which allows us to select the correct samples for better human action recognition; and 2) applying the rollback process combined with ensemble learning to deal with noisy data for efficient human action recognition. This thesis introduces an active semi-supervised learning framework that improves the learning task and reduces the error rate. Our experiments performed well on both single-person action recognition and two-person interaction recognition.

6.2 Contributions
6.2.1 Single person Action recognition
Single-person actions include using a computer, jumping, phoning, reading, and taking a photo in simple and cluttered environments. We use deep learning to train on human actions from a large number of benchmark dataset images to improve human action classification accuracy. Two major issues when working with a deep learning framework are the number of labeled images and the variation in poses. We propose a better approach to ensemble learning, which combines an object detector and incremental active learning (AL) to tackle human action recognition with limited training examples, because AL alone is not sufficient to improve performance given the variety of poses and backgrounds. Our proposed ASSL framework, combined with ensemble learning, improves action recognition performance and reduces the burden of human labeling effort. We validated our experiments with benchmark datasets (PASCAL VOC 2012 and the Stanford 40 Actions dataset). The proposed algorithm produces state-of-the-art results and outperforms other approaches. Our varied human action recognition experiments show that the ASSL based technique can be applied in many different domains, such as surveillance systems, patient monitoring, and sports video analysis.

6.2.2 Two person interaction recognition

The experimental results in the previous chapter show that our proposed method outperforms many other techniques for two-person interaction recognition. We are in the process of making our method more effective so that it not only combines SSL and AL to tackle the issue but also uses only partial training examples. This framework decreases the human labeling time. We presented a distinctive dataset with an experimental methodology to back our work and compared it with benchmark datasets as well. Our experiments show good performance on the benchmark action recognition datasets UT-Interaction [104] and HMDB51 [80], but we plan to improve the accuracy of the proposed method on more diverse human action recognition datasets in the future.

6.2.3 EER-ASSL based human action recognition

We propose a rapid adaptive deep learning based EER-ASSL framework that can recognize human activities, such as sports events or motion-based activities, in noisy environments. This is challenging work due to the lack of labeled data, the drift problem, large feature vectors, and the large volume of raw video files. Our experiments mainly covered the popular KTH action benchmark dataset and the ITLab dataset. Our proposed method performed significantly well on these datasets.

6.3 Future Research

Finally, this thesis provides an approach that makes use of incremental active learning with the goal of improving performance. These hypotheses serve as foundational work for further research on human action recognition in videos. Based on our method, a number of new problems can be targeted and solved, including event detection and security surveillance. Future work may include the investigation of human action recognition using less labeled data. Looking at the results, we see that a deep convolutional neural network is suitable for human action recognition on these benchmark datasets, but the deep CNN still requires much more fine-tuning with a large-volume dataset in order to achieve better accuracy.

Publications
Journal Publications
1. Minhaz Uddin Ahmed, Kim Jin Woo, Kim Yeong Hyeon, Md. Rezaul Bashar, Phill Kyu Rhee,
Wild Facial Expression Recognition Based on Incremental Active Learning. Cognitive System
Research, Elsevier, 2018
2. Minhaz Uddin Ahmed, Kim Yeong Hyeon, Jin Woo Kim, Md Rezaul Bashar, and Phill Kyu Rhee,
"Two-Person Interaction Recognition Based on Effective Hybrid Learning," KSII Transactions on
Internet and Information Systems, February 28, 2019
3. Minhaz Uddin Ahmed, Kim Yeong Hyeon, Md. Rezaul Bashar, Phill Kyu Rhee, "Effective Human
Action Recognition Using Adaptive Deep Learning," Multimedia Tools and Applications, Springer.
(Under review)
4. Dong Kyun Shin, Minhaz Uddin Ahmed and Phill Kyu Rhee “Incremental Deep Learning for
Robust Object Detection in Unknown Cluttered Environments” IEEE Access journal. 2018
5. Phill Kyu Rhee, Enkhbayar Erdenee, Shin Dong Kyun, Minhaz Uddin Ahmed, Songguo Jin,
Active and semi-supervised learning for object detection with imperfect data, Cognitive Systems
Research, Elsevier, 2017
6. Miyoung Nam, Minhaz Uddin Ahmed, Yan Shen, and Phill Kyu Rhee, “Mouth Tracking for
Hands-free Robot Control Systems” International Journal of Control, Automation, and Systems, vol.
12, no. 3, pp.628-636, 2014
7. Shin, Hak-Chul; Shen, Yan; Khim, Sarang; Sung, WonJun; Ahmed, Minhaz Uddin; Hong,
Yo-Hoon; Rhee, Phill-Kyu; "Performance Improvement of Eye Tracking System using Reinforcement
Learning," KOREA SCIENCE journal, 2013

Conference Publications:
Minhaz Uddin Ahmed, Kim Jin Woo, Miyoung Nam, Md Rezaul Bashar and Phill Kyu Rhee, “A
Deep Learning based approach for Human Action Recognition” 1st International Conference on
Machine Learning and Data Engineering (iCMLDE), 2017
Kim Jin Woo, Minhaz Uddin Ahmed, Md Rezaul Bashar and Phill Kyu Rhee, “High Efficiency
Facial Expression Recognition based Active Semi-Supervised Learning” 1st International Conference
on Machine Learning and Data Engineering (iCMLDE), 2017
Shibo Han, Minhaz Uddin Ahmed, Phill Kyu Rhee, “Monocular SLAM and Obstacle Removal for
Indoor Navigation” International Conference on Machine Learning and Data Engineering (iCMLDE),
2018

Appendix A: Active semi-supervised learning tool for human action recognition.

Figure A.1: Common dataset merging tool

The common dataset merging tool is used to combine two similar batch datasets, mostly for training purposes, which increases the data size.

Figure A.2: A common SSL tool for evaluation

After training on each dataset, the performance evaluation is ensured by the SSL prediction. As Figure A.2 shows, we load the trained model with the Load Model button and load the batch dataset we want to evaluate with the Load IMDB button. Here the evaluation score is 60, which is shown in the display by an asterisk mark. If the prediction result is less than the threshold value and has false classifications, we apply AL.

Figure A.3: Using an active learning tool for human action recognition. Here the action is using a computer.

In Figure A.3, the grid shows the prediction score for each human action. If the prediction score is less than the threshold value, we apply AL and then retrain on these datasets. The entire process continues for each dataset until the performance reaches saturation.
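A minimal Python sketch of this per-dataset loop is shown below; evaluate, human_label, and train are hypothetical helpers standing in for the tool's Load Model, Load IMDB, relabeling, and training actions, and the patience-based stopping rule is one simple way to detect saturation.

def assl_tool_loop(model, batches, threshold=0.9, patience=2):
    # Iterate over the batch datasets: evaluate, apply AL when needed, retrain.
    # `evaluate`, `human_label`, and `train` are hypothetical helpers.
    best_score, stalled = 0.0, 0
    for batch in batches:
        score, low_conf = evaluate(model, batch)   # SSL prediction pass
        if score < threshold:
            # Low-confidence or misclassified samples go to a human annotator.
            batch = human_label(batch, low_conf)
        model = train(model, batch)                # incremental retraining
        if score <= best_score:
            stalled += 1
            if stalled >= patience:                # performance saturated
                break
        else:
            best_score, stalled = score, 0
    return model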

We adopted the pre-trained VGG16 network model discussed in Chapter 2, Section 2.1.5.1, which performs well across different classification tasks. We fine-tune our network and find that, during training, about 50,000 iterations produce optimal performance. We set the learning rate to 0.0001 and the batch size to 8. For AL human labeling, we consider a threshold value of around 0.9 for optimal performance.
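For reference, a minimal PyTorch sketch of this fine-tuning setup follows; our experiments used the MatConvNet library [35], so this is only an illustrative equivalent, and action_dataset is an assumed dataset object of action images and labels.

import torch
import torchvision

# Load a pre-trained VGG16 and replace the classifier head for our actions.
model = torchvision.models.vgg16(weights="IMAGENET1K_V1")
num_actions = 5  # e.g., using a computer, jumping, phoning, reading, taking a photo
model.classifier[6] = torch.nn.Linear(4096, num_actions)

# Hyperparameters from the text: learning rate 0.0001 and batch size 8,
# with the Adam optimizer used in Chapter 5.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
loss_fn = torch.nn.CrossEntropyLoss()
# `action_dataset` is a hypothetical torch Dataset of action images/labels.
loader = torch.utils.data.DataLoader(action_dataset, batch_size=8, shuffle=True)

iteration = 0
while iteration < 50_000:          # about 50,000 iterations proved optimal
    for images, labels in loader:
        optimizer.zero_grad()
        loss = loss_fn(model(images), labels)
        loss.backward()
        optimizer.step()
        iteration += 1
        if iteration >= 50_000:
            break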

Figure A.4: Training the human action dataset using the MatConvNet library [35] based ASSL tool. Here the first dataset shows poor performance.

In this situation, AL plays a pivotal role in identifying the noisy dataset. Eliminating noisy datasets and reorganizing them for training is an important part of performance improvement. Apart from the dataset, other parameters, such as the number of iterations, the learning rate, and the pre-trained model, also have a significant effect on the training process.

Figure A.5: The training performance of four sets of data, where set 1 and set 2 produce the same result and set 3 progresses, but set 4 produces a poor result; consequently, training rolls back to set 3.

Among all the datasets, when the performance on the current dataset is not satisfactory, the rollback process returns to the previous dataset.
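A minimal Python sketch of this rollback decision is given below; train and validate are hypothetical helpers (validate returning an accuracy score on a held-out evaluation batch), and checkpointing by deep-copying the model is just one way to realize the rollback step.

import copy

def train_with_rollback(model, dataset_sets):
    # Train over successive dataset sets, rolling back whenever a set hurts
    # performance. `train` and `validate` are hypothetical helpers.
    best_model = copy.deepcopy(model)
    best_score = validate(best_model)
    for dataset in dataset_sets:
        candidate = train(copy.deepcopy(best_model), dataset)
        score = validate(candidate)
        if score >= best_score:
            # The new set helped (e.g., set 3 in Figure A.5): keep the update.
            best_model, best_score = candidate, score
        # Otherwise (e.g., set 4): discard the update, i.e., roll back to the
        # previous checkpoint, and continue with the next set.
    return best_model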

Figure A.6: The training performance after incremental learning. Here the performance reaches saturation.

After incremental training, when the pre-trained model reaches saturation, further training has no effect on the performance, and we obtain the final trained model.

Bibliography
[1] O. Russakovsky et al., “ImageNet Large Scale Visual Recognition Challenge,” Int. J. Comput.
Vis., vol. 115, no. 3, pp. 211–252, 2015.
[2] T. Y. Lin et al., “Microsoft COCO: Common objects in context,” Lect. Notes Comput. Sci.
(including Subser. Lect. Notes Artif. Intell. Lect. Notes Bioinformatics), vol. 8693 LNCS, no.
PART 5, pp. 740–755, 2014.
[3] D. Murray and A. Basu, “Motion tracking with an active camera,” IEEE Trans. Pattern Anal.
Mach. Intell., vol. 16, no. 5, 1994.
[4] C.-M. Huang, Y.-R. Chen, and L.-C. Fu, “Real-time object detection and tracking on a
moving camera platform,” 2009 ICCAS-SICE, pp. 717–722, 2009.
[5] S.-R. Ke, H. Thuc, Y.-J. Lee, J.-N. Hwang, J.-H. Yoo, and K.-H. Choi, “A review on video-
based human activity recognition,” Computers, vol. 2, no. 2, 2013.
[6] L. Gorelick, M. Blank, E. Shechtman, M. Irani, and R. Basri, “Actions as space-time shapes,”
IEEE Trans. Pattern Anal. Mach. Intell., vol. 29, no. 12, pp. 2247–2253, 2007.
[7] D. G. Lowe, “Distinctive image features from scale-invariant keypoints,” Int. J. Comput. Vis.,
vol. 60, no. 2, pp. 91–110, 2004.
[8] N. Dalal and B. Triggs, “Histograms of oriented gradients for human detection,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), 2005.
[9] S. Sedai, M. Bennamoun, and D. Huynh, “Context-based appearance descriptor for 3D human
pose estimation from monocular images,” DICTA 2009 - Digit. Image Comput. Tech. Appl., no.
January, pp. 484–491, 2009.
[10] A. Veeraraghavan, A. K. Roy-Chowdhury, and R. Chellappa, “Matching shape sequences in
video with applications in human movement analysis,” IEEE Trans. Pattern Anal. Mach. Intell.,
vol. 27, no. 12, 2005.
[11] T. V. Duong, H. H. Bui, D. Q. Phung, and S. Venkatesh, “Activity recognition and abnormality
detection with the switching hidden semi-Markov model,” Proc. - 2005 IEEE Comput. Soc.
Conf. Comput. Vis. Pattern Recognition, CVPR 2005, vol. I, pp. 838–845, 2005.
[12] Y. Du, F. Chen, and W. Xu, “Human interaction representation and recognition through motion
decomposition,” IEEE Signal Process. Lett., vol. 14, no. 12, pp. 952–955, 2007.
[13] C. Schuldt, I. Laptev, and B. Caputo, “Recognizing human actions: A local SVM approach,”
in Proc. 17th Int. Conf. Pattern Recognition (ICPR), vol. 3, pp. 32–36, 2004.
[14] S. Prakash Sahoo and S. Ari, “On an algorithm for Human Action Recognition,” Expert Syst.
Appl., 2018.
[15] I. Laptev, “On space-time interest points,” in International Journal of Computer Vision, 2005.

[16] C. Feichtenhofer, A. Pinz, and A. Zisserman, “Convolutional two-stream network fusion for
video action recognition,” in Proc. IEEE CVPR, pp. 1933–1941, 2016.
[17] M. Hasan and A. K. Roy-Chowdhury, “A continuous learning framework for activity
recognition using deep hybrid feature models,” IEEE Trans. Multimedia, vol. 17, no. 11, pp. 1909–1922,
2015.
[18] T. Wang, Y. Chen, M. Zhang, J. I. E. Chen, and H. Snoussi, “Internal Transfer Learning for
Improving Performance in Human Action Recognition for Small Datasets,” IEEE Access, vol.
5, pp. 17627–17633, 2017.
[19] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet Classification with Deep
Convolutional Neural Networks,” Adv. Neural Inf. Process. Syst., pp. 1–9, 2012.
[20] K. Simonyan, A. Vedaldi, and A. Zisserman, “Deep inside convolutional networks:
Visualising image classification models and saliency maps,” ICLR Workshop, 2014.
[21] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan, “Object detection with
discriminatively trained part-based models,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 32, no. 9, pp. 1627–1645, 2010.
[22] F. Baradel, C. Wolf, J. Mille, and G. W. Taylor, “Glimpse Clouds: Human Activity Recognition
from Unstructured Feature Points,” 2018.
[23] J. Choi et al., “Context-aware Deep Feature Compression for High-speed Visual Tracking,” pp.
479–488, 2018.
[24] Y. Zhou, X. Sun, Z.-J. Zha, and W. Zeng, “MiCT: Mixed 3D/2D convolutional tube for
human action recognition,” in Proc. IEEE CVPR, pp. 449–458, 2018.
[25] E. Marinoiu, M. Zanfir, and V. Olaru, “3D human sensing, action and emotion recognition
in robot assisted therapy of children with autism,” in Proc. IEEE CVPR, pp. 2158–2167, 2018.
[26] P. Vicol, M. Tapaswi, L. Castrejon, and S. Fidler, “MovieGraphs: Towards Understanding
Human-Centric Situations from Videos,” pp. 8581–8590, 2017.
[27] P. Wei, Y. Liu, T. Shu, N. Zheng, and S.-C. Zhu, “Where and Why Are They Looking? Jointly
Inferring Human Attention and Intentions in Complex Tasks,” Proc. IEEE Conf. Comput. Vis.
Pattern Recognit. (CVPR 18), pp. 6801–6809, 2018.
[28] K. Simonyan and A. Zisserman, “Very Deep Convolutional Networks for Large-Scale Image
Recognition,” pp. 1–14, 2014.
[29] C. Szegedy et al., “Going deeper with convolutions,” Proc. IEEE Comput. Soc. Conf. Comput.
Vis. Pattern Recognit., vol. 07–12–June, pp. 1–9, 2015.
[30] D. P. Kingma and J. Ba, “Adam: A Method for Stochastic Optimization,” pp. 1–15, 2014.
[31] T. Tieleman and G. Hinton, “Lecture 6.5-rmsprop: Divide the gradient by a running average of its
recent magnitude,” COURSERA: Neural Networks for Machine Learning, 2012.
[32] M. Stikic, K. Van Laerhoven, and B. Schiele, “Exploring semi-supervised and active learning
for activity recognition,” in Proc. 12th IEEE Int. Symp. Wearable Computers (ISWC), pp. 81–88, 2008.
[33] B. Settles, Active Learning, vol. 6, no. 1. 2012.
[34] S. Ren, K. He, R. Girshick, and J. Sun, “Faster R-CNN: Towards Real-Time Object Detection
with Region Proposal Networks,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 39, no. 6, pp.
1137–1149, 2017.
[35] A. Vedaldi and K. Lenc, “MatConvNet Convolutional Neural Networks for MATLAB,” 2016.
[36] P. K. Rhee, E. Erdenee, S. D. Kyun, M. U. Ahmed, and S. Jin, “Active and semi-supervised
learning for object detection with imperfect data,” Cogn. Syst. Res., vol. 45, pp. 109–123, 2017.
[37] A. Karpathy, August 2016.
[38] L. Fei-Fei, R. Fergus, and P. Perona, “Learning Generative Visual Models From Few Training
Examples: An Incremental Bayesian Approach Tested on 101 Object Categories,” IEEE CVPR
Work. Gener. Model Based Vis., 2004.
[39] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman, “The pascal visual
object classes (VOC) challenge,” Int. J. Comput. Vis., vol. 88, no. 2, pp. 303–338, 2010.
[40] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei, “Large-scale
video classification with convolutional neural networks,” in Proc. IEEE CVPR, 2014.
[41] O. Russakovsky et al., “ImageNet Large Scale Visual Recognition Challenge,” Int. J. Comput.
Vis., vol. 115, no. 3, pp. 211–252, 2015.
[42] O. Chapelle, B. Schölkopf, and A. Zien, Semi-Supervised Learning. MIT Press, 2010.
[43] D. Yarowsky, “Unsupervised word sense disambiguation rivaling supervised methods,” Proc.
33rd Annu. Meet. Assoc. Comput. Linguist. -, pp. 189–196, 1995.
[44] A. B. Goldberg, “Multi-Manifold Semi-Supervised Learning,” pp. 169–176, 2009.
[45] S. Jones and L. Shao, “A multigraph representation for improved unsupervised/semi-
supervised learning of human actions,” in Proc. IEEE CVPR, 2014.
[46] T. Zhang, S. Liu, C. Xu, and H. Lu, “Boosted multi-class semi-supervised learning for human
action recognition,” Pattern Recognit., vol. 44, no. 10–11, pp. 2334–2342, 2011.
[47] C. Charles and A. James, Document resume ed 336 049. 1991.
[48] M. Li and I. K. Sethi, “Confidence-based active learning,” IEEE Trans. Pattern Anal. Mach.
Intell., vol. 28, no. 8, pp. 1251–1261, 2006.
[49] J. Sourati, M. Akcakaya, D. Erdogmus, T. K. Leen, and J. G. Dy, “A Probabilistic Active
Learning Algorithm Based on Fisher Information Ratio,” IEEE Trans. Pattern Anal. Mach.
Intell., vol. 40, no. 8, pp. 2023–2029, 2018.
[50] J. Bernard, M. Hutter, M. Zeppelzauer, D. Fellner, and M. Sedlmair, “Comparing visual-
interactive labeling with active learning: An experimental study,” IEEE Trans. Vis. Comput.
Graph., vol. 24, no. 1, pp. 298–308, 2018.
[51] S. Hao, J. Lu, P. Zhao, C. Zhang, S. C. H. Hoi, and C. Miao, “Second-Order Online Active
Learning and Its Applications,” IEEE Trans. Knowl. Data Eng., vol. 30, no. 7, pp. 1338–1351,
2018.
[52] A. Roederer, “Active learning for classification of medical signals,” no. November, 2012.
[53] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT Press, 2016.
[54] S. J. Pan and Q. Yang, “A survey on transfer learning,” IEEE Trans. Knowl. Data Eng., vol. 22, no. 10, pp. 1345–1359, 2010.
[55] G. P. C. Fung, J. X. Yu, H. Lu, and P. S. Yu, “Text classification without negative examples
revisit,” IEEE Trans. Knowl. Data Eng., vol. 18, no. 1, pp. 6–20, 2006.
[56] P. Wu and T. G. Dietterich, “Improving SVM accuracy by training on auxiliary data sources,”
Proc. Int. Conf. Mach. Learn., pp. 110–118, 2004.
[57] Proc. ACL 2007, Prague, Czech Republic, 2007.
[58] C. Szegedy et al., “Going Deeper with Convolutions,” pp. 1–9, 2014.
[59] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, “Rethinking the Inception
Architecture for Computer Vision,” 2015.
[60] J. Duchi, E. Hazan, and Y. Singer, “Adaptive Subgradient Methods for Online Learning and
Stochastic Optimization,” J. Mach. Learn. Res., vol. 12, pp. 2121–2159, 2011.
[61] G. E. Hinton, N. Srivastava, and K. Swersky, “Lecture 6a- overview of mini-batch gradient
descent,” COURSERA Neural Networks Mach. Learn., p. 31, 2012.
[62] G. Gkioxari, R. Girshick, and J. Malik, “Contextual action recognition with R*CNN,” in
Proc. IEEE ICCV, 2015.
[63] G. Varol, I. Laptev, and C. Schmid, “Long-term temporal convolutions for action
recognition,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 40, no. 6, pp. 1510–1517, 2018.
[64] I. Laptev and P. Pérez, “Retrieving actions in movies,” Proc. IEEE Int. Conf. Comput. Vis.,
2007.
[65] Y. Iwashita, A. Takamine, R. Kurazume, and M. S. Ryoo, “First-person animal activity
recognition from egocentric videos,” Proc. - Int. Conf. Pattern Recognit., no. i, pp. 4310–4315,
2014.
[66] S. Ma, J. Zhang, S. Sclaroff, N. Ikizler-Cinbis, and L. Sigal, “Space-Time Tree Ensemble for
Action Recognition and Localization,” Int. J. Comput. Vis., vol. 126, no. 2–4, pp. 314–332,
2018.
[67] V. Delaitre, I. Laptev, and J. Sivic, “Recognizing human actions in still images: A study of
bag-of-features and part-based representations,” in Proc. BMVC, 2010.
[68] G. Gkioxari, B. Hariharan, R. Girshick, and J. Malik, “R-CNNs for Pose Estimation and Action
Detection,” arXiv Prepr. arXiv1406.5212, pp. 1–8, 2014.
[69] G. Ge, K. Yun, D. Samaras, and G. J. Zelinsky, “Action classification in still images using
human eye movements,” 2015 IEEE Conf. Comput. Vis. Pattern Recognit. Work., pp. 16–23,
2015.
[70] F. S. Khan, J. Xu, J. Van De Weijer, A. D. Bagdanov, R. M. Anwer, and A. M. Lopez,
“Recognizing Actions Through Action-Specific Person Detection,” IEEE Trans. Image
Process., vol. 24, no. 11, pp. 4422–4432, 2015.
[71] H. Wang, A. Kläser, C. Schmid, and C. L. Liu, “Action recognition by dense trajectories,”
Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit., pp. 3169–3176, 2011.
[72] X. Peng, C. Zou, Y. Qiao, and Q. Peng, “Action Recognition with Stacked Fisher Vectors,”
Eccv, pp. 581–595, 2014.
[73] K. N. E. H. Slimani, Y. Benezeth, and F. Souami, “Human interaction recognition based on the
co-occurrence of visual words,” IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit.
Work., pp. 461–466, 2014.
[74] Y. LeCun, K. Kavukcuoglu, and C. Farabet, “Convolutional networks and applications in
vision,” ISCAS 2010 - 2010 IEEE Int. Symp. Circuits Syst. Nano-Bio Circuit Fabr. Syst., pp.
253–256, 2010.
[75] K. He, X. Zhang, S. Ren, and J. Sun, “Deep Residual Learning for Image Recognition,” 2016
IEEE Conf. Comput. Vis. Pattern Recognit., pp. 770–778, 2016.
[76] L. Sun, K. Jia, K. Chen, D. Y. Yeung, B. E. Shi, and S. Savarese, “Lattice Long Short-Term
Memory for Human Action Recognition,” Proc. IEEE Int. Conf. Comput. Vis., vol. 2017–
Octob, pp. 2166–2175, 2017.
[77] Z. Cao, T. Simon, S. E. Wei, and Y. Sheikh, “Realtime multi-person 2D pose estimation using
part affinity fields,” Proc. - 30th IEEE Conf. Comput. Vis. Pattern Recognition, CVPR 2017,
vol. 2017–Janua, no. Xxx, pp. 1302–1310, 2017.
[78] Y. Han, S. L. Chung, S. F. Chen, and S. F. Su, “Two-Stream LSTM for Action Recognition
with RGB-D-Based Hand-Crafted Features and Feature Combination,” Proc. - 2018 IEEE Int.
Conf. Syst. Man, Cybern. SMC 2018, pp. 3547–3552, 2019.
[79] M. S. Ryoo and J. K. Aggarwal, “Spatio-temporal relationship match: Video structure
comparison for recognition of complex human activities,” Proc. IEEE Int. Conf. Comput. Vis.,
no. Iccv, pp. 1593–1600, 2009.
[80] H. Kuehne, H. Jhuang, E. Garrote, T. Poggio, and T. Serre, “HMDB: A large video database for
human motion recognition,” Proc. IEEE Int. Conf. Comput. Vis., no. November 2011, pp.
2556–2563, 2011.

[81] S. Hanneke, “Rates of convergence in active learning,” Ann. Stat., vol. 39, no. 1, pp. 333–361,
2011.
[82] D. Cohn, L. Atlas, and R. Ladner, “Improving Generalization with Active Learning,” Mach.
Learn., vol. 15, no. 2, pp. 201–221, 1994.
[83] S. C. H. Hoi, R. Jin, J. Zhu, and M. R. Lyu, “Batch mode active learning and its application to
medical image classification,” Proc. 23rd Int. Conf. Mach. Learn. - ICML ’06, pp. 417–424,
2006.
[84] S. Dasgupta, “Two faces of active learning,” Theor. Comput. Sci., vol. 412, no. 19, pp. 1767–1781, 2011.
[85] K. Chatfield, K. Simonyan, A. Vedaldi, and A. Zisserman, “Return of the Devil in the Details:
Delving Deep into Convolutional Nets,” pp. 1–11, 2014.
[86] X. Y. Zhang, S. Wang, and X. Yun, “Bidirectional active learning: A two-way exploration into
unlabeled and labeled data set,” IEEE Trans. Neural Networks Learn. Syst., vol. 26, no. 12, pp.
3034–3044, 2015.
[87] Y. Yang and M. Loog, “Active learning using uncertainty information,” Proc. - Int. Conf.
Pattern Recognit., pp. 2646–2651, 2017.
[88] D. K. Shin, M. U. Ahmed, and P. K. Rhee, “Incremental deep learning for robust object
detection in unknown cluttered environments,” IEEE Access, vol. 6, 2018.
[89] J. Kwon and K. M. Lee, “Tracking of a non-rigid object via patch-based dynamic appearance
modeling and adaptive basin hopping monte carlo sampling,” 2009 IEEE Comput. Soc. Conf.
Comput. Vis. Pattern Recognit. Work. CVPR Work. 2009, vol. 2009 IEEE, pp. 1208–1215,
2009.
[90] R. C. Gonzalez and R. E. Woods, Digital Image Processing, 2nd ed. Prentice Hall, 2002.
[91] A. T. Lopes, E. de Aguiar, A. F. De Souza, and T. Oliveira-Santos, “Facial expression
recognition with Convolutional Neural Networks: Coping with few data and the training sample
order,” Pattern Recognit., vol. 61, pp. 610–628, 2017.
[92] J. Redmon and A. Farhadi, “YOLO9000: Better, Faster, Stronger,” 2016.
[93] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman, “The pascal visual
object classes (VOC) challenge,” International Journal of Computer Vision, 2010. .
[94] B. Yao, X. Jiang, A. Khosla, A. L. Lin, L. Guibas, and L. Fei-Fei, “Human action recognition
by learning bases of action attributes and parts,” Proc. IEEE Int. Conf. Comput. Vis., pp. 1331–
1338, 2011.
[95] R. Girshick, “Fast R-CNN,” 2015.
[96] S. Yan, J. S. Smith, W. Lu, and B. Zhang, “Multi-branch Attention Networks for Action
Recognition in Still Images,” IEEE Trans. Cogn. Dev. Syst., vol. 14, no. 8, 2017.
[97] Y. Jia et al., “Caffe: Convolutional architecture for fast feature embedding,” in Proc. ACM Int. Conf. Multimedia, 2014.

[98] S. Chetlur and C. Woolley, “cuDNN: Efficient Primitives for Deep Learning,” arXiv Prepr.
arXiv …, pp. 1–9, 2014.
[99] A. Vedaldi and K. Lenc, “MatConvNet - Convolutional Neural Networks for MATLAB,”
Arxiv, pp. 1–15, 2014.
[100] R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Rich feature hierarchies for accurate object detection and semantic segmentation,” in Proc. IEEE CVPR, pp. 580–587, 2014.
[101] Z. Yu, C. Li, J. Wu, J. Cai, M. N. Do, and J. Lu, “Action Recognition in Still Images with
Minimum Annotation Efforts,” IEEE Trans. Image Process., vol. 25, no. 11, pp. 5479–5490,
2016.
[102] G. Gkioxari, R. Girshick, and J. Malik, “Actions and attributes from wholes and parts,” in
Proc. IEEE ICCV, 2015.
[103] Z. Zhao, H. Ma, and S. You, “Single image action recognition using semantic body part
actions,” in Proc. IEEE ICCV, 2017.
[104] M. S. Ryoo and J. K. Aggarwal, “UT-Interaction Dataset, ICPR contest on Semantic
Description of Human Activities (SDHA),” 2010.
[105] S. Chetlur et al., “cuDNN: Efficient Primitives for Deep Learning.”
[106] W. Brendel and S. Todorovic, “Learning spatiotemporal graphs of human activities,” Proc.
IEEE Int. Conf. Comput. Vis., no. Iccv, pp. 778–785, 2011.
[107] N. A. Tu, T. Huynh-The, K. U. Khan, and Y. K. Lee, “ML-HDP: A Hierarchical Bayesian
Nonparametric Model for Recognizing Human Actions in Video,” IEEE Trans. Circuits Syst.
Video Technol., vol. 29, no. 3, pp. 800–814, 2019.
[108] M. Abadi et al., “TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed
Systems,” 2016.
[109] H. Wang, M. M. Ullah, A. Kläser, I. Laptev, and C. Schmid, “Evaluation of local spatio-
temporal features for action recognition,” in Proc. BMVC, 2009.
