
Chapter 7

Human Behavior Analysis

Due to its wide range of applications, human action recognition and its representation is a popular research topic. The aim of action recognition is to automatically identify the action of a person based on some kind of sensor data. In this monograph, we focus on vision sensors that provide a stream of images over time. Detecting human actions or activities in a video stream is challenging mainly because of the following two processing steps. First, the person has to be detected within the images and his or her pose has to be estimated, which is very complex due to the many degrees of freedom of the human body (see Chap. 6). In the context of action recognition, the pose estimation step is often replaced by the calculation of a motion descriptor. Second, the successive pose or motion descriptors have to be set into temporal relation in order to identify the underlying action. Unfortunately, these relations are very complex and difficult to model in real-world scenes.
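The two processing steps described above can be sketched as a minimal pipeline skeleton. All function names and return values here are hypothetical placeholders invented for illustration, not part of any system described in this monograph; a real system would substitute an actual person detector and a learned temporal model.

```python
# Minimal sketch of the two-step pipeline: (1) per-frame person
# detection plus a motion descriptor, (2) temporal classification.
# Every function below is a hypothetical placeholder.

def detect_person(frame):
    # Step 1a: locate the person in the image (placeholder:
    # returns a dummy full-frame bounding box).
    return (0, 0, frame["width"], frame["height"])

def motion_descriptor(frame, box):
    # Step 1b: instead of full pose estimation, compute a simple
    # motion descriptor for the detected region (placeholder value).
    return [0.0, 0.0]

def classify_sequence(descriptors):
    # Step 2: set the per-frame descriptors into temporal relation
    # to identify the underlying action (placeholder decision rule).
    return "unknown" if not descriptors else "walking"

def recognize_action(video):
    descriptors = []
    for frame in video:
        box = detect_person(frame)
        descriptors.append(motion_descriptor(frame, box))
    return classify_sequence(descriptors)
```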

7.1 Related Work


In the following we briefly summarize related work. A more general overview of human motion capture, analysis, and its representation can be found in [149, 150, 183, 1]. A very important property of actions is their hierarchical nature. We use the action hierarchy notation proposed in [150]: action/motor primitives, actions, and activities. Action primitives or motor primitives are atomic entities that are used to build actions. Actions are, in turn, combined into activities. Behavior is another term that is often used in the context of activity recognition. In this monograph we use the term to describe a composition of activities. Behavior patterns are therefore a high-level representation of complex sequences of low-level actions.
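The three-level hierarchy of [150] can be made concrete with a small data-structure sketch. The labels below (e.g. "fetch_coffee") are invented examples chosen for illustration, not taken from the cited taxonomy:

```python
# Illustrative encoding of the action hierarchy: primitives are
# atomic, actions are sequences of primitives, activities are
# sequences of actions. All concrete labels are invented examples.

action_primitives = {
    "step_left": "shift weight onto the left foot",
    "step_right": "shift weight onto the right foot",
    "reach": "extend an arm toward an object",
}

actions = {
    "walk": ["step_left", "step_right"],  # composed of primitives
    "grasp_cup": ["reach"],
}

activities = {
    "fetch_coffee": ["walk", "grasp_cup", "walk"],  # composed of actions
}

def expand_activity(name):
    # Flatten an activity into its underlying primitive sequence.
    primitives = []
    for action in activities[name]:
        primitives.extend(actions[action])
    return primitives
```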

7.1.1 Low-Level Action Primitives


Most of the following approaches represent actions at a low level. Polana and Nelson [182] used spatiotemporal templates of motion features to recognize action primitives such as walking and climbing. Efros et al. [46] recognized simple actions
© Springer International Publishing Switzerland 2015
J. Spehr, On Hierarchical Models for Visual Recognition & Learning of Objects, Scenes, & Activities, Studies in Systems, Decision and Control 11, DOI: 10.1007/978-3-319-11325-8_7

like running and walking in low-quality video streams in which the person is only 30 pixels tall. They used a simple normalized-correlation based tracker to obtain a figure-centric sequence and calculated features based on blurred optical flow. Actions are classified by matching these features against a database. Ali and Shah [4] also used optical flow. They derived kinematic features such as divergence, vorticity, and symmetry from the flow field and used them to specify spatiotemporal patterns.
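Two of the kinematic features just mentioned, divergence and vorticity, are derived from the spatial derivatives of the flow field. The sketch below computes them with finite differences on a synthetic flow; this is only a minimal illustration of the quantities involved, not the exact feature set of Ali and Shah [4].

```python
import numpy as np

def kinematic_features(u, v):
    """u, v: 2D arrays with the horizontal/vertical flow components."""
    du_dy, du_dx = np.gradient(u)   # np.gradient returns (rows, cols)
    dv_dy, dv_dx = np.gradient(v)
    divergence = du_dx + dv_dy      # expansion/contraction of the flow
    vorticity = dv_dx - du_dy       # local rotation of the flow
    return divergence, vorticity

# Synthetic example: the purely expanding flow u = x, v = y has
# divergence 2 and vorticity 0 everywhere.
y, x = np.mgrid[0:5, 0:5].astype(float)
div, vort = kinematic_features(x, y)
```

Because the example flow is linear in x and y, the finite differences are exact here; on real optical flow the derivatives are of course only approximations.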
Natarajan and Nevatia [163] extended the work of Efros et al. [46]: they explored a set of possible windows at each time step and considered large scale differences in order to improve robustness. In contrast to [46], where the optical flow was used directly as feature representation, Riemenschneider et al. [191] proposed to extract stable binary optical flow volumes using a maximally stable volumes detector. These sets of binary optical flow volumes were used as features in a 3D shape context descriptor. Another low-level representation of actions is given by motion history images [18], in which each pixel describes the motion that occurred at that point during the previous time steps. Since actions are represented as images, simple template matching can be used to recognize them. Laptev and Lindeberg [118] defined
space-time interest points as local structures in space-time where the image values have significant local variations in both space and time. The interest points are detected using an extension of the Harris interest point detector. The primitive descriptor characterizes the spatio-temporal neighborhood of the primitive and is built from normalized spatio-temporal Gaussian derivatives; Laptev and Lindeberg called this descriptor a local jet. A compact and view-invariant representation of action primitives was introduced by Rao et al. [187]. The primitives are defined in units called dynamic instants, which are computed from discontinuities of the 2D trajectories of body parts such as the hand. The primitives were described using a parameter 'sign', which represents the change of the motion direction at the instant; furthermore, the time period between two dynamic instants was used. Similar to dynamic instants are the key poses used by Reng et al. [190], which are found based on the curvature and covariance of the normalized trajectories. Lu and Ferrier [136] used simple linear dynamic models to define action primitives.
These primitives were automatically detected by applying a two-threshold, multidimensional segmentation algorithm to complex motions. The primitive descriptor consists of two matrices which describe the deterministic and the stochastic part of the motion. Another feature detector was proposed by Dollar et al. [42]. They detected spatiotemporal interest points by applying a quadrature pair of 1D Gabor filters to the temporal dimension of each image point and searching for local maxima of the response function. The detected interest points were described by cuboid descriptors, which contain the spatiotemporal neighborhood.
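The temporal response of this detector can be sketched as follows: a quadrature pair of 1D Gabor filters (an even cosine and an odd sine, both under a Gaussian envelope) is convolved along the time axis of each pixel, and the response is the sum of the squared even and odd outputs. The spatial Gaussian smoothing of the full detector of [42] is omitted here, and the parameter values and synthetic input are illustrative choices only.

```python
import numpy as np

def gabor_pair(tau=3.0, half=5):
    # Quadrature pair of 1D Gabor filters; the coupling omega = 4/tau
    # and the filter length are assumed, illustrative settings.
    t = np.arange(-half, half + 1, dtype=float)
    omega = 4.0 / tau
    envelope = np.exp(-t**2 / tau**2)
    h_even = -np.cos(2 * np.pi * t * omega) * envelope
    h_odd = -np.sin(2 * np.pi * t * omega) * envelope
    return h_even, h_odd

def temporal_response(video):
    """video: array of shape (T, H, W) -> response of the same shape."""
    h_even, h_odd = gabor_pair()
    even = np.apply_along_axis(np.convolve, 0, video, h_even, mode="same")
    odd = np.apply_along_axis(np.convolve, 0, video, h_odd, mode="same")
    return even**2 + odd**2   # large where intensity oscillates in time

# A pixel whose intensity flickers over time responds much more
# strongly than a pixel with constant intensity.
video = np.zeros((20, 1, 2))
video[::2, 0, 0] = 1.0   # flickering pixel
video[:, 0, 1] = 1.0     # constant pixel
response = temporal_response(video)
```

Thresholding or local-maximum search over such a response map would then yield the interest points; that step is not shown here.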

7.1.2 Action Recognition


In more natural scenes, short action primitives are combined into complex activities. Generally, the direct estimation of the posterior probabilities allows discriminative approaches (e.g., conditional random fields [217] or support vector machines [207]) to achieve better classification performance than generative ones. However,
