Context Information for Human Behavior Analysis and Prediction

J. Calvo1, M.A. Patricio2, C. Cuvillo1, and L. Usero3

1 Dpto. de Organización y Estructura de la Información, Universidad Politécnica de Madrid, Spain
2 Grupo de Inteligencia Artificial Aplicada, Dpto. de Informática, Universidad Carlos III de Madrid, Spain
3 Dpto. Ciencias de la Computación, Universidad de Alcalá, Spain
{ccuvillo,jcalvo},,

Abstract. This work is placed in the context of computer vision and ubiquitous multimedia access. It deals with the development of an automated system for human behavior analysis and prediction that uses context features as a representative descriptor of human posture. In the proposed method, an action is composed of a series of features over time. Time-sequential images expressing a human action are therefore transformed into a feature vector sequence, and the feature vector sequence is then transformed into a symbol sequence. For that purpose, we design a posture codebook, which contains representative features of each action type, and define distances to measure the similarity between feature vectors. The system is also able to predict the next motion to be performed. This prediction helps to evaluate and choose the current action to show.


In the last decade, we have witnessed a more user-centered implementation of computer science research. Ambient Intelligence (AmI) is a new paradigm that promotes the advancement of science and technology to build smart environments. AmI proponents advocate an invisible technological support layer of information processing to improve the quality of life in public and private spaces [1]. AmI puts forward the criteria for the design of intelligent environments with ears and eyes [2]. AmI can monitor users and create a safety net around them, making their surroundings more secure and more pleasant to inhabit. In AmI applications, machines are able to understand information about their environments. Computer vision research, which is dedicated to interpreting image sequences, deals with this kind of problem [3]. Recent work in computer vision, from fixed images to recorded video sequences, has mainly concentrated on the evaluation of behavior recognition. Experiments in computer science research are concentrated on the
Funded by project IMSERSO-AUTOPIA.
J. Calvo et al.

development of new methods for data analysis. The main tasks consist of extracting and processing as much information as possible about the objects and humans in the scene [4]. With regard to behavior understanding, several methods for detection in specific domains can be found. Our approach focuses on the identification and classification of human actions. It belongs to motion analysis, specifically to activity classification and motion detection. This paper presents an approach that falls under Indirect Model Use [5], starting from a two-dimensional (2D) blob representation (a blob being the bounding box of a mobile object). We next reference some recent related works that use similar techniques.

1.1 Related Works

Blob representation is normally described by some of the figure-ground segmentation approaches. The object or the subject is represented as a blob or a number of blobs, each having similar features. The similarities can be coherent flow [6], similar colours [7], or both [8]. The main philosophy of grouping information according to similarities is inspired by research into the human visual system by the Gestalt school in the 1930s [9].

With regard to pose estimation, our approach uses an indirect model. Example-based approaches use a database that describes poses in both image space and pose space. Our work falls within the look-up table category: in our case, the queried information is structured data stored in codebooks, as explained in the following sections. The methods in this class use an a priori model when estimating the pose of the subject. They use the model as a reference (similar to [10]) or as query tables from which relevant information may be obtained to drive the representation of the extracted data. Mori and Malik [11] extract the external and internal contours of an object. Shape contexts are employed to encode the edges. In an estimation step, the stored exemplars are deformed to match the image observation. In this deformation, the hand-labelled 2D locations of the joints also change. The most likely 2D joint estimate is found by enforcing 2D image distance consistency between body parts. Shape deformation is also used by Sullivan and Carlsson [12]. To improve the robustness of the point transferral, the spatial relationship of the body points and color information is exploited. Loy et al. [13] perform interpolation between key frame poses based on [14] and additional smoothing constraints. Manual intervention is necessary in certain cases.

With regard to dimensionality, 2D models are the most suitable for our representation. They are appropriate for motion parallel to the image plane and are sometimes used for motion analysis, as in our approach. Ju et al. [6] and Howe et al. [15] used a so-called Cardboard model in which the limbs are modeled as planar patches. Each segment has 7 parameters that allow it to rotate and scale according to the 3D motion. Navaratnam et al. [16] take a similar approach but model some parameters implicitly. In [17], an extra patch width parameter was added to account for scaling during in-plane motion. In [18], [19], [20], the human body is described by a 2D scaled prismatic model [21].



Fig. 1. General Scheme

System Overview

The proposed system is depicted in Figure 1. The system architecture is divided into four main parts: feature extraction, mapping of features to codebooks, current state behavior recognition, and next state behavior prediction. These parts are explained in detail in the following subsections. For feature extraction, we use background subtraction and threshold the difference between the current frame and the background image to segment the foreground object. After the foreground segmentation, we extract the blob features using the techniques described in Section 2.1. The extracted features are used for later action recognition. After feature extraction, a Symbols Sequence Vector is used to store the mapping between the codebooks of each behavior (e.g. walking, running) and a mobile object. Each behavior to be analyzed is represented by a codebook. A virtual window runs along the Symbols Sequence and retrieves the most recent states. State behavior recognition is carried out by means of matches. Each match belongs to one kind of motion, so we calculate the number of matches belonging to each behavior using a distance measure. Using a probability computation, we add the necessary information about the next state prediction. The prediction feeds back into the choice of the current state, and so on (see Figure 1).

2.1 Feature Extraction
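As an illustration only, the foreground segmentation step described in the system overview (background subtraction followed by thresholding) might be sketched as follows. The function names, the toy image values and the threshold are hypothetical, and the sketch assumes a single foreground object per frame, so the bounding box of all foreground pixels is taken as the blob.

```python
def segment_foreground(frame, background, threshold):
    """Return a binary mask: 1 where |frame - background| > threshold."""
    return [
        [1 if abs(f - b) > threshold else 0 for f, b in zip(frame_row, bg_row)]
        for frame_row, bg_row in zip(frame, background)
    ]


def bounding_box(mask):
    """Return (x_min, y_min, width, height) of the foreground pixels, or None.

    Assumption: all foreground pixels belong to one object, so no
    connected-component labelling is performed.
    """
    points = [(x, y) for y, row in enumerate(mask) for x, v in enumerate(row) if v]
    if not points:
        return None
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return min(xs), min(ys), max(xs) - min(xs) + 1, max(ys) - min(ys) + 1


# Toy 4x5 grayscale images with made-up intensity values.
background = [[10] * 5 for _ in range(4)]
frame = [row[:] for row in background]
frame[1][2] = 200
frame[2][2] = 190
frame[2][3] = 180

mask = segment_foreground(frame, background, threshold=30)
box = bounding_box(mask)
print(box)
```

The width and height of the returned box are the blob dimensions that the feature extraction stage works with.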

Human action is composed of a series of postures over time. Once a mobile object is extracted from the background, it is represented by its boundary shape. A filtering mechanism that predicts each movement of the detected object is a common tracking method. The most commonly used filter is the Kalman filter [22]. Good feature extraction is essential for success. Features can be obtained from the processed blob in many ways; we have selected simple information such as blob width and height, and we register changes in blob position by means of the Kalman filter. The Kalman filter is a set of mathematical equations that provides an efficient computational (recursive) means to estimate the state of a process, in a way that minimizes the mean of the squared error. The filter is very powerful in several aspects: it supports estimations of past, present, and even future states, and it can do so even when the precise nature of the modeled system is unknown. These equations provide us with features such as velocity along the x and y axes, height, width and blob identification. We use these features to obtain data over time.

Input: Blob representation
Output: A set of blob features

1. Calculating a global velocity, which gives us information about the blob position with respect to the x and y axes at the same time. We use this velocity to estimate the variation of the blob position in pixels per frame.

$BlobVelocity = \sqrt{v_x^2 + v_y^2}$

where $v_x$ and $v_y$ are the velocity components along the x and y axes obtained by the Kalman filter. We store these changes of position in a static array that preserves its values throughout the execution of the algorithm.

2. Obtaining the difference between height and width. As we will see later, this is very useful data for measuring changes in human motion.

$difHW = BlobHeight - BlobWidth$

Again, we store the relation between height and width during the whole execution.

3. Storing the blob area from the Kalman data. This number is provided by the blob tracking post-processing built on the Kalman equations.

4. Identifying blobs. Each blob registered by the tracker has an ID number.

2.2 Mapping Features to Symbol
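The per-frame features of Section 2.1 can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the velocity components, width, height and area are taken as already provided by the tracker, the difference is computed as height minus width, and the names BlobFeatures, blob_features and history are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class BlobFeatures:
    blob_id: int      # ID assigned by the tracker
    velocity: float   # global velocity, in pixels per frame
    dif_hw: float     # height minus width
    area: float       # blob area reported by the tracker


def blob_features(blob_id, vx, vy, width, height, area):
    """Build the feature record for one blob in one frame."""
    velocity = math.hypot(vx, vy)  # sqrt(vx**2 + vy**2)
    return BlobFeatures(blob_id, velocity, height - width, area)


# History kept for the whole execution, one entry per frame (made-up values).
history = []
history.append(blob_features(1, vx=3.0, vy=4.0, width=40.0, height=110.0, area=4400.0))
print(history[0])
```

An upright, slowly moving person gives a large positive dif_hw and a small velocity, which is the kind of contrast the later matching step relies on.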

To apply our probability model and next state prediction to time-sequential video, the extracted features must be transformed into a symbol sequence for later action recognition. This is accomplished by a technique called Symbols Sequence.

Symbols Sequence Vector. In order to explain the symbols sequence vector, we must introduce the concept of a Codebook. A Codebook is a set of symbols. Each symbol corresponds to a representative posture feature within a human action. We have one Codebook per human action (e.g. a codebook for walking).



$C_{jn}$, where $j \in$ Human Action and $n \in$ Number Of Symbols Set (e.g. $C_{r18}$ would be symbol 18 inside the running codebook). These numbers can vary depending on the target of the application (e.g. a security application would add a new codebook for fighting). We can also extend the number of symbols representing each action to improve the level of accuracy. Each symbol in each codebook has some associated properties. When the extracted blob is processed, we obtain the features mentioned before. These features represent the action that the human is carrying out. We therefore compare the symbols of each codebook with the extracted features and select the symbol at minimal distance. At that moment, the selected symbol is inserted into the Symbols Sequence Vector (SSV). The SSV has as many entries as there are frames in the video sequence.

2.3 Behavior Recognition
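The minimal-distance mapping of Section 2.2 can be sketched as follows. The text does not fix a distance metric, so Euclidean distance is used here as an assumption; the codebook contents, the two-component feature (velocity, height minus width) and the function names are likewise hypothetical.

```python
import math

# Hypothetical codebooks: action -> {symbol number: (velocity, dif_hw)}.
CODEBOOKS = {
    "walking": {13: (2.0, 70.0), 14: (2.5, 65.0)},
    "running": {17: (6.0, 60.0), 18: (7.0, 50.0)},
}


def nearest_symbol(feature, codebooks):
    """Return (action, symbol number) of the closest representative feature."""
    best = None
    for action, symbols in codebooks.items():
        for number, representative in symbols.items():
            distance = math.dist(feature, representative)  # Euclidean (assumed)
            if best is None or distance < best[0]:
                best = (distance, action, number)
    return best[1], best[2]


def build_ssv(features, codebooks):
    """Map one feature vector per frame to one SSV entry per frame."""
    return [nearest_symbol(feature, codebooks) for feature in features]


ssv = build_ssv([(2.1, 69.0), (2.6, 66.0), (6.8, 52.0)], CODEBOOKS)
print(ssv)
```

Because the features are not rescaled in this sketch, the component with the larger numeric range dominates the distance; a real system would probably normalize or weight the components.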

The action recognition module involves two submodules: recognition and prediction. These two phases work together, and the selected data move from one submodule to the other. There is a kind of feedback: we cannot know the behavior without the help of prediction and, of course, we cannot predict without knowledge of the current state of behavior. Once we have a number of symbols in the SSV, the system begins to scan this vector through a systematic procedure called the sliding Window. This Window has a variable size W, which depends, once again, on the application target. The Window gives us the current symbol, which matches the current state of behavior, and also the symbols that match the last n states of behavior. The system takes these last and current symbols and compares them with the declared codebooks. When a symbol matches one of the symbols inside a codebook, we add a new value to a vector associated with that codebook. At the end of this process, we check how many values have been added to these vectors. To obtain the probability of a current state, the system multiplies the number of added values by a factor. This factor is obtained by dividing the maximum probability of an event by the size of the Window, that is, the number of states that we want to observe. Finally, the event with the greatest probability is the selected state. It is important to bear in mind that this overall process happens very fast. If we choose, for example, to observe the last five frames in order to determine the current state of behavior, we are watching no more than half a second of the video sequence.

2.4 Next State Prediction
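The window-based voting of Section 2.3 can be sketched as follows. This is a minimal sketch, assuming the SSV holds plain symbol numbers, each codebook is a set of symbol numbers, and the maximum probability of an event is 1.0; the symbol values and the function name recognize are hypothetical.

```python
def recognize(ssv, codebooks, window_size, max_probability=1.0):
    """Return (selected action, per-action probabilities) for the last W symbols."""
    window = ssv[-window_size:]
    factor = max_probability / window_size  # weight of a single match
    counts = {action: 0 for action in codebooks}
    for symbol in window:
        for action, symbols in codebooks.items():
            if symbol in symbols:
                counts[action] += 1
    probabilities = {action: count * factor for action, count in counts.items()}
    # On a tie, max() keeps the first action in dictionary order.
    selected = max(probabilities, key=probabilities.get)
    return selected, probabilities


codebooks = {
    "walking": {13, 14, 15, 16},
    "running": {17, 18, 19, 20},
}
ssv = [13, 14, 17, 14, 15, 16, 18]

selected, probabilities = recognize(ssv, codebooks, window_size=5)
print(selected, probabilities)
```

With a window of five symbols, each match contributes one fifth of the maximum probability, so three walking matches against two running matches selects walking.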

The following is a more complex situation. What we have described above could be used for classifying single actions. A person performs a series of actions, and we recognize which action is being performed now. One may want to recognize the action by classifying the posture at the current time T. However, there is a problem. By observation, we can classify postures into two classes: key postures and transitional postures. Key postures belong uniquely to one action, so people can recognize the action from a single key posture. Transitional postures are intermediate between two actions, and even a human cannot recognize the action from a single transitional posture. Therefore, because of transitional postures, human action cannot be recognized from the posture in a single frame. We thus refer to a period of posture history to find the action the human is performing. We use the same sliding window as before for real-time action recognition. At the current time T, the symbol subsequence between T-W and T, which is a period of posture history, is used to recognize the current action by computing the maximal likelihood. As we have seen before, each codebook is composed of symbols. These symbols are ordered on the real line (e.g. the symbols of the walking codebook could be 13, 14, 15, 16). From the values of these symbols we can compute an arithmetic mean (in the last example, it would be 14.5). An arithmetic mean can be computed over any set of symbols that belong to codebooks, and this is the case for the Window: depending on its size W, we also obtain an arithmetic mean. Thus, we can compare the arithmetic mean of the codebook that represents the current posture with the arithmetic mean of the symbols inside the sliding Window. By measuring distances between the arithmetic means of the possible future postures, we assign a relative influence to two variables, defined as alpha ($\alpha$) and beta ($\beta$). If the distance from the arithmetic mean of the codebook symbols that represent the current posture is smaller than that arithmetic mean itself, we can guess that the posture is changing to another one. In that case, the system assigns a large relative influence to one of the two variables; otherwise, the large relative influence goes to the other. Several cases can obviously arise, so the values of $\alpha$ and $\beta$ are continuously changing.
In order to predict the next state, we use these variables together with the current posture number and the predicted posture number. These numbers are easy to obtain: they come from the last iteration, since each state has a number. The result of the following equation is also a number, which lies close to one of the state numbers mentioned before; that state is taken as the next state. Note that the predicted posture can be the same as the current one if the human is not performing any change.

$NextState = (\alpha \cdot CurrentStateNumber) + (\beta \cdot PredictedStateNumber)$

In short, the system selects two positions per frame: the current one and the predicted one.
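A small numeric sketch of the next state computation is given here for illustration. It assumes that alpha weights the current state number and beta weights the predicted state number, that the two weights sum to one, and that the result is snapped to the nearest declared state number; the state numbers, weight values and function name next_state are hypothetical, and the rule that sets alpha and beta from the distances between means is not reproduced.

```python
from statistics import mean


def next_state(current, predicted, alpha, beta, state_numbers):
    """Blend the two state numbers and return the closest declared state."""
    value = alpha * current + beta * predicted
    return min(state_numbers, key=lambda state: abs(state - value))


# Arithmetic mean of the example walking codebook from the text.
walking_mean = mean([13, 14, 15, 16])

states = [1, 2, 3, 4, 5]  # hypothetical numbering, one number per state

# Posture assumed to be changing: most of the weight goes to the prediction.
changing = next_state(current=2, predicted=3, alpha=0.25, beta=0.75, state_numbers=states)

# Posture assumed to be stable: most of the weight stays on the current state.
stable = next_state(current=2, predicted=3, alpha=0.75, beta=0.25, state_numbers=states)

print(walking_mean, changing, stable)
```

In the first call the blended value is 2.75, which lies closest to state 3; in the second it is 2.25, which lies closest to state 2.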

Experimental Results

In order to evaluate the quality and the performance of our work, we have implemented a system that is able to identify and recognize a series of actions. Recognition over a series of actions is carried out in a real-time framework. The proposed system is able to show the current action that the human is performing and the predicted behavior. The predicted behavior has a relative influence on the current action, as explained before.



Fig. 2. States Diagram used in experiments

The video content was captured by a digital camera. The number of frames was chosen experimentally. Sequences that are too short do not provide enough time for refreshing the necessary data; on the other hand, sequences that are too long are quite difficult to manage. The video sequences were recorded outdoors. We took care not to make many assumptions, and we therefore consider the system to be quite robust. Recording outdoors is more difficult but more realistic; we do not rely on prepared scenarios that depend on a large number of assumptions (see Figure 4). In our sequence of experiments, we have tried to keep one subject performing the actions. Five Codebooks have been declared, one codebook per action. Each codebook is composed of four symbols (see Figure 3). The different actions considered can be observed in the states diagram (see Figure 2). Transitions in the diagram represent the degrees of freedom of human behavior.

Fig. 3. Simplified example of matching



Fig. 4. Example of recorded frame

Fig. 5. Confusion matrix for recognizing a series of actions

The Confusion Matrix (see Figure 5) shows results with more than 90% success. These results belong to the experiment described above. Percentages refer to the total number of recorded frames. Each frame has its ground truth, so we have measured how many frames were correctly recognized. The left side is the ground truth of the action types, and the upper side is the recognized action types. The numbers on the diagonal are the percentages of each action that are correctly classified. The percentages off the diagonal are action recognition errors, and they show which kind of action was identified by the system instead. The table shows that most of the posture sequences were correctly classified. A recognition rate of around 90% was achieved by the proposed method. Confusions happen when the human is not moving: depending on the accuracy of the tracking, the centroid jitters, and the system interprets this as movement. The unknown period is the time during which the human performs actions that are not defined, that is, transitions between postures.
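To make the reading of such a matrix concrete, the following sketch computes row percentages and the overall frame-level accuracy from a matrix of frame counts. The counts below are invented purely for illustration and are not the results reported in Figure 5; the three action labels and the function names are also hypothetical. Rows are ground truth and columns are recognized actions, as in the description above.

```python
def row_percentages(matrix):
    """Convert frame counts to percentages of each ground-truth row."""
    return [[100.0 * count / sum(row) for count in row] for row in matrix]


def overall_accuracy(matrix):
    """Percentage of all frames that lie on the diagonal."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return 100.0 * correct / total


# Invented frame counts for three hypothetical actions.
labels = ["walking", "running", "standing"]
counts = [
    [90, 5, 5],      # ground truth: walking
    [10, 180, 10],   # ground truth: running
    [2, 3, 95],      # ground truth: standing
]

percentages = row_percentages(counts)
accuracy = overall_accuracy(counts)
for label, row in zip(labels, percentages):
    print(label, row)
print(accuracy)
```

The diagonal entry of each row is the share of that action's frames that were recognized correctly, and the off-diagonal entries show which action the frames were mistaken for.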

Conclusions and Future Work

We have presented a robust system for human action recognition based on the information in the extracted features. These symbols are transformed into numbers when they are mapped into a structure called the Symbols Sequence Vector. Action recognition is achieved by means of distances between the extracted features and the codebooks, with one codebook per human action. The recognition accuracy could still be improved if the tracking system were optimized. The system is also able to predict the next state, which contributes, with some relative influence, to the choice of the current action to show. The system is able to identify a series of actions against an outdoor background that can change. Not many assumptions and constraints are required. The system could be improved in some ways. We can transform the distance method into a fuzzy logic structure. Therefore, the future of our approach is training the system. The system could also be scaled up; that is, we can add more kinds of actions, depending on the target of the system. It would be easy to carry out further experiments, e.g. for security applications: we would only need to add new states representing the chosen field, with suitable codebooks such as fighting, snooping, stealing, etc.

