
Manfredas Zabarauskas

3D Display Simulation Using Head-Tracking with Microsoft Kinect

Computer Science Tripos, Part II University of Cambridge Wolfson College May 14, 2012

Proforma
Name: Manfredas Zabarauskas
College: Wolfson College
Project Title: 3D Display Simulation Using Head Tracking with Microsoft Kinect
Examination: Part II in Computer Science, June 2012
Word Count: 11,976¹
Project Originator: M. Zabarauskas
Supervisor: Prof N. Dodgson

Original Aims of the Project


The main project aim was to simulate depth perception using motion parallax on a regular LCD screen, without requiring the user to wear glasses or other headgear, or to modify the screen in any way. Such simulated 3D displays could serve as a stepping-stone between full 3D displays (providing the stereopsis depth cue) and the currently pervasive 2D displays. The proposed approach for achieving this aim was to track the viewer's head based on the colour and depth data provided by the Microsoft Kinect sensor.

Work Completed
In order to detect the viewer's face, a distributed Viola-Jones face detector training framework has been implemented, and a colour-based face detector cascade has been trained. To track the viewer's head, a combined colour- and depth-based approach has been proposed. The combined head-tracker was able to predict the viewer's head center location to within less than 1/3 of the head's size from the actual head center on average. A proof-of-concept 3D display system (using the created head-tracking library) has also been implemented, simulating pictorial and motion parallax depth cues. A short demonstration of the working system can be seen at http://zabarauskas.com/3d.

Special Difficulties
None.
¹ Computed using detex diss.tex | tr -cd '0-9A-Za-z \n' | wc -w, excluding the proforma and appendices.

Declaration
I, Manfredas Zabarauskas of Wolfson College, being a candidate for Part II of the Computer Science Tripos, hereby declare that this dissertation and the work described in it are my own work, unaided except as may be specified below, and that the dissertation does not contain material that has already been used to any substantial extent for a comparable purpose. Signed Date


Contents
1 Introduction
   1.1 Motivation
   1.2 Human Depth Perception
       1.2.1 Depth Cue Comparison
   1.3 Related Work on 3D Displays
   1.4 Applications
   1.5 Detailed Project Aims

2 Preparation
   2.1 Starting Point
   2.2 Project Methodology
   2.3 Requirements Analysis
       2.3.1 Risk Analysis
       2.3.2 Problem Constraints
       2.3.3 Data Flow and System Components
   2.4 Image Processing and Computer Vision Methods
       2.4.1 Viola-Jones Face Detector
       2.4.2 CAMShift Face Tracker
       2.4.3 ViBe Background Subtractor
   2.5 Depth-Based Methods
       2.5.1 Peters-Garstka Head Detector
       2.5.2 Depth-Based Head Tracker
   2.6 Summary

3 Implementation
   3.1 Development Strategy
   3.2 Languages and Tools
       3.2.1 Libraries
       3.2.2 Development Language
       3.2.3 Development Environment
       3.2.4 Code Versioning and Backup Policy
   3.3 Implementation Milestones
   3.4 High-Level Architecture
   3.5 Viola-Jones Detector Distributed Training Framework
       3.5.1 Architecture
       3.5.2 Class Structure
       3.5.3 Behaviour
   3.6 Head-Tracking Library
       3.6.1 Head-Tracker Core
       3.6.2 Colour-Based Face Detector
       3.6.3 Colour-Based Face Tracker
       3.6.4 Colour- and Depth-Based Background Subtractors
       3.6.5 Depth-Based Head Detector and Tracker
       3.6.6 Tracking Postprocessing
   3.7 3D Display Simulator
       3.7.1 3D Game (Z-Tris)
   3.8 Summary

4 Evaluation
   4.1 Viola-Jones Face Detector
       4.1.1 Training Data
       4.1.2 Trained Cascade
       4.1.3 Face Detector Accuracy Evaluation
       4.1.4 Face Detector Speed Evaluation
       4.1.5 Summary
   4.2 HT3D (Head-Tracking in 3D) Library
       4.2.1 Tracking Accuracy Evaluation
       4.2.2 Performance Evaluation
   4.3 3D Display Simulator (Z-Tris)

5 Conclusions
   5.1 Accomplishments
   5.2 Future Work

A Depth Cue Perception
   A.1 Oculomotor Cues
   A.2 Monocular Cues
       A.2.1 Pictorial Cues
       A.2.2 Motion Cues
   A.3 Binocular Cues

B 3D Display Technologies
   B.1 Binocular (Two-View) Displays
   B.2 Multi-View Displays
   B.3 Light-Field (Volumetric and Holographic) Displays
   B.4 3D Display Comparison w.r.t. Depth Cues
   B.5 3D Display Applications
       B.5.1 Scientific and Medical Software
       B.5.2 Gaming, Movie and Advertising Applications

C Computer Vision Methods (Additional Details)
   C.1 Viola-Jones Face Detector
       C.1.1 Weak Classifier Boosting using AdaBoost
       C.1.2 Best Weak-Classifier Selection
       C.1.3 Cascade Training
   C.2 CAMShift Face Tracker
       C.2.1 Mean-Shift Technique
       C.2.2 Centroid and Search Window Size Calculation
   C.3 ViBe Background Subtractor
       C.3.1 Background Model Initialization
       C.3.2 Background Model Update

D Depth-Based Methods (Additional Details)
   D.1 Depth Data Preprocessing
       D.1.1 Depth Shadow Elimination
       D.1.2 Real-Time Depth Image Smoothing
   D.2 Depth Cue Rendering
       D.2.1 Generalized Perspective Projection
       D.2.2 Real-Time Shadows using Z-Pass Algorithm with Stencil Buffers

E Implementation (Additional Details)
   E.1 Viola-Jones Distributed Training Framework
   E.2 HT3D Library
       E.2.1 Head Tracker Core
       E.2.2 Colour- and Depth-Based Background Subtractors
   E.3 3D Display Simulator Components
       E.3.1 Application Entry Point
       E.3.2 Head Tracker Configuration GUI
       E.3.3 3D Game (Z-Tris)

F HT3D Library Evaluation (Additional Details)
   F.1 Evaluation Metrics
       F.1.1 Sequence Track Detection Accuracy
       F.1.2 Multiple Object Tracking Accuracy/Precision
       F.1.3 Average Normalized Distance from the Head Center
   F.2 Evaluation Set
       F.2.1 Viola-Jones Face Detector Output
       F.2.2 Metric for Individual Recordings
       F.2.3 MOTA/MOTP Evaluation Results

G 3D Display Simulator (Z-Tris) Evaluation
   G.1 Automated Testing
   G.2 Manual Testing
   G.3 Performance

H Sample Code Listings

I Project Proposal

Chapter 1 Introduction
This chapter describes the motivation for a three-dimensional display simulation using Microsoft Kinect, the basic workings of human depth perception (in order to understand how it could be simulated), the related work on 3D display simulation, and the main applications of 3D displays.

1.1 Motivation

The ideas and research about three-dimensional displays can be traced back to the mid-nineteenth century, when Wheatstone first demonstrated his findings about stereopsis to the Royal Society of London. Glasses-based stereoscopic 3D display systems have been available for a few decades, and in the last decade a number of usable glasses-free autostereoscopic systems became available as well. Nevertheless, 3D displays have struggled to break out of their niche markets because of their relatively low quality and high price when compared to conventional displays. In November 2010, Microsoft launched the Kinect sensor, containing an IR depth-finding camera. It became a huge commercial success, entering the Guinness World Records

Figure 1.1: Just-discriminable depth thresholds for two objects at distances D1 and D2, as a function of the logarithm of distance from the observer, for the nine depth cues. The depth of the two objects is represented by their average distance (D1 + D2)/2, and depth contrast is obtained by calculating 2(D2 − D1)/(D1 + D2). Reproduced from Cutting and Vishton, 1995 [13].


as the fastest-selling consumer electronics device, with 18 million units sold as of January 2012. Based on this new development, an idea was conceived to explore the applicability of the cheap and ubiquitous Kinect sensor in creating depth perception on existing, widespread, high-quality single-view displays. The crucial first step in developing such a system is to understand the main principles of human depth perception.

1.2 Human Depth Perception

According to Goldstein [19], all depth cues can be classified into three major groups:
1. Oculomotor cues (based on the human ability to sense the position of the eyes and the tension in the eye muscles),
2. Monocular cues (using the input from just one eye),
3. Binocular cues (using the input from both eyes).
These major groups (together with the definitions used in the rest of this chapter) are fully described in appendix A.

1.2.1 Depth Cue Comparison

The relative efficacy and importance of various depth cues has been summarized by Cutting and Vishton [13]. Figure 1.1 presents the just-discriminable depth thresholds as a function of the logarithm of distance from the observer for each of the depth cues, and table 1.1 describes the relative importance of these depth cues in three circular regions around the observer. In particular, occlusion, stereopsis and motion parallax are distinguished as the most important cues for depth perception in the low to average viewing distance ranges.

Table 1.1: Ranking of depth cues in the observer's space, obtained by integrating the area under each depth-threshold function from figure 1.1 within each spatial region and comparing the relative areas. The table ranks occlusion, relative size, relative density, relative height, atmospheric perspective, motion parallax, convergence, accommodation and stereopsis within the 0–2 m, 2–30 m and > 30 m regions. A lower rank means higher importance; a dash indicates that the data was not applicable to the depth cue. Based on Cutting and Vishton, 1995 [13].

Figure 1.2: A sample taxonomy of 3D display technologies. Italic font indicates autostereoscopic displays.

1.3 Related Work on 3D Displays

Physiological knowledge about human depth cue perception has been extensively applied in 3D display design, and multiple ways to classify such displays have been presented in the literature [40, 4, 14, 23]. A sample taxonomy of the currently dominating 3D display technologies is given in figure 1.2. Table 1.2 compares these display types with respect to the depth cues that they can simulate and the special equipment that they require, while a much broader discussion is given in appendix B.

Table 1.2: Comparison of different display types (binocular, multi-view, light-field² and the proposed approach) with respect to the simulated depth cues (pictorial, stereopsis, motion parallax, and accommodation & convergence match) and their requirements for special equipment (head tracking, eyewear, standard LCD/CRT monitor). The motion parallax column lists continuous parallax for binocular, light-field and the proposed displays, and discrete¹ parallax for multi-view displays.
¹ Typically only in the horizontal direction.
² Light-field displays still remain largely experimental (as described by Holliman et al. in [24]).


1.4 Applications

Dodgson [14] distinguishes two main classes of applications for autostereoscopic 3D display systems:
- Scientific and medical software, where 3D depth perception is needed for the successful completion of the task,
- Gaming and advertising applications, where the novelty of stereo parallax is useful as a commercial selling point.

Examples from these two application classes are discussed in appendix B.5.

1.5 Detailed Project Aims

To achieve the main project aim (to simulate depth perception on a regular LCD screen through the use of the ubiquitous and affordable Microsoft Kinect sensor, without requiring the user to wear glasses or other headgear, or to modify the screen in any way), the project will simulate:
- pictorial depth cues: lighting, shadows, occlusions, relative height/size/density and texture gradient (by implementing an appropriate three-dimensional scene in a 3D rendering framework),
- continuous horizontal and vertical motion parallax, through real-time head tracking using the Microsoft Kinect sensor.

The project will not aim to simulate stereopsis, because that would require modifications to the screen (a standard LCD display inherently provides a single view that is seen binocularly). For the same reason, simulating depth perception for multiple viewers using a single view will not be attempted. Since motion parallax is one of the strongest near- and middle-range depth perception cues, its simulation through viewer head tracking will be one of the main focal points of the project. It is obvious from the start that in order to achieve accurate head tracking, a significant number of computer vision and signal processing techniques will be required. Even more importantly, these algorithms will have to be extended to use the depth information provided by the Microsoft Kinect sensor. These tasks require a careful consideration of various tractability issues and a lot of attention to the computer vision techniques before embarking on the project. They will be discussed in much more detail in the following chapter.

Chapter 2 Preparation
This chapter outlines the planning and research that was undertaken before starting the implementation of the project. In particular, it discusses the starting point, the main requirements of the overall system, project methodology, risk analysis and the problem constraints. It then describes the most important theory and algorithms that were used in the project, paying particular attention to the computer vision techniques.

2.1 Starting Point

Before starting the project I had:
- basic knowledge of the Microsoft Visual Studio development environment and the C# programming language (six months working experience),
- next-to-zero practical experience with the OpenGL rendering framework,
- no experience with the Kinect SDK,
- no experience with the relevant machine learning and computer vision techniques.

2.2 Project Methodology

Agile Software Development philosophies [3] were followed in requirements analysis, design, implementation and testing. More precisely:
- Requirements analysis (section 2.3) was based on usage modelling.
- System design (section 3.4) was focused on:
  - process modelling (through data flow diagrams), and
  - architectural modelling (through component diagrams).
- Implementation was focused on:
  - a constant pace with clear milestones and deliverables (following the project proposal),
  - iterative development with weekly/bi-weekly iteration cycles,
  - continuous integration, where the working software is extended weekly/bi-weekly by adding new features, but is always kept in a working state.

- Testing was based on agile approaches (functional, sanity and usability manual testing performed continuously throughout the iteration) and automated regression unit tests (performed at the end of an iteration).

2.3 Requirements Analysis

The variety of use cases, scenarios and applications of depth displays is described in section B.5. To limit the scope of the project to something manageable in the Part II project timeframe, and at the same time to design concrete deliverables that would achieve the main aim of the project (as described in section 1.5), two simple user stories are given in table 2.1.

User: 3D Application Developer. As an application developer who wants to create her own 3D application on a regular display, I want to easily obtain the viewer's head location information, so that I can use it to render my depth-aware application accordingly.

User: Gamer. As a gamer, I want to experience a higher sense of realism when playing a 3D game, so that a) I can perform tasks that require depth estimation more easily, and b) I can experience a higher level of immersiveness in the game.

Table 2.1: User stories for the agile requirements analysis of the project.

Extrapolating from these two simple user stories, the deliverables of the project (and the main requirements for them) can be defined more precisely:

1. A head-tracking library that can be used to easily obtain the viewer's head location in three dimensions. The main requirements (in the order of their priority) are:
   (a) Accuracy: the head-tracker should correctly detect a single viewer's head in the majority of input frames (i.e. the average distance between the tracker's prediction and the actual head center in the image should not exceed 1/2 of the viewer's head size),
   (b) Performance: the head-tracker should work in real time (i.e. should process at least 30 frames per second),
   (c) Ease of use: the library should be flexible enough to be used in multiple projects.

2. A simple 3D game that simulates depth perception and requires the user to accurately estimate depth in order to achieve certain in-game goals. The main requirements are:
   (a) Continuous vertical and horizontal motion parallax depth cue simulation,
   (b) Pictorial depth cue simulation (lighting, shadows, occlusions, relative height/size/density and texture gradient),
   (c) An in-game goal system requiring the player to estimate depth accurately.


2.3.1 Risk Analysis

Undoubtedly, the biggest challenge and the highest uncertainty associated with these deliverables is the requirement for accurate real-time head tracking. For this reason, the remaining sections of this chapter (and a very significant part of the overall dissertation) are focused on successfully implementing viewer head tracking using the colour and depth information provided by Microsoft Kinect.

2.3.2 Problem Constraints

As described in section 1.5, depth perception simulation for multiple viewers will not be attempted, because it would require modifications to the screen (a standard LCD display inherently provides a single view). This reduces the complexity of head-tracking, since only a single viewer needs to be tracked. Furthermore, observe that the reference point of the tracked head location is the Kinect sensor. Since the location of the sensor might not necessarily coincide with the position of the display, the constraint that the Kinect sensor must always be placed directly above the display is imposed. This helps to avoid complicated semi-automatic calibration routines.

2.3.3 Data Flow and System Components

The head-tracking task, as the main task of the project (as described in the requirements and risk analysis), can be formalized as a sequence of data transformations, where the input data is the depth and colour streams coming from Microsoft Kinect and the transformed data is the location of the viewer's head w.r.t. the display. Based on the background research, this transformation can be broken down into individual components as shown in figure 2.1. Each of these components can be developed, tested and refined nearly independently of the others. This modular approach makes the testing and debugging process much easier, and maximises the opportunity for code reuse. It also closely adheres to the iterative prototyping style, one of the most important agile software engineering methodologies. The following sections describe the relevant theory needed to successfully implement these individual components; the actual implementation details are given in Chapter 3.

2.4 Image Processing and Computer Vision Methods

This section introduces the first three colour stream transformation algorithms (as shown in figure 2.1), viz.:
- the viewer's face detection using the Viola-Jones object detection framework (specifically trained for human faces),
- face tracking using the CAMShift object tracker, and
- image segmentation into foreground and background using the ViBe background subtractor, to improve the tracking and detection tasks.

Figure 2.1: Project data flow as a sequence of data transformations performed by corresponding algorithms. Transformations with dashed borders are optional.

2.4.1 Viola-Jones Face Detector

Face detection in unconstrained images is a difficult task due to large intra-class variations:
- differences in facial appearance (hair, beards, glasses),
- changing lighting conditions,
- within- and out-of-image-plane head rotations,
- changing facial expressions,
- impoverished image data, and so on.

In 2001, Paul Viola and Michael Jones in their seminal work [41] proposed a machine learning-based generic object detection framework. It became a de facto standard for face detection due to its rapid image processing speed and high detection accuracy. The Viola-Jones object detection framework is based on the general classification framework, i.e. given a set of N examples (x1, y1), ..., (xN, yN), where xi ∈ X are the feature vectors and yi ∈ {0, 1} is the class of the training example (non-face/face respectively), the goal is to find a classifier h : X → {0, 1} such that the misclassification error is minimized.


Figure 2.2: Three classes of features (two-rectangle, three-rectangle and four-rectangle) used in the Viola-Jones algorithm. The value of the feature (h) is defined as the sum of pixel intensities in the black region B subtracted from those in the white region W, i.e. h = Σ_{(x,y)∈B} I(x, y) − Σ_{(x,y)∈W} I(x, y).

Figure 2.3: Integral image representation used in the Viola-Jones algorithm. The value of the integral image II at coordinates (x, y) is equal to II(x, y) = Σ_{m≤x, n≤y} I(m, n), where I is the original image.

Figure 2.4: Method to rapidly (in 6–9 array references) calculate rectangle feature values: D = II(x4, y4) − C − B + A = II(x4, y4) − II(x3, y3) − II(x2, y2) + II(x1, y1).

2.4.1.1 Features

Instead of using raw pixel intensities as feature vectors in classification, higher-level features are used. There are multiple reasons for doing so: most notably, higher-level features help to encode ad-hoc domain knowledge, increase between-class variability (when compared to within-class variability) and increase the processing speed. The Viola-Jones algorithm uses Haar-like features (resembling the Haar wavelets used by Papageorgiou et al. [35]), shown in figure 2.2. The first main contribution of the Viola-Jones algorithm is the integral image representation (see figure 2.3), which allows constant-time feature evaluation at any location or scale. The value of the integral image II at coordinates (x, y) is equal to the sum of all pixels above and to the left of (x, y), i.e.

    II(x, y) = Σ_{x′≤x, y′≤y} I(x′, y′),      (2.1)

where I is the original image. The sum of the pixel intensities within an arbitrary rectangle in the image can then be computed

with four array references (as shown in figure 2.4). Note that II itself can be computed in one pass over the image using the recurrences

    R(x, y) = R(x, y − 1) + I(x, y),
    II(x, y) = II(x − 1, y) + R(x, y),      (2.2)

where R is the cumulative row sum, R(x, −1) = 0 and II(−1, y) = 0.
However, for the base resolution of the detector (24 × 24 pixels), the total count of these rectangular features is 162,336. Evaluating this complete set would be computationally prohibitively expensive, and unlike the Haar basis, this basis set is many times overcomplete.

2.4.1.2 Ensemble Learning

Viola and Jones proposed that a small number of these features could be chosen to form an effective classifier using boosting techniques, common in machine learning. The actual boosting technique used by Viola and Jones is called AdaBoost (Adaptive Boosting) and was first described by Freund and Shapire in 1995 [15]. Shapire and Singer [37] proved that the training error of a strong classifier obtained using AdaBoost decreases exponentially in the number of rounds. AdaBoost attempts to minimize the overall training error, but for the face detection task it is more important to minimize the false negative rate than the false positive rate (as discussed in section 2.4.1.4). Viola and Jones in 2002 [42] proposed a fix to AdaBoost, called AsymBoost (Asymmetric AdaBoost). The AsymBoost algorithm is specifically designed to be used in classification tasks where the distribution of positive and negative training examples is highly skewed. The precise details and explanations of both the AdaBoost and AsymBoost techniques are given in appendix C.1.1.

2.4.1.3 Weak Classifiers

For the purpose of face detection, decision stump weak classifiers can be used. An individual classifier h_i(x, f, p, θ) takes a Haar-like feature f, a threshold θ and a polarity p, and returns the class of a training example x:

    h_i(x, f, p, θ) = 1 if p·f(x) < p·θ, and 0 otherwise.      (2.3)
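A one-line C# sketch of this decision stump (with the polarity assumed to be encoded as +1 or −1, which is one common convention rather than something stated in the text):

    // Returns 1 (face) if p * f(x) < p * theta, and 0 (non-face) otherwise.
    static int DecisionStump(double featureValue, double threshold, int polarity) =>
        polarity * featureValue < polarity * threshold ? 1 : 0;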

To find a decision stump with the lowest error ε_t for a given training round t, algorithm C.1.2.1 can be used. It is worth noting that the asymptotic time cost to find the best weak classifier for a given training round is O(KN log N), where K is the number of features and N is the number of training examples¹.

Figure 2.5: Decision-making process in the attentional cascade, where a series of classifiers is applied to every sub-window. Due to the immediate rejection property, the number of sub-images that reach the deep layers of the cascade is drastically smaller than the overall count of sub-images.

2.4.1.4 Attentional Cascade

A further observation by Viola and Jones is based on the fact that the face/non-face classes are highly asymmetric, viz. the number of negative sub-images (not containing faces) in a given image is typically overwhelmingly higher than the number of positive sub-images (containing faces). With this insight in mind, it is sensible to focus the initial effort of the detector on eliminating large areas of the image (as not containing faces) using some simple classifiers, with progressively more accurate (and computationally expensive) classifiers focusing on the rare areas of the image that could possibly contain a face. This idea is embodied in the construction of the attentional cascade (see figure 2.5). It is enough for one of the classifiers to reject the sub-image for it to be rejected by the whole detector; the sub-image has to be accepted by all classifiers in the cascade in order to be accepted by the detector. Also, each of the classifiers in the attentional cascade is designed to have a much smaller false negative rate than false positive rate; this provides confidence that when a classifier rejects a sub-image, it is very likely not to have contained a face in the first place. Each of the strong classifiers in the cascade is obtained through boosting. A new classifier in the cascade is trained on the data that all the previous classifiers misclassify, hence in that sense the second classifier in the cascade faces a more difficult and time-consuming task than the first one, and so on. A detailed training algorithm for building a cascaded detector is given in appendix C.1.3.1.
¹ Putting this into perspective, to obtain a single strong classifier containing 100 weak classifiers for 10,000 training examples and 160,000 Haar-like features, O(10¹¹) operations are needed (assuming constant feature evaluation time).
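The accept/reject behaviour of the cascade can be sketched in a few lines of C#; the stage representation below (a list of predicates over a generic sub-window type) is illustrative only, not the structure used in the project.

    // A sub-window is reported as a face only if every stage accepts it; the first
    // rejection discards it immediately, so most sub-windows never reach the later,
    // more expensive stages.
    static bool CascadeAccepts<TWindow>(IReadOnlyList<Func<TWindow, bool>> stages, TWindow window)
    {
        foreach (var stage in stages)
            if (!stage(window))
                return false;   // immediate rejection
        return true;            // accepted by all stages
    }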

Because of this construction, the false positive rate of the overall cascade is

    F = ∏_{i=1}^{K} f_i,      (2.4)

where K is the number of classifiers in the cascade and f_i is the false positive rate of the i-th classifier; similarly, the detection rate of the overall cascade is

    D = ∏_{i=1}^{K} d_i,      (2.5)

where d_i is the detection rate of the i-th classifier².

Due to the large Haar-like feature search space, the specifics of strong-classifier boosting and the false positive training image bootstrapping for new cascade layers, careful consideration of the training framework implementation is required (see appendix C.1.3.1 for a back-of-the-envelope training time estimation for a naïve implementation). Section 3.5 presents the distributed Viola-Jones cascade training implementation and discusses the main methods used to tackle the training time complexity in more detail.

2.4.2 CAMShift Face Tracker

After the face in the image has been localized using the Viola-Jones face detection algorithm, it can be tracked using the CAMShift (Continuously Adaptive Mean Shift) algorithm, first described by Gary Bradski in 1998 [8]. CAMShift is largely based on the mean shift algorithm [16], which is a non-parametric technique that climbs the gradient of a given probability distribution to find the nearest dominant peak (mode). The mean shift algorithm is given in C.2.1.1, and a short proof of mean shift convergence to the mode of the probability distribution can be found in [8]. CAMShift extends the mean shift algorithm by adapting the search window size to the changing probability distribution. The distributions are recomputed for each frame, and the zeroth/first spatial (horizontal and vertical) moments are used to iterate towards the mode of the distribution. This makes the CAMShift algorithm robust enough to track the face when the viewer moves in horizontal, vertical and lateral directions, when the minor facial features (e.g. expressions) change, or when the face is rotated in the camera plane (head roll).

2.4.2.1 Face Probability Distribution

In order to use CAMShift for face tracking, a face probability distribution function (that assigns an individual pixel a probability that it belongs to a face) needs to be constructed.

² Notice that to achieve a detection rate of 0.9 and a false positive rate of 6 × 10⁻⁶ using a 10-stage classifier, each stage has to have a detection rate of 0.99, but a false positive rate of only about 0.3 (i.e. three out of ten non-face images on average are allowed to be misclassified as faces by each strong classifier!).

Figure 2.6: Conical HSV (hue, saturation, value) colour space.

This is done by converting the input video frame into the HSV (hue, saturation, value) colour space (shown in figure 2.6) and building the hue histogram of the region in the image where the face was detected. The main reason for using the hue histogram is the fact that all humans (except albinos) have basically the same skin colour hue (as observed by Bradski and verified in [7]).

The construction of the hue histogram works as follows. Assume that the hue of each pixel is encoded using m bits and h(x, y) is the hue of the pixel with coordinates (x, y). Then the unweighted histogram {q_u}, u = 1...2^m, can be computed using

    q_u = Σ_{(x,y)∈I_d} δ(h(x, y) − u),      (2.6)

where I_d is the detected face region in the video frame. The rescaled histogram {q̂_u}, u = 1...2^m, can be obtained by calculating

    q̂_u = min( q_u / max({q_u}), 1 ).      (2.7)

Then the face probability of a pixel at coordinates (x′, y′) can be calculated using histogram backprojection, i.e.

    Pr(I(x′, y′) belongs to a face) = q̂_{h(x′,y′)}.      (2.8)

An illustration of the face probability calculated using histogram backprojection is shown in figure 2.7.
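A compact C# sketch of equations 2.6–2.8 (assuming the hue channel has already been quantised into integer bin indices, which is an assumption of this sketch rather than something stated above):

    using System;
    using System.Linq;

    // Builds the hue histogram over the detected face region (rx, ry, rw, rh),
    // rescales it to [0, 1], and back-projects it into a per-pixel face probability map.
    static double[,] BackProject(int[,] hue, int rx, int ry, int rw, int rh, int bins)
    {
        var q = new double[bins];
        for (int y = ry; y < ry + rh; y++)             // eq. 2.6: unweighted histogram
            for (int x = rx; x < rx + rw; x++)
                q[hue[y, x]]++;
        double max = q.Max();
        for (int u = 0; u < bins; u++)                 // eq. 2.7: rescale to [0, 1]
            q[u] = Math.Min(q[u] / max, 1.0);
        int h = hue.GetLength(0), w = hue.GetLength(1);
        var prob = new double[h, w];
        for (int y = 0; y < h; y++)                    // eq. 2.8: histogram backprojection
            for (int x = 0; x < w; x++)
                prob[y, x] = q[hue[y, x]];
        return prob;
    }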


Figure 2.7: Face probability image b) obtained from input image a) using a histogram backprojection method. Brighter areas of image b) indicate a higher probability for a pixel to be a part of the face.

2.4.2.2 Centroid Calculation and Algorithm Convergence

After the face probability distribution has been constructed, the CAMShift algorithm uses the zeroth and first moments of the face probability distribution to compute the centroid of the high-probability region (see appendix C.2.2 for precise details). The mean shift component of the CAMShift algorithm continually recomputes the centroid until there is no significant change in its position. Typically, the maximum number of iterations in this process is set between 10 and 20, and since sub-pixel accuracy cannot be observed, a minimum shift of one pixel in either the horizontal or vertical direction is used as the convergence criterion³.
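A minimal sketch of that convergence loop in C# (the meanShiftStep delegate, which returns the centroid of the probability mass inside the current window, is a stand-in for the moment computation of appendix C.2.2):

    static (int X, int Y) TrackWindowCentre((int X, int Y) start,
                                            Func<(int X, int Y), (int X, int Y)> meanShiftStep,
                                            int maxIterations = 15)
    {
        var window = start;
        for (int i = 0; i < maxIterations; i++)
        {
            var centroid = meanShiftStep(window);                      // centroid of the current window
            bool converged = Math.Abs(centroid.X - window.X) < 1 &&    // shift below one pixel:
                             Math.Abs(centroid.Y - window.Y) < 1;      // sub-pixel accuracy is unobservable
            window = centroid;
            if (converged) break;
        }
        return window;
    }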

2.4.3 ViBe Background Subtractor

In order to mitigate one of the main drawbacks of the CAMShift face tracking algorithm, viz. its inability to distinguish an object from the background if they have similar hue, a separate background/foreground segmentation algorithm can be used. The ViBe (Visual Background Extractor) algorithm, as described by Barnich and Van Droogenbroeck [2], is a universal⁴, sample-based background subtraction algorithm.
³ Some care must also be taken to ensure that the algorithm terminates when the search window does not contain any pixels with a non-zero face probability, i.e. when the zeroth moment is equal to zero.
⁴ In the sense that the algorithm itself makes no assumptions about the video stream frame rate, colour space, scene content, the background itself or its variability over time.


Figure 2.8: Comparison of a pixel value v(x) with a set of samples M(x) = {v1, v2, ..., v5} in a two-dimensional Euclidean colour space (C1, C2). The pixel value v(x) is classified as background if the number of samples in M(x) that fall within the sphere S_R(v(x)) is greater than or equal to the threshold #min.

2.4.3.1 Background Model and Classification

In ViBe, an individual background pixel x is modelled using a collection of N observed pixel values, i.e.

    M(x) = {v1, v2, ..., vN},      (2.9)

where vi is a background sample value with index i (taken in the previous frames⁵). Let v(x) be the value of pixel x in a given colour space; then x can be classified based on its corresponding model M(x) by comparing it to the closest values within the set of samples in the following way. Define S_R(v(x)) to be a hypersphere of radius R in the given colour space, centred on v(x). The pixel value v(x) is classified as background if

    |S_R(v(x)) ∩ M(x)| ≥ #min,      (2.10)

where #min is the classification threshold (see figure 2.8). Barnich and Van Droogenbroeck [2] have empirically established the appropriate parameter values as #min = 2 and R = 20 for monochromatic images. The purpose of using a collection of samples is to reduce the influence of outliers. To this end, an insight made by Barnich and Van Droogenbroeck is that classification of a new pixel value with respect to its immediate neighbourhood in the colour space estimates the distribution of the background pixels more reliably than typical statistical parameter estimation techniques applied to a much larger number of samples.
⁵ See appendix C.3.1 for precise details on the model initialization for the first frame of the video sequence.
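The classification rule of equations 2.9–2.10 is only a few lines of C# for a monochromatic pixel; this sketch uses the #min = 2 and R = 20 values quoted above as defaults.

    // Background test for a single pixel: at least minMatches model samples must lie
    // within distance radius of the observed value.
    static bool IsBackground(byte value, byte[] samples, int radius = 20, int minMatches = 2)
    {
        int matches = 0;
        foreach (byte s in samples)
            if (Math.Abs(value - s) <= radius && ++matches >= minMatches)
                return true;    // early exit: enough close samples found
        return false;           // otherwise classified as foreground
    }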


Figure 2.9: An example ViBe background model update sequence demonstrating a fast model recovery
in presence of a ghost (a set of connected points, detected as in motion, but not corresponding to any real moving object [38]) and a slow incorporation of real moving objects into the background model.

2.4.3.2 Background Model Update

The background model update method used in ViBe provides three important features (sketched in code after this list):
1. a memoryless update policy (to ensure an exponential monotonic decay of the remaining lifespan of the individual samples stored in the background models),
2. random time subsampling (to ensure that the time windows covered by the background pixel models are extended),
3. a mechanism that propagates background pixel samples spatially (to ensure spatial consistency and to allow the adaptation of background pixel models that are masked by the foreground).
Precise details on the background model update (as shown in figure 2.9) are given in appendix C.3.2.
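The following C# sketch illustrates how those three mechanisms can fit together for a single background-classified pixel; the time-subsampling factor of 16 is the value conventionally used with ViBe and is an assumption here, not a figure taken from the text.

    static void UpdateBackgroundModel(byte value, byte[][] models, int pixel,
                                      int[] neighbours, Random rng)
    {
        const int subsamplingFactor = 16;                           // assumed conventional value
        if (rng.Next(subsamplingFactor) == 0)                       // random time subsampling
            models[pixel][rng.Next(models[pixel].Length)] = value;  // memoryless: overwrite a random sample
        if (rng.Next(subsamplingFactor) == 0)                       // spatial propagation
        {
            int n = neighbours[rng.Next(neighbours.Length)];        // pick a random neighbouring pixel
            models[n][rng.Next(models[n].Length)] = value;          // seed its model with this value
        }
    }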

2.5 Depth-Based Methods

Akin to the colour-based face detection and tracking approach, a similar two-step process is used for viewer head tracking based on the depth data provided by Kinect. Namely, the tracking process is split into head detection using the Peters and Garstka method [17] and head tracking using a modified CAMShift algorithm. More details on both of these methods are given in the subsections below.


Figure 2.10: Viola-Jones integral image based real-time depth image smoothing. Input depth image
is preprocessed by a) removing depth shadows, and then is smoothed using b) r = 2, c) r = 4 and d) r = 8, where r is the side length of the averaging rectangle.


Figure 2.11: Kinect depth shadow removal. Images a) and b) show the aligned colour and depth
input images from Kinect. Blue areas in the input depth image b) indicate the regions where no depth data is present; image c) is the resulting depth image after depth shadow removal.

2.5.1 Peters-Garstka Head Detector

In 2011, Peters and Garstka [17] introduced a novel approach for head detection and tracking using depth images. Their approach consists of three main steps:
- preprocessing of the depth data provided by Microsoft Kinect (depth shadow and noise elimination), described in detail in appendix D.1 and briefly illustrated in figures 2.10 and 2.11,
- detection of local minima in the depth image and the use of the surrounding gradients in order to identify a head (based on certain prior knowledge about adult head size), discussed below,
- postprocessing of the head location, discussed in section 3.6.6.

2.5.1.1 Head Detection

After obtaining a smoothed depth image with depth shadows eliminated (as described in appendix D.1), the viewer's head can be detected using prior knowledge about the typical adult human head size (20 cm × 15 cm × 25 cm, length × width × height) and its shape. Under the assumption that the head is in an upright position, but its orientation with respect to the camera is not known, the inner horizontal bound of the head is chosen to be 10 cm and the outer horizontal bound is chosen to be 25 cm. Note that for a given object with dimensions w × h at a distance d from the Kinect sensor, the width p_w and height p_h in pixels of the area that it occupies on the screen can be calculated using basic trigonometry, i.e.

    (p_w, p_h) = ( w·r_w / (d·2·tan(f_w/2)),  h·r_h / (d·2·tan(f_h/2)) ),      (2.11)


Figure 2.12: Prior assumptions about the human head shape that are used to detect head-like objects in depth images. The light blue dot is a local minimum on a horizontal scan line.

where (r_w, r_h) is the resolution of the depth image and (f_w, f_h) is the horizontal/vertical field of view of the depth camera⁶. Using equation 2.11, the inner and outer bounds can be defined as

    b_i(d) = (10 cm · 320 px) / (d · 2·tan(58°/2)) ≈ 28,865/d px,
    b_o(d) = (25 cm · 320 px) / (d · 2·tan(58°/2)) ≈ 72,162/d px,      (2.12)

where d is the depth in millimetres. Then for each horizontal line v*, consider a local minimum point u* for which
- all depth values within the inner bounds have a smaller depth difference than 10 cm from u*, and
- depth values at the outer bounds have a larger depth difference than 20 cm from u* (see figure 2.12 for an illustration).

More formally, find the point u* on the line v* such that I_r(u*, v*) is a local minimum, and the inequalities

    I_r(u* + f, v*) − I_r(u*, v*) < 10 cm,
    I_r(u* − f, v*) − I_r(u*, v*) < 10 cm      (2.13)

⁶ PrimeSense PS1080 SoC Reference Design 1.081 (http://www.primesense.com/en/press-room/resources/file/4-primesensor-data-sheet) states a 58° horizontal and 45° vertical field of view, which nearly corresponds to Peters and Garstka's empirically measured horizontal FOV of 61.66° [17].

hold for all f ∈ {1, 2, ..., b_i(I_r(u*, v*))/2}, and

    I_r(u* + b_o(I_r(u*, v*))/2, v*) − I_r(u*, v*) > 20 cm,
    I_r(u* − b_o(I_r(u*, v*))/2, v*) − I_r(u*, v*) > 20 cm.      (2.14)

To match the sides of the head and the vertical head axis more accurately, for each local minimum u* satisfying the criteria above, calculate the positions u1 and u2 of the lateral gradients where the 20 cm threshold difference to the local minimum is exceeded, i.e. find u1 and u2 such that

    I_r(u1, v*) − I_r(u*, v*) ≤ 20 cm,   I_r(u1 − 1, v*) − I_r(u*, v*) > 20 cm,
    I_r(u2, v*) − I_r(u*, v*) ≤ 20 cm,   I_r(u2 + 1, v*) − I_r(u*, v*) > 20 cm,      (2.15)

and use the arithmetic mean ū(v*) = (u1 + u2)/2 as a possible point on the vertical head axis.
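A C# sketch of this per-scan-line test is shown below. It assumes a 320-pixel-wide depth image, a 58° horizontal field of view and depth values in millimetres (as reported by Kinect); array bounds checks are omitted for brevity, and the helper names are illustrative rather than the project's actual API.

    // Pixel width of an object of physical width widthMm seen at depth depthMm (eqs. 2.11-2.12).
    static int WidthInPixels(int widthMm, int depthMm) =>
        (int)(widthMm * 320.0 / (depthMm * 2.0 * Math.Tan(29.0 * Math.PI / 180.0)));

    // Candidate head-axis point for a local depth minimum at column u on scan line v, or null.
    static int? HeadAxisCandidate(int[,] depth, int u, int v)
    {
        int d0 = depth[v, u];
        int inner = WidthInPixels(100, d0);                 // 10 cm inner bound, b_i(d)
        int outer = WidthInPixels(250, d0);                 // 25 cm outer bound, b_o(d)

        for (int f = 1; f <= inner / 2; f++)                // eq. 2.13: flat within the inner bound
            if (depth[v, u + f] - d0 >= 100 || depth[v, u - f] - d0 >= 100)
                return null;
        if (depth[v, u + outer / 2] - d0 <= 200 ||          // eq. 2.14: steep at the outer bound
            depth[v, u - outer / 2] - d0 <= 200)
            return null;

        int u1 = u, u2 = u;                                 // eq. 2.15: lateral 20 cm gradients
        while (depth[v, u1 - 1] - d0 <= 200) u1--;
        while (depth[v, u2 + 1] - d0 <= 200) u2++;
        return (u1 + u2) / 2;                               // possible point on the vertical head axis
    }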

Furthermore, assume that the head height should be at least 25 cm. To express the required head height in pixels, let n be the number of subsequent lines on which the points ū are found. If ū(v*) is found for the current line v*, increment n; otherwise set n = 0. The average distance to the points found in the last n subsequent lines can be calculated using

    d̄ = (1/n) Σ_{i=0}^{n} I_r(ū(v* − i), v* − i),      (2.16)

and the number of lines required for this average distance is

    n_max = (25 cm · 240 px) / (d̄ · 2·tan(45°/2)) ≈ 72,426/d̄ px.      (2.17)

If n ≥ n_max, then the center of the head is treated as detected at coordinates

    (x_c, y_c) = ( (1/n) Σ_{i=0}^{n−1} ū(v* − i),  v* − n/2 ),      (2.18)

where v* is the current horizontal line. An example result of head detection using this method is shown in figure 2.13.


Figure 2.13: Head detection using Garstka and Peters approach. Image a) shows the detected head
rectangle (in yellow) overlaid on top of the colour input image, image b) shows the detected head rectangle overlaid over the smoothed (using r = 2) depth image with depth shadows removed. In both images, white pixels represent the local horizontal minima which satisfy inequalities 2.13 and 2.14.

2.5.2 Depth-Based Head Tracker

After the head is localized by the Garstka and Peters head detector, a modified CAMShift algorithm is used to track the head. The motivation for this approach stems from the fact that one of the main assumptions made by Garstka and Peters (viz. that there is a single head-like object present in the depth frame) ceases to hold in an unconstrained environment. This assumption breakdown would result in an incorrect localization of the vertical head axis, and subsequently in a lost track of the head.

To mitigate this problem, the criteria that Garstka and Peters used to reject those horizontal local minima which could not possibly lie on a vertical head axis (equations 2.13, 2.14) are now used to obtain the face probability in the CAMShift tracker. More precisely, instead of using histogram backprojection to obtain Pr(I(x′, y′) belongs to a face), define a degenerate face probability

    Pr(I(x′, y′) belongs to a face) = 1 if I_r(x′, y′) is a local minimum on the line y′ and inequalities 2.13 and 2.14 hold, and 0 otherwise.      (2.19)

Since the only non-zero probability pixels in the search window are likely to be positioned on the vertical head axis, the function that is used to obtain the size of the next search window is also updated to s = 4·√M00 (where the multiplicative constant was established empirically). This re-definition of the face probability ensures that even when there are some other head-like objects present in the frame, the CAMShift algorithm will keep tracking the head which was


Figure 2.14: Head tracking using depth information. Images a), b) and c) show the head rectangle (in yellow) overlaid on top of the colour input image. White pixels represent non-zero face probabilities derived from the depth image using prior knowledge about the human head shape (section 2.5.1), which are then tracked using the CAMShift algorithm (section 2.4.2).

initially detected. An example of this method in action is shown in figure 2.14. Since the rest of the depth-based head-tracking algorithm continues in the same manner as the colour-based face tracking using CAMShift, the remaining details can be found in section 2.4.2.
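In code, the change to CAMShift amounts to swapping the probability source and the window-size rule; the sketch below reuses the hypothetical HeadAxisCandidate-style predicate from section 2.5.1.1 and is illustrative only.

    // Degenerate face probability of equation 2.19: 1 on detected head-axis pixels, 0 elsewhere.
    static double DepthFaceProbability(int x, int y, Func<int, int, bool> isHeadAxisPixel) =>
        isHeadAxisPixel(x, y) ? 1.0 : 0.0;

    // Side length of the next CAMShift search window, s = 4 * sqrt(M00),
    // where M00 is the zeroth moment of the (binary) probability image inside the window.
    static double NextWindowSize(double zerothMoment) => 4.0 * Math.Sqrt(zerothMoment);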

2.6 Summary

Agile methodologies have been applied in the project's requirements analysis (and later, in the project's design, implementation and testing). During the planning and research phase, the main requirements of the overall system were broken down into i) a real-time head-tracking-in-3D library, and ii) a 3D game simulating pictorial and motion parallax depth cues. The former requirement was identified (during the risk analysis) as carrying the highest uncertainty, hence a very significant amount of time was spent researching face/head detection and tracking techniques. Ultimately, a combination of de facto standard methods in the industry (like Viola-Jones face detection and CAMShift face tracking), novel techniques (like ViBe background subtraction and Peters-Garstka depth-based head detection) and self-designed methods (like depth-based head tracking) was chosen for the project.

Chapter 3 Implementation
This chapter provides details on how the algorithms and theory from Chapter 2 are implemented to achieve the main project aims. It starts by discussing the development environment, languages and tools used, then introduces a high-level architectural break-down of the system into large components (Viola-Jones Face Detector, Head-Tracking Library and a 3D Display Simulator), and finally discusses the implementation of these individual components.

3.1 Development Strategy

Early in the project, a decision was made to implement all required algorithms and methods from scratch. While there are numerous open-source computer vision libraries, they are developed primarily to deal with colour data (e.g. OpenCV), and isolating only the face detection and tracking routines from these large libraries is a complex and time-consuming task. It was deemed that extending these large libraries in multiple different ways to use depth information (see [10] for Viola-Jones extensions using depth, or section 2.5.1 for depth-based head detection and tracking) involves higher risk than implementing a single-purpose, cohesive head tracker.

3.2 Languages and Tools

3.2.1 Libraries

To obtain the depth and colour data from Microsoft Kinect, the free (for non-commercial projects) Kinect SDK 1.0 Beta 2 library [33] (released by Microsoft in November 2011) was chosen. While there are some alternative open-source libraries that can extract depth and colour data from the Kinect sensor, they are not officially supported by the manufacturer and hence were not used, to avoid various compatibility issues. To render the depth cues, the OpenGL library was chosen as the de facto industry standard for 3D graphics.


Figure 3.1: (Partial) GANTT chart showing the project's status as of 18/12/2011.


3.2.2 Development Language

C# was chosen as the main development language because it provides the typical advantages of a third-generation programming language (machine independence, human readability) combined with object-oriented programming benefits (cohesive and decoupled program modules, a clear separation between contracts and implementation details, code re-use through inheritance and polymorphism, and so on). It also has a number of advanced programming constructs, like events, delegates, extension/generator methods, SQL-like native data querying capabilities and lambda expressions. Furthermore, it provides features that are missing from Java (like value types, operator overloading or reified generics) and has stronger tool support for GUI development. Finally, it is supported by the Kinect SDK (which targets .NET Framework 4.0) and OpenGL (using a wrapper for C# called OpenTK [34]).

3.2.3 Development Environment

One of the requirements for Kinect SDK is Windows 7/8 OS, hence a Windows-based development environment had to be chosen. Since Visual Studio IDE fully supports the development language C#, and also features built-in code versioning, code testing, and GUI development environments, it was used across the whole project.

3.2.4 Code Versioning and Backup Policy

The Apache Subversion (SVN) code versioning system was used for source code and dissertation version control, and for immediate back-up. A centralized storage space for SVN was set up in the PWF file system. Also, a weekly backup strategy was established, where the source code and the dissertation were mirrored once a week on two 16 GB USB flash drives to protect against data loss.

3.3 Implementation Milestones

The milestones of the project proposal (see appendix I) were carefully followed (except for a single design change outlined in the progress report, viz. replacing the 3D hemisphere-fitting depth tracker [44] with the Garstka and Peters depth tracker). Any minor delays were covered by the slack time planned in the project proposal. A snapshot of the project status as of 18/12/11 is shown in the GANTT chart in figure 3.1.


3.4 High-Level Architecture

The overall project can be split into the independent development components shown in figure 3.2. Each of these components is discussed in more detail in the following sections.

Figure 3.2: UML 2.0 component diagram of the system's high-level architecture.

3.5 Viola-Jones Detector Distributed Training Framework

According to Viola and Jones [41], the training time for their 32-layer detector was in the order of weeks. Similarly, according to [27], "the training of the cascade which is used by the detector turned out to be very time consuming and the [17-layer] cascade was never completed". To mitigate the time complexity of the detector cascade training, a decision was made to exploit the processing power of the PWF (Public Workstation Facility) machines¹ available at the University of Cambridge Computer Laboratory's Intel lab. A distributed training framework targeting the Microsoft .NET 2.0 framework (available on PWF) was designed and implemented,
¹ Running Windows XP OS on an Intel Core 2 Q9550 Quad CPU @ 2.83 GHz with 3.21 GB of RAM.


Figure 3.3: PWF machines at the University of Cambridge Computer Laboratory's Intel Lab training a Viola-Jones detector in a distributed fashion. Special care was taken to ensure that PWF machines would only be used for training when they were not needed by other people (i.e. most of the training was done during the weekends and term breaks), and that training would not interfere with regular PWF user log-ons.

which trained a 22-layer cascade containing 1,828 decision stumps in 20 hours, 15 minutes and 2 seconds. While the performance of the training framework is further discussed in section 4.1.2, it is worth mentioning that the best-performing rectangle feature selection time was reduced from nearly 16 minutes in a naïve single-threaded, single-CPU implementation (which would require more than three weeks to train a 1,828-feature cascade) to an average of 38.39 seconds per feature in a distributed, multi-threaded implementation using 65 CPU cores². The two most time-consuming tasks were parallelized: best weak classifier selection (out of 162,336 rectangle features) when building a strong classifier, and the false positive training image bootstrapping for each layer of the cascade³.
² The amount of parallel processing was limited by the number of simultaneous logins (19) allowed by the PWF security policy. Out of 19 machines, 18 were running 4 training client instances each; one additional machine was running one server instance and one training client instance.
³ The computational complexity of these tasks is best illustrated by the numbers: each of the 162,336 rectangle features has to be evaluated on each of the 9,916 training images (as described in section 4.1.1), and the best-performing decision stump has to be selected out of those. This process of adding best-performing weak classifiers has to be repeated until the individual layer false positive rate and detection rate objectives are met, and new layers have to be added until all training data is learned (in total, 1,828 decision stumps were added). Similarly, 5,000 false positive training images have to be bootstrapped for each new layer of the cascade; as the cascade grows, the effort required to find false positive images increases exponentially.


Figure 3.4: UML 2.0 deployment diagram of the Viola-Jones distributed training framework architecture.

The architecture and the implementation details of this distributed training framework are described below.

3.5.1 Architecture

To provide a better understanding of how the tasks and the main training data are physically distributed, a deployment diagram of the distributed training framework is shown in figure 3.4. As the diagram shows, two separate communication channels are used: TCP/IP and CIFS (Common Internet File System, also known as SMB, Server Message Block). The framework uses a standard client-server architecture with a star topology (with the server in the centre). This arrangement greatly simplifies work coordination and makes it easier to ensure strict consistency of the training results. To avoid bottlenecking the server's Ethernet link, the following rule of thumb is applied: short messages between the server and the clients are transmitted over TCP/IP, while CIFS is used for large data exchanges.
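As a minimal illustration of this rule of thumb (hypothetical command string, host name, port and share path, not the framework's actual protocol), a client might send a short control message over TCP/IP while reading bulk training data directly from the CIFS share via a UNC path:

    using System;
    using System.IO;
    using System.Net.Sockets;
    using System.Text;

    class ChannelSplitSketch
    {
        static void Main()
        {
            // Short control message over TCP/IP (hypothetical command, host name and port).
            using (TcpClient client = new TcpClient("training-server", 9000))
            using (NetworkStream stream = client.GetStream())
            {
                byte[] command = Encoding.ASCII.GetBytes("BOOTSTRAP-LAYER 12\n");
                stream.Write(command, 0, command.Length);
            }

            // Large data exchange over CIFS: read a negative training image directly
            // from the file share via a UNC path (hypothetical path).
            byte[] image = File.ReadAllBytes(@"\\ds-filestore\training\negatives\img-0001.bmp");
            Console.WriteLine("Read {0} bytes over CIFS", image.Length);
        }
    }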

3.5.2 Class Structure

Due to space constraints, a detailed class structure of the framework is given in appendix E. In particular, the class diagram4 is shown in figures E.1 and E.2, and, while the purpose of the classes should be self-explanatory from the method signatures, the main responsibilities of the most important individual classes are given in table E.1.

4 Note that all class and component diagrams given in this chapter have been simplified for the convenience of the reader. The implementation follows the agile methodologies' self-documenting code principle, hence the author hopes that some insight into the purpose and responsibilities of individual classes/components can be obtained by examining the names and signatures of the functions that they provide.

All classes were implemented using a defensive programming technique. This proved to be crucially important, since machines repeatedly lost CIFS connections to DS-Filestore, experienced TCP/IP connection time-outs under high network load, were forcefully restarted both to install updates and by other Intel lab users, and so on.

3.5.3 Behaviour

The high-level communication sequence between the server and the clients is given in figure 3.5.

Figure 3.5: UML 2.0 sequence diagram of the high-level communications overview between the server and the clients in the Viola-Jones distributed training framework. The bounded "while" rectangle corresponds to line 4 of the Build-Cascade algorithm given in C.1.3.1.

The two most time-consuming tasks (false positive training image bootstrapping and weak classifier boosting using AsymBoost) were both multi-threaded and distributed between the clients. The interactions between the clients while performing these tasks are coordinated in the following way: immediately after the connection is established between the server and the client, the server sends the client the indices of the high-resolution negative training images5 which that particular client should use to bootstrap detector-resolution false positive training images for each layer of the cascade.
5 As shown in the deployment diagram 3.4, all negative training images reside on DS-Filestore and are accessed through CIFS.

After receiving the "start false positive image bootstrapping" command, a client obtains a copy of the current detector cascade and repeatedly executes algorithm 3.5.3.1.

Algorithm 3.5.3.1 Single false positive training image bootstrapping. It requires a high-resolution negative training image Ii, an exhaustive array of triples Ai = [(x0, y0, size0), ..., (xn, yn, sizen)] describing all possible locations and sizes of bootstrapping samples for image Ii, and a current detector cascade Ct(x). The result of this algorithm is either a single false positive training image, or Nil if no such image could be found.

False-Positive-Training-Image-Bootstrapping(Ii, Ai, Ct(x))
 1  while Ai.length > 0
 2      // Generate a random sample index.
 3      r <- Random-Between(0, Ai.length - 1)
 4      // Acquire and resize the selected sample.
 5      x_current <- Resize(Sample(Ii, Ai[r]), Base-Resolution)
 6      // If the negative sample is misclassified as a face, return it.
 7      if Ct(x_current) = 1
 8          return x_current
 9      // Otherwise, put the last sample into the current sample's place and
10      // decrement the array length marker.
11      Ai[r] <- Ai[Ai.length - 1]
12      Ai.length <- Ai.length - 1
13  return Nil

When a false positive training image is bootstrapped, its standard deviation σ is calculated and stored in the NegativeTrainingImage class. The standard deviation is then used to inversely scale the values of rectangle features, normalizing the variance of all false positive training images and hence minimizing the effect of different lighting conditions. It is worth mentioning that σ can be efficiently calculated using the integral image technique (see 2.4.1): define I² to be the squared integral image; then

    σ = sqrt(E[I²] − (E[I])²),

where E[I²] = I²(h, w)/(hw) and E[I] = I(h, w)/(hw), with h, w being the height and width (respectively) of the false positive training image.

Each bootstrapped false positive training image is then sent back to the server, which assembles them into a new negative training image set. Figure 3.6 explains this interaction pictorially. Similarly, figure 3.7 shows the interactions between the server and the clients in the weak classifier boosting task (based on the AsymBoost algorithm described in section C.1.1.1).

Figure 3.6: UML 2.0 interaction overview diagram of the distributed false positive training image bootstrapping.

Figure 3.7: UML 2.0 interaction overview diagram of the distributed weak classifier boosting using the AdaBoost algorithm (see C.1.1.1) with the AsymBoost extension (see C.1.1.1).

3.6 Head-Tracking Library

A good initial insight into the implementation details of the HT3D (Head-Tracking in 3D) library can be obtained by observing the data flow between its various components, as demonstrated in figure 3.8. As shown there, the HT3D internal implementation follows a highly modular design, with cohesive, single-purpose components, arranged in a star topology and de-coupled from each other. These design features enable a high degree of flexibility in choosing which information sources should be used for viewer's head tracking and how they should be arranged (this proved to be crucial for the evaluation chapter, in which the performances of different trackers with different features enabled were compared). Furthermore, this design simplified the interchange of components (as shown by the colour-based background subtractor example) and streamlined testability.

Figure 3.8: Data flow diagram of the implemented HT3D library. Numbers on the arrows indicate the order in which data is passed in a typical head-tracking communication sequence.

The individual components of the HT3D library are further discussed below.

3.6.1 Head-Tracker Core

As shown in the data flow diagram (figure 2.1), the head-tracker core orchestrates the various individual head-tracking components and exposes the HT3D library API to the end-user. A detailed head-tracker core class diagram is given in figure E.3, and the implementation-wise responsibilities of the most important classes are discussed in detail in table E.2.1. Most importantly, the head-tracker combines the colour- and depth-based tracker (discussed below) outputs using algorithm 3.6.1.1.

Algorithm 3.6.1.1 Combining colour- and depth-based tracker predictions. Given the inputs C and D (colour- and depth-tracker output rectangles respectively), this algorithm returns the combined head center coordinates (or Nil, in case of tracking failure).

Combine-Trackers(C, D)
 1  if C ≠ Nil and D ≠ Nil
 2      if |C ∩ D| ≠ 0
 3          return Rectangle-Center(Average-Rectangle(C, D))
 4      else
 5          Reset colour and depth trackers to the detecting state.
 6          return Nil
 7  else
 8      if D ≠ Nil
 9          return Rectangle-Center(D)
10      if C ≠ Nil
11          return Rectangle-Center(C)
12      return Nil
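A minimal C# sketch of this combination logic is given below (hypothetical types and method names, not the actual HeadTracker API); null plays the role of the empty result.

    using System;
    using System.Drawing;

    static class TrackerCombination
    {
        // Returns the combined head centre, or null on tracking failure.
        public static PointF? Combine(RectangleF? colour, RectangleF? depth, Action resetToDetecting)
        {
            if (colour.HasValue && depth.HasValue)
            {
                RectangleF overlap = RectangleF.Intersect(colour.Value, depth.Value);
                if (overlap.Width > 0 && overlap.Height > 0)
                {
                    // Both trackers agree: return the centre of the averaged rectangle.
                    PointF c = Center(colour.Value), d = Center(depth.Value);
                    return new PointF((c.X + d.X) / 2f, (c.Y + d.Y) / 2f);
                }
                resetToDetecting();   // contradictory outputs: fall back to the detecting state
                return null;
            }
            if (depth.HasValue) return Center(depth.Value);
            if (colour.HasValue) return Center(colour.Value);
            return null;
        }

        private static PointF Center(RectangleF r)
        {
            return new PointF(r.X + r.Width / 2f, r.Y + r.Height / 2f);
        }
    }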

From the user's point of view, the head-tracker core API exposes the following tracking output (via HeadTrackFrameReadyEventArgs):
1. Tracking images, rendering one of the options shown in figure 3.9,
2. Detected face image (from the Viola-Jones face detector),
3. Tracked head rectangles (from the colour and depth trackers),
4. Combined head center position in pixels,
5. Combined head center position in space w.r.t. the Kinect sensor.
The HeadTracker class also exposes a number of head-tracking settings (as shown in figure E.3), allowing the user to tweak the detection and tracking components.

Figure 3.9: HT3D image frame rendering options, enabled by executing headTracker.EnabledRenderingCapabilities[c] = true: a) c = COLOUR FRAME, b) c = DEPTH FRAME, c) c = HISTOGRAM BACKPROJECTION, d) c = BACKGROUND SUBTRACTION, e) c = DEPTH FACE PROBABILITY.

3.6.2 Colour-Based Face Detector

The colour-based face detector component in the HT3D library is mainly responsible for localizing the viewer's face in colour images using a trained Viola-Jones cascade. For this reason, a large part of the distributed Viola-Jones training framework code is reused (in particular, the NormalizedTrainingImage, StrongLearner and StrongLearnerCascade classes, together with the RectangleFeature class hierarchy), as shown in figure 3.10. A new ViolaJonesFaceDetector class is added, with the main responsibilities of:
- Deserializing the strong learner cascade from XML (obtained from the distributed training framework),
- Providing means to adjust the learner cascade coefficients (pre-multiplying each layer's threshold with a given constant),
- Detecting the viewer's face given the input colour and depth images.

While the implementation of the first two responsibilities is trivial, the latter deserves some further attention. As discussed by Burgin et al. [10], cues present in depth data can be used to make face detection faster and more accurate. In particular, the face search space can be reduced from exploring multiple scales at each pixel, to searching for only plausible face sizes at a pixel, given its distance from the camera. This optimization to the exhaustive search is implemented as follows:

Figure 3.10: UML 2.0 class diagram of the colour-based face detector component of HT3D library.

1. Given the aligned colour and depth images (provided by the Microsoft Kinect SDK), iterate through the pixels in the colour image using a step size of 3 px.
2. For each pixel, assume that a potential face is centred there. Set the face height upper and lower bounds to 40 cm and 20 cm respectively, and use equation 2.11 to estimate the face height upper (hu) and lower (hl) bounds in pixels.
3. Run the Viola-Jones face detector starting at the hl resolution, using a scaling factor s = 1.075 to increment the resolution, until the hu upper bound is reached or a face is detected.

One important point in the detection algorithm outlined above is that face detection is triggered as soon as one of the sub-windows in the image passes through the detector cascade. The search is terminated immediately at that point because of the main assumption given in the problem constraints (viz. that only a single viewer is present). Another point to note is that scaling is achieved by scaling the face detector itself, and not the input image. More precisely, given a weak classifier as described in section 2.4.1.3, its scaled and variance-normalized version h_{i,s,σ}, which takes a Haar-like feature f, a threshold θ and a polarity p, and returns the class of an input image x (where s is the scale and σ is the standard deviation of the input image), can be defined as

    h_{i,s,σ}(x, f, p, θ) = { 1  if p f(x) < s² σ p θ,
                            { 0  otherwise.                         (3.1)
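The following sketch (a hypothetical helper, not the actual ViolaJonesFaceDetector code) shows how a single decision stump would be evaluated under equation 3.1, with the threshold scaled by s² and σ instead of rescaling the image:

    static class ScaledStump
    {
        // featureValue: f(x), the Haar-like feature evaluated on the (unscaled) integral image;
        // polarity: +1 or -1; theta: the threshold learned at the 24 x 24 base resolution;
        // scale: s; sigma: the standard deviation of the sub-window.
        public static int Evaluate(double featureValue, int polarity, double theta,
                                   double scale, double sigma)
        {
            // Scaling the detector multiplies rectangle sums by roughly s^2, and dividing by
            // sigma normalizes for lighting, so the threshold is rescaled to match.
            return (polarity * featureValue < scale * scale * sigma * polarity * theta) ? 1 : 0;
        }
    }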

3.6.3 Colour-Based Face Tracker

Figure 3.11: UML 2.0 class diagram of the colour-based face tracker component of HT3D library.

The class diagram of the colour-based face tracker is shown in figure 3.11. The main responsibilities of the CamShiftFaceTracker and CamShiftFaceTrackerUsingSaturation classes are:
- Computing the face probability distribution (described in detail in 2.4.2.1),
- Calculating the face centroid and the search window size (described in C.2.2),
- Generating a face probability bitmap, as shown in figure 2.7.

While the implementation of these responsibilities closely follows the theory given in the relevant subsections of section 2.4.2, two main implemented extensions are worth mentioning separately:
- Tracking using a two-dimensional histogram in the hue-saturation colour space (based on the ideas in [1]). This extension is implemented to mitigate one of the well-known deficiencies of the CAMShift algorithm, viz. the inclusion of the background region if it has a similar hue to the object being tracked. Figure 3.12 shows the probability images obtained using one- and two-dimensional histogram backprojections in equivalent tracking conditions. Since the CamShiftFaceTrackerUsingSaturation class inherits from CamShiftFaceTracker, most of the standard CAMShift tracker code is reused and, more importantly, the extended tracker becomes interchangeable in place of the old one because of inheritance covariance.
- As described by Bradski in [8], a large amount of hue noise in HSV space is introduced when the brightness is low (as can be seen from figure 2.6). Similarly, small changes in the colour of low-saturated pixels in RGB space can lead to large swings in hue. For this reason, brightness (value) and saturation cut-off thresholds (v and s respectively) are introduced: if the brightness or saturation of a given pixel is below these thresholds, the pixel is ignored when building the colour histograms (a sketch of this step follows figure 3.12).

Figure 3.12: Probability images obtained from an input image b) using c) hue and d) hue-saturation histograms (brighter colour indicates a higher probability for the pixel to be part of the face; both histograms are initialized with a) the output of the Viola-Jones detector shrunk by 20%). As shown in picture d), using a two-dimensional histogram built in the hue-saturation colour space allows the tracker to maintain track of the object even when the background has a similar hue.
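A minimal sketch of the histogram-building step with these cut-offs is shown below (hypothetical bin counts and array layout, not the actual CamShiftFaceTrackerUsingSaturation code); pixels whose value or saturation fall below the thresholds simply do not contribute.

    using System;

    static class HueSaturationHistogram
    {
        // hue in [0, 360), saturation and value in [0, 255]; returns a hueBins x satBins histogram.
        public static int[,] Build(float[] hue, byte[] saturation, byte[] value,
                                   int hueBins, int satBins,
                                   byte saturationThreshold, byte valueThreshold)
        {
            int[,] histogram = new int[hueBins, satBins];
            for (int i = 0; i < hue.Length; i++)
            {
                // Ignore pixels whose hue is unreliable (too dark or too unsaturated).
                if (value[i] < valueThreshold || saturation[i] < saturationThreshold)
                    continue;

                int h = Math.Min(hueBins - 1, (int)(hue[i] / 360f * hueBins));
                int s = Math.Min(satBins - 1, saturation[i] * satBins / 256);
                histogram[h, s]++;
            }
            return histogram;
        }
    }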

3.6.4 Colour- and Depth-Based Background Subtractors

Figure 3.13: UML 2.0 class diagram of the colour and depth background subtractor components of HT3D library.

Figure 3.14: Depth-based background subtractor operation. If the depth-based head tracker is locked onto the viewer's head in the input image a) (yellow rectangle), then the image can be segmented into background and foreground using the pixels' distance from Kinect as the decision criterion. In particular, if a given pixel is further away than the viewer's head center, it is classified as background (black); otherwise it is classified as part of the foreground (white), as shown in image b).

Colour- and depth-based background subtractors share a common abstract ancestor class BackgroundSubtractor, which is responsible for creating a background segmentation bitmap (using concrete background subtractor implementations) given a certain background subtraction sensitivity (again, dependent on the concrete implementation). Because of this design, all background subtractors are interchangeable, and mock background subtractors can be used to test the library. Two concrete colour-based subtractors are implemented: a ViBe background subtractor (ViBeBackgroundSubtractor class) and a Euclidean distance thresholding based background subtractor (EuclideanBackgroundSubtractor class). Similarly, a depth-based DepthBackgroundSubtractor class is implemented, albeit serving a slightly different purpose: to increase the speed and the accuracy of the colour-based face detector and tracker6.
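As a sketch of the depth-based criterion from figure 3.14 (a hypothetical method, not the actual DepthBackgroundSubtractor implementation), a pixel is kept as foreground only if it has a depth reading and is not farther from the sensor than the tracked head centre plus a sensitivity margin:

    static class DepthBackgroundSubtraction
    {
        // depthMm: per-pixel depth in millimetres (0 = no reading); headCenterDepthMm: the depth
        // of the tracked head centre; marginMm: a sensitivity margin. Returns a foreground mask.
        public static bool[] Segment(short[] depthMm, short headCenterDepthMm, short marginMm)
        {
            bool[] foreground = new bool[depthMm.Length];
            for (int i = 0; i < depthMm.Length; i++)
            {
                short d = depthMm[i];
                // Pixels without a depth reading, or farther away than the head centre
                // (plus the margin), are treated as background.
                foreground[i] = d > 0 && d <= headCenterDepthMm + marginMm;
            }
            return foreground;
        }
    }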

3.6.5 Depth-Based Head Detector and Tracker

Since the depth-based head detection and tracking methods are based on the same priors about the human head shape, both the detection and the tracking functionality are provided by the DepthHeadDetectorAndTracker class (as shown in figure 3.15). The main responsibilities of the DepthHeadDetectorAndTracker class are:
- Preprocessing of depth information (depth shadow elimination and integral-image based real-time depth image blurring),
- Viewer's head detection using the Peters and Garstka method (implemented closely following section 2.5.1),
- Head tracking using a modified CAMShift algorithm (implemented following section 2.5.2).

6 Due to space constraints, further background subtractor implementation details are given in appendix E.2.2.

Figure 3.15: UML 2.0 class diagram of the depth-based head detector and tracker.

The DepthHeadDetectorAndTracker class exposes the means to set the integral-image based blur radius r (as shown in figure 2.10) and to enable/disable the depth shadow elimination (as shown in figure 2.11). The only slight deviation from the theory described in section 2.5.1 when implementing the head detector is that the minimum head height requirement is relaxed from 25 cm to 15 cm (increasing the detection rate). While this modification can also result in a higher number of false positives, using a modified CAMShift tracker (as described in 2.5.2) to track the detected head prevents this from happening in practice. In particular, the search window expands to the whole head area if the regions with high face probability are connected, or degenerates to the minimum if two different physical objects were detected as a single object (since the first moment becomes relatively small compared to the initial search window size).

3.6.6 Tracking Postprocessing

Noise in depth and colour images creates instabilities when tracking the viewer's face/head (i.e. even if the viewer is not moving between consecutive frames, the detected head/face positions might slightly differ). The noise sources present in depth images are briefly discussed in section D.1.2. The main noise sources present in colour images produced by Kinect's RGB camera are:

Figure 3.16: UML 2.0 class diagram of the tracking post-processing filters used in HT3D library.

- photon shot noise (a spatially and temporally random phenomenon arising due to Poisson-like fluctuations with which photons arrive at sensor elements),
- sensor read noise (voltage fluctuations in the signal processing chain, from the sensor element readout to ISO gain and digitization) and quantization noise (rounding of the analogue voltage signal to the nearest integer value in the ADC),
- pixel response non-uniformity, or PRNU (differences in sensor element efficiencies in capturing and counting photons, due to variations in their manufacturing), and so on.

To help mitigate these face-/head-tracking noise issues, two simple filter classes are implemented (both are sketched in code after this list):
- The ImpulseFilter class serves as an exponentially weighted moving average (EWMA) implementation of an infinite impulse response (IIR) filter, attenuating low-amplitude jitter in the head movements. Given the input vector x_t, the filtered value x̃_t is obtained by calculating

      x̃_t = (1 − α) x̃_{t−1} + α x_t,    (3.2)

  where α is the smoothing (attenuation) factor. The initial value x̃_0 is equal to the first value of x obtained, i.e. x̃_0 <- x_0.
- The HighPassFilter class implements a discrete-time RC high-pass filter, which is used to smooth out the transitions when one of the trackers loses track of the head/face. Given the input vector x_t, the high-pass filtered value x̃_t is obtained by calculating

      x̃_t = α x̃_{t−1} + α (x_t − x_{t−1}),    (3.3)

  where α is the smoothing factor. The initial value x̃_0 is equal to the first value of x obtained, i.e. x̃_0 <- x_0.
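A minimal C# sketch of the two filters for scalar inputs is shown below (hypothetical class names chosen to mirror the description; the real ImpulseFilter and HighPassFilter classes operate on vector-valued positions).

    // EWMA (IIR) low-pass filter, equation 3.2.
    class EwmaFilter
    {
        private readonly double alpha;
        private double state;
        private bool initialized;

        public EwmaFilter(double alpha) { this.alpha = alpha; }

        public double Filter(double x)
        {
            if (!initialized) { state = x; initialized = true; return state; }
            state = (1.0 - alpha) * state + alpha * x;   // attenuates low-amplitude jitter
            return state;
        }
    }

    // Discrete-time RC high-pass filter, equation 3.3.
    class RcHighPassFilter
    {
        private readonly double alpha;
        private double previousInput;
        private double state;
        private bool initialized;

        public RcHighPassFilter(double alpha) { this.alpha = alpha; }

        public double Filter(double x)
        {
            if (!initialized) { previousInput = x; state = x; initialized = true; return state; }
            state = alpha * state + alpha * (x - previousInput);   // passes only rapid changes
            previousInput = x;
            return state;
        }
    }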

In the overall HT3D architecture, the ImpulseFilter class is used to reduce the noise in the output face-/head-tracking rectangles returned by the colour and depth trackers. If both trackers are locked onto the viewer's head, then the HeadTracker calculates the location of the head centroid as the arithmetic average of the two rectangle centers; otherwise the centroid location is equal to the center of the tracking rectangle (if there is one). To avoid a sudden face centroid jump if one of the trackers loses track of the face, a HighPassFilter is used. In particular, a high-pass filtered frame-by-frame change in centroid positions is subtracted from the predicted face centroid position in that particular frame to obtain the final prediction (which is returned to the user of the library).

3.7 3D Display Simulator

Figure 3.17: UML 2.0 class diagram of the 3D display simulation program.

In the final part of the project's implementation, the HT3D library is used to simulate horizontal and vertical motion parallax in a 3D game. The game is largely based on the Blockout video game (published by California Dreams in 1989), and is essentially an extension of Tetris into the third dimension (hence the name, Z-Tris). The purpose of the game is to solve a real-time packing problem by forming complete layers out of polycubes which are falling into a three-dimensional pit. For this reason, achieving the in-game goals requires accurate depth perception. The high-level class, component and deployment diagrams of the 3D display simulation program are shown in figures 3.17, 3.18 and 3.19 respectively. As illustrated in figure 3.17, the 3D display simulator consists of two small UI modules (3D Simulation Entry Point and Head Tracker Configuration), and a larger model-view-controller-based module (Z-Tris). This break-down into self-contained modules is based on very clear individual responsibilities, as described both below (for the Z-Tris module) and in appendix E.3 (for the UI modules).


Figure 3.18: UML 2.0 component diagram of the 3D display simulation program.

Figure 3.19: UML 2.0 deployment diagram of the 3D display simulation program (ZTris.exe) showing the required run-time components and artifacts.


Figure 3.20: An entry-point into the 3D display simulation program (MainForm class).

Figure 3.21: Head-tracker configuration GUI (ConfigurationForm) exposing all available HT3D library options.

Figure 3.22: Screenshot of the Z-Tris game: the viewer is looking down into the pit, i.e. the active (transparent) polycube is moving away from the player. Depth perception is simulated using the occlusion, relative density/height/size, perspective convergence, lighting and shadows, and texture gradient pictorial depth cues.

3.7.1 3D Game (Z-Tris)

As shown in the class diagram in figure E.4, the implementation of the game is based on an MVC (model-view-controller) architectural pattern (incorporating the Observer, Composite and Strategy design patterns) [9]. MVC facilitates a clear separation of concerns and responsibilities, reduces coupling, simplifies the growth of individual architectural units, supports powerful UIs (necessary for the 3D display simulation) and streamlines testing. Due to space limitations, the implementations of the Model, Controller and part of the View architectural units are discussed in appendix E.3.3. Since it is very important for the project aims that, in the process of rendering the game state, a number of depth cues are simulated pictorially (shown in the in-game screenshot 3.22), the depth-cue rendering part of the View is discussed below.

3.7.1.1 Generalized Perspective Projection

The occlusion, relative density, height and size, perspective convergence and motion parallax depth cues are simulated using the off-axis perspective projection, as described in section D.2.1. Given the viewer's head location in space (obtained by the ConfigurationForm from the HT3D library), the generalized (off-axis) perspective projection matrix G = P · M^T · T can be expressed using the OpenGL projection matrix stack as shown in code listing 3.1 (see section D.2.1 for a notation reminder).

Listing 3.1: Generalized projection matrix implementation in OpenGL.

    GL.MatrixMode(MatrixMode.Projection);
    GL.LoadIdentity();
    GL.Frustum(l, r, b, t, n, f);
    Matrix4 M_T = new Matrix4(v_r.x, v_r.y, v_r.z, 0,
                              v_u.x, v_u.y, v_u.z, 0,
                              v_n.x, v_n.y, v_n.z, 0,
                              0,     0,     0,     1);
    GL.MultMatrix(ref M_T);
    GL.Translate(-p_e.x, -p_e.y, -p_e.z);

3.7.1.2 Shading

To simulate the lighting depth cue, the default OpenGL Blinn-Phong shading model [6] is used. Vertex illumination is divided into emissive, ambient, diffuse (Lambertian) and specular components, which are computed independently and added together (summarized in eq. 3.4). The colour c_v of a vertex v is defined as

    c_v = c_{v,a} + c_{v,e} + Σ_{l ∈ lights} attenuation(l, v) · spotlight(l, v) ·
          [ c_{l,a} + max(⟨v_n, (l − v)/‖l − v‖⟩, 0) · c_{l,d} c_{v,d}
                    + max(⟨v_n, (l + v)/‖l + v‖⟩, 0)^{σ_v} · c_{l,s} c_{v,s} ],    (3.4)

where c_{v,e}, c_{v,a}, c_{v,d}, c_{v,s} are vertex v material's emissive, ambient, diffuse and specular normalized (i.e. between 0 and 1) colours respectively, σ_v is the shininess of vertex v, v_n is a normal vector to vertex v, and c_{l,a}, c_{l,d}, c_{l,s} are the light l's ambient, diffuse and specular normalized colours respectively. Between-vertex pixel values are interpolated using Gouraud [20] shading. In the Z-Tris implementation, the scene is lit by a single light source positioned in front of the pit, in the top left corner of the screen.
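For illustration, the sketch below evaluates the per-light bracketed term of equation 3.4 on the CPU for a single light (plain vector math; the actual game relies on the fixed-function OpenGL pipeline rather than computing this itself). The attenuation and spotlight factors are assumed to be 1, and the viewer is assumed to be at the origin, so the Blinn-Phong half-vector interpretation of (l + v)/‖l + v‖ is used.

    using System;
    using System.Numerics;

    static class BlinnPhongSketch
    {
        // Per-light contribution of equation 3.4 for one vertex; all colours are normalized RGB.
        public static Vector3 LightTerm(
            Vector3 vertexPos, Vector3 vertexNormal, Vector3 lightPos,
            Vector3 lightAmbient, Vector3 lightDiffuse, Vector3 lightSpecular,
            Vector3 matDiffuse, Vector3 matSpecular, float shininess)
        {
            Vector3 n = Vector3.Normalize(vertexNormal);
            Vector3 toLight = Vector3.Normalize(lightPos - vertexPos);

            // Diffuse (Lambertian) term.
            float diffuse = Math.Max(Vector3.Dot(n, toLight), 0f);

            // Blinn-Phong specular term using the half-vector of the light and view directions.
            Vector3 toViewer = Vector3.Normalize(-vertexPos);
            Vector3 half = Vector3.Normalize(toLight + toViewer);
            float specular = (float)Math.Pow(Math.Max(Vector3.Dot(n, half), 0f), shininess);

            return lightAmbient + diffuse * (lightDiffuse * matDiffuse)
                                + specular * (lightSpecular * matSpecular);
        }
    }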

Figure 3.23: Screenshot of the scene in the Z-Tris game rendered a) without and b) with shadows, generated using the Z-Pass technique.

3.7.1.3 Shadows

Section A.2.1 briefly discusses the relative importance of depth cues. In particular, shadows play an important role in understanding the position, size and the geometry of the light-occluding object, as well as the geometry of the objects on which the shadow is being cast [25, 19]. For this reason, a Z-Pass shadow rendering technique using stencil buffers (as described in detail in section D.2.2) is implemented. The algorithm itself is slightly optimized in the following way: instead of rendering the unlit scene, projecting shadow volumes and rendering the lit scene outside the shadow volumes again, a fully-lit scene is rendered first, then the shadow volumes are projected and a semi-transparent shadow mask is rendered onto the areas within the shadow volumes (saving one full scene rendering pass). Figure 3.23 shows the same scene rendered with/without shadows generated using the Z-Pass technique.

3.8 Summary

Based on the chosen development strategy, all required computer vision and image processing methods have been successfully implemented from scratch and integrated into the three main components of the system (the Distributed Viola-Jones Face Detector Training Framework, the Head-Tracking Library and the 3D Display Simulator). These components were developed using industry-standard design patterns and software engineering techniques, strictly adhering to the time frame given in the project proposal. In the final system, the output from the training framework (the face detector cascade) has been integrated into the HT3D library, which was then used by the proof-of-concept 3D display application to simulate pictorial and motion parallax depth cues (see http://zabarauskas.com/3d for a brief demonstration of the system in action).

Chapter 4 Evaluation
This chapter describes the evaluation metrics used and the results obtained for all three major architectural components from Chapter 3 (the Viola-Jones Face Detector, the Head-Tracking Library and the 3D Display Simulator). In particular, the Viola-Jones face detector evaluation characterizes the performance of the classifier cascade in terms of the false positive counts for a given detection rate. In the head-tracking library evaluation, the library's performance w.r.t. the average distance and the spatio-temporal overlaps between the tracker's prediction and the tagged ground-truth is analysed. Finally, the evaluation of the 3D display simulation program (Z-Tris) examines its correctness using different types of testing, and describes its run-time performance.

4.1 Viola-Jones Face Detector

4.1.1 Training Data

The positive face training database consisted of 4,916 upright full frontal images (24 × 24 px resolution), obtained from [11] (originally assembled by Michael Jones). A sample of the first 196 images from this set is shown in figure 4.1.

Figure 4.1: First 196 faces from the Viola-Jones face detector positive training image database.

Figure 4.2: Viola-Jones negative training images gathered using "aerial photograph", "foliage", "underwater", "Persian rug" and "cave" search queries.

A collection of 7,960 negative training images (i.e. images not containing faces) for the first layer was obtained from the same source. A further set of 2,384 larger-resolution (994 × 770 px on average) negative training images was manually assembled using the Google Image Downloader [18] tool, using search queries like "Persian rug", "aerial photograph", "foliage", etc. A few examples of such images are shown in figure 4.2. All training images were converted from 24 bits-per-pixel (bpp) colour images to 8-bpp grayscale bitmaps using the ImageMagick Mogrify command-line tool [26] (to reduce disk/RAM storage requirements), and stored on DS-Filestore.

The amount of training data collected was relatively small compared to the Viola-Jones implementation: 32.9 million non-face sub-windows contained in 2,384 images, compared to 350 million sub-windows contained in 9,500 images collected by Viola and Jones. The decision to stop further data mining was based on the project's time limitation (negative training images downloaded using Google Image Downloader had to be manually verified not to contain faces, which was a laborious and time-consuming task), storage limitation (the DS-Filestore storage quota limit had been reached by the current negative training image set) and, most importantly, the specifics of the intended use case for the detector being trained. In particular, the colour-based face detector is used only to initialize the face tracker in the video sequence, hence it is enough for a face to be detected within the first few seconds of use. For a 30 frames/second video rate, the viewer's face needs to be detected in one of the first few hundred input frames for the viewer not to experience any significant discomfort. The use of the face detector in spatio-temporal viewer tracking is therefore much less stringent than the classical face detection in still images task. Based on this observation, the thresholds of the strong classifiers in the input cascade can be increased, sufficiently reducing the false positive rates to compensate for the lack of training data (limited, of course, by the simultaneous reduction in detection rates).

4.1.2 Trained Cascade

Using the distributed training framework, a 22-layer cascade containing 1,828 decision stump classifiers has been trained (see figure 4.4). The first three selected classifiers, based on Haar-like features, are shown in figure 4.3.

Figure 4.3: First three Haar-like features selected by the AsymBoost training algorithm as weak classifiers; a) and b) were selected for the first, c) for the second layer of the cascade.

Figure 4.4: Weak-classifier count growth for different-size (in layers) cascades.

Individual layers of the cascade (strong classifiers) were trained using AsymBoost (as described in section C.1.1.1). Distributed training of the whole cascade on 65 CPU cores (Intel Core 2 Q9550 @ 2.83 GHz) took 20 hours, 15 minutes and 2 seconds. The breakdown of the training time into the main individual tasks is shown in table 4.1. Interestingly enough, it took more time to distribute the bootstrapped false-positive samples between all clients than to actually bootstrap them using the distributed framework.

Task                                                             Average time (s)
Distributed best weak classifier search                                    38.39
Training data distribution (per layer)                                    137.30
Distributed negative training sample bootstrapping (per layer)             86.20
Distributed negative training sample bootstrapping (per image)             0.0024

Table 4.1: Average execution times for the main distributed Viola-Jones cascade training tasks.

Hit and false alarm rates used for each layer in the cascade are shown in table 4.2.

Layer        1      2      3      4      5      6      7      8
Hit rate   0.960  0.965  0.970  0.975  0.980  0.985  0.990  0.995
FP rate    0.625  0.600  0.575  0.550  0.525  0.500  0.500  0.500

Table 4.2: Hit (detection) and false positive rate limits used for each layer of the cascade.

Figure 4.5: Three false positive samples (24 × 24 px), misclassified by the 22-layer Viola-Jones detector cascade.

For each new layer, 5,000 negative training image samples were bootstrapped using the distributed algorithm described in section 3.5.3. Out of the 32,988,622 negative training samples (obtained from the large-resolution negative training images), only 40 samples were misclassified as faces for the last round of training (three of these samples are shown in figure 4.5).

4.1.3 Face Detector Accuracy Evaluation

To compare the performance of the trained cascade with the Viola-Jones results, the cascade was evaluated on the CMU/MIT [36] upright frontal face evaluation set, containing 511 labelled frontal faces. The receiver operating characteristic (ROC) curve showing the trade-off between the detection and false alarm rates of both cascades is shown in figure 4.6. As expected, the cascade obtained by Viola and Jones performs significantly better. This performance difference can be attributed to the fact that Viola and Jones used 1,063% more data and trained an additional 16 cascade layers with 4,232 additional decision stump classifiers. Nevertheless, for the purposes of face detection in the context of face tracking, the trained cascade has proven to be completely adequate. In ten minutes of colour and depth recordings for the HT3D (Head Tracking in 3D) library evaluation (described below), the trained face detector achieved 97.9% face detection precision. This result is illustrated in figure F.3, where all face detections in the HT3D evaluation recordings are shown.

4.1.4 Face Detector Speed Evaluation

As described by Viola and Jones [43], the speed of the detector cascade directly relates to the number of rectangular features that have to be evaluated per search sub-window. Due to the cascaded structure, most of the search sub-windows are rejected very early in the cascade. In particular, for the CMU/MIT set the average number of decision stump weak classifiers evaluated per sub-window is 3.058 (out of the 1,828 present in the cascade)1,2.

1 Cf. 8 weak classifiers on average in the Viola and Jones cascade.
2 With the strong classifier rescaling coefficient of 0.4, as used in all HT3D evaluation recordings (see table 4.6).

Figure 4.6: Receiver operating characteristic curves for the distributively trained detector and the detector cascade trained by Viola and Jones. Both ROC curves were established by running the face detector on the CMU/MIT frontal face evaluation set. The face detector search window is shifted by [sΔ], where s is the scale, initialized to 1.0 (progressively increased by 25%), and Δ is the shift factor, initialized to 1.0. Duplicate detections are merged if the area of their intersection is larger than half of the area of either individual detection rectangle. To obtain the full ROC curve, the thresholds of the individual classifiers are progressively increased for the distributively trained detector, decreasing both the detection rate and the false positive count. A false positive rate can be obtained by dividing the false positive count by 69,055,978.

Figure 4.7: Trained Viola-Jones face detector evaluation tool. Two sample images from the MIT/CMU set are shown; ground-truth is marked with red/blue/white dots, and the detector's output (produced using default settings, as given in table 4.6) is shown in green.

A C# implementation of the face detector achieved comparable performance to the one described by Viola and Jones [43]. In particular, the trained detector was able to process a 384 × 288 px image in 0.028 seconds on average (achieving a 35.71 frames-per-second processing rate), using a starting scale s = 1.25 and a step Δ = 1.5. While the image processing speed achieved by the trained cascade is 239% faster than the speed described in [43] under similar detector settings, it is unclear how much of this speed-up can be directly attributed to the shorter cascade size and the smaller number of weak classifiers evaluated per sub-window3.

4.1.5 Summary

A 22-layer frontal/upright face detector cascade has been successfully trained in a very short time frame (less than a day) using the distributed Viola-Jones framework implementation. While the performance of the cascade was limited by the amount of training data available (over 32.9 million negative training samples were exhausted), the achieved performance proved to be adequate for the face-tracking tasks. The face detector was also able to process 384 × 288 px input images at 35.71 FPS, making it suitable for real-time applications.

4.2 HT3D (Head-Tracking in 3D) Library

4.2.1 Tracking Accuracy Evaluation

The performance of the 3D display simulation program (the main project aim) crucially depends on accurate localization of the viewer's head in space. To that end, the Kinect SDK is used to obtain the relative location of a point in space corresponding to a speculated head-center pixel, hence accurately finding the head-center pixel coordinates is crucial to the overall project's success.

4.2.1.1 Evaluation Data

At the time of writing this dissertation, no standardized benchmark containing both colour and depth data for face tracking evaluation was available. A set of evaluation data was manually collected using the StatisticsHandler class from the HT3D library (section 3.6.1). All videos in the set were taken to reflect conditions that might naturally occur when a single viewer is observing a 3D display, including head rotations/translations, changing lighting conditions, cluttered backgrounds, occlusions and even multiple viewers present in the frame. In total, 10 minutes of depth and colour data feed from Kinect were recorded at 27.5 average FPS (totalling over 16,000 frames).
3 In particular, it is unclear what speed improvement could have been achieved just by using a faster CPU, because of different operating systems, different implementation programming languages, and so on.

All scenarios covered in this evaluation set are given in table 4.3.

4.2.1.2 External Participant Recordings

Recordings for participants #1 to #5 were taken as part of the "Measuring Head Detection and Tracking System Accuracy" experiment4. Before the experiment, a possible range of head/face muscle motions that could be performed was suggested to each participant. Then each participant was asked to move his/her head in a free-form manner, and two colour and depth videos (each 30 seconds long) were recorded5.

4.2.1.3 Ground-Truth Establishment

In order to establish the head position ground-truth in the recorded colour and depth videos, a laborious manual-tagging process is required. To alleviate some of the difficulties associated with this process, a video tagging tool named Head Position Tagger was implemented using C# for .NET Framework 4.0 (see figure 4.8). Using this tool, the location of the head in the aligned colour and depth image can be specified by manually best-fitting an ellipse. The ratio of the minor and major ellipse axes is set to 2/3, hence only two points are needed to fully describe an ellipse (viz. the antipodal points on the major axis); a small sketch of this parametrization is given below. These two points are given using the mouse (in a single click-and-drag motion). The position/orientation/size of the ellipse can then be further adjusted using the keyboard. Furthermore, the ground-truth locations are linearly interpolated in between frames, hence only the start and end ground-truth locations need to be established for spatially-continuous head motions. Using this tool, 2,437 out of 16,489 frames were tagged, accounting for 17.3% of the total frames (121.85 frames out of 703 were tagged per video on average, with σ = 36.61), with the rest of the frames interpolated. Around 30 minutes were spent on tagging an individual video. Based on the main project's assumption (viz. the presence of a single viewer in the image), a single face was tagged in every frame6 (including cases where the viewer's head was partially occluded, or was partially out of frame).
4 The experiment consent form describing the manner of the experiment in more detail is given in appendix H.
5 Recorded videos were kept in accordance with the Data Protection Act and will be destroyed after the submission of the dissertation.
6 For the Multiple viewers scenario, the viewer that was present in the recording for the longest time was tagged.
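A small sketch of the two-point parametrization mentioned above (a hypothetical helper, not the actual Head Position Tagger code):

    using System;
    using System.Drawing;

    static class EllipseFromMajorAxis
    {
        // p1 and p2 are the antipodal points on the major axis; the minor/major axis ratio is 2/3.
        public static void Describe(PointF p1, PointF p2, out PointF center,
                                    out float majorAxis, out float minorAxis, out float angleDegrees)
        {
            center = new PointF((p1.X + p2.X) / 2f, (p1.Y + p2.Y) / 2f);
            float dx = p2.X - p1.X, dy = p2.Y - p1.Y;
            majorAxis = (float)Math.Sqrt(dx * dx + dy * dy);
            minorAxis = majorAxis * 2f / 3f;                              // fixed 2:3 axis ratio
            angleDegrees = (float)(Math.Atan2(dy, dx) * 180.0 / Math.PI); // major-axis orientation
        }
    }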

Scenario                                      Frames   Length (sec.)   Brief description
F.4  Head rotation (roll)                        839           29.95   Head roll of 70°.
F.6  Head rotation (yaw)                         828           29.95   Head yaw of 160°.
F.8  Head rotation (pitch)                       799           29.98   Head pitch of 90°.
F.10 Head rotation (all)                         812           29.98   Combined head roll, yaw and pitch.
F.12 Head translation (horizontal/vertical)      831           29.95   Head translation in 80% of the horizontal FOV, 70% of the vertical FOV.
F.14 Head translation (anterior-posterior)       821           29.98   Head translation in 80% of Kinect's depth range.
F.16 Head translation (all)                      822           29.95   Combined horizontal, vertical and anterior-posterior translation.
F.18 Head rotation and translation (all)         787           30.01   Combined head roll, yaw, pitch and horizontal, vertical, anterior/posterior translation (6 degrees-of-freedom).
F.20 Participant #1                              813           29.94   Face occlusion, varying facial expressions, partial head movement out of frame.
F.22 Participant #2                              831           29.96   Varying facial expressions, fast spatial motions.
F.24 Participant #3                              846           29.98   Partial and full face occlusion by hair and hands, fast spatial motions, changing facial expressions.
F.26 Participant #4                              848           29.95   Skin-hued clothing, partial face occlusion, varying facial expressions.
F.28 Participant #5                              828           29.98   Changing facial appearance (removing glasses, releasing the hair), partial face occlusion.
F.30 Illumination (low)                          788           29.98   Difficult lighting conditions (with only the monitor glare illuminating an otherwise dark scene).
F.32 Illumination (changing)                     849           29.98   Single light source moving around the scene.
F.34 Illumination (high)                         843           29.97   Direct sunlight (with depth data only partially present).
F.36 Changing facial expressions                 819           29.88   Drastically changing facial expressions.
F.38 Cluttered similar-hue background            848           29.97   Scene with a skin-hue background and multiple skin-hue objects.
F.40 Occlusions                                  809           29.98   Full head occlusions by multiple skin-hue and head-shaped objects.
F.42 Multiple viewers                            828           29.98   Two spectators present in the scene.
Total:                                        16,497           599.3

Table 4.3: Head-tracking evaluation set. Each recording consists of uncompressed input from Kinect's depth and colour sensors (320 × 240 px / 12 bits-per-pixel, and 640 × 480 px / 32 bits-per-pixel respectively), and the aligned colour and depth image (320 × 240 px / 32 bits-per-pixel). The total size of all recordings is 24.3 GB.

Figure 4.8: Head Position Tagger tool GUI. Frame 112 of the Occlusions recording is being tagged; the head position marker is shown in red.

Figure 4.9: Ground-truth objects tagged by two different annotators in frames 160, 466 and 617 of the Participant #1 recording. The blue and red ellipses represent the objects tagged by annotator #1 and annotator #2 respectively, for t ∈ {160, 466, 617}.

4.2.1.4 Evaluation Metrics

Three main metrics are used when evaluating the different tracker performances on the evaluation set recordings7:
- the average normalized distance metric, which measures the average normalized distance between the predicted and ground-truth head centers (a per-frame sketch is given after this list),
- the STDA metric, which measures the spatio-temporal overlap (i.e. the ratio of the spatial intersection and union, averaged over time) between the ground-truth and the detected objects,
- the MOTA/MOTP metrics, which evaluate i) tracking precision as the total error in the estimated positions of ground-truth/detection pairs for the whole sequence, averaged over the total number of matches made, and ii) tracking accuracy as the cumulative ratio of misses, false alarms and mismatches in the recording, computed over the number of objects present in all frames.
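A per-frame sketch of the first two quantities is given below (a hypothetical helper; the full metric definitions, including temporal averaging and the matching of detections to ground-truth objects, are in appendix F.1). It treats head regions as rectangles and assumes the centre distance is normalized by the ground-truth head size.

    using System;
    using System.Drawing;

    static class FrameMetrics
    {
        // Distance between the predicted and ground-truth centres, normalized by the head size.
        public static double NormalizedCenterDistance(RectangleF predicted, RectangleF groundTruth)
        {
            PointF p = Center(predicted), g = Center(groundTruth);
            double distance = Math.Sqrt((p.X - g.X) * (p.X - g.X) + (p.Y - g.Y) * (p.Y - g.Y));
            return distance / Math.Max(groundTruth.Width, groundTruth.Height);
        }

        // Spatial overlap ratio: intersection area over union area.
        public static double OverlapRatio(RectangleF predicted, RectangleF groundTruth)
        {
            RectangleF inter = RectangleF.Intersect(predicted, groundTruth);
            double interArea = Math.Max(0f, inter.Width) * Math.Max(0f, inter.Height);
            double unionArea = Area(predicted) + Area(groundTruth) - interArea;
            return unionArea > 0 ? interArea / unionArea : 0;
        }

        private static PointF Center(RectangleF r)
        {
            return new PointF(r.X + r.Width / 2f, r.Y + r.Height / 2f);
        }

        private static double Area(RectangleF r) { return (double)r.Width * r.Height; }
    }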

4.2.1.5 Inter-Annotator Agreement

Even humans do not entirely agree about the exact location of the head in an image (especially for partially occluded head images, motion-blurred head images, etc.). To establish an indication of the upper limit of the system's performance, two recordings (Participant #1 and Participant #2) were independently tagged by two annotators (1,644 frames in total). Inter-annotator agreement was established for the STDA, MOTA, MOTP and average normalized distance metrics, with the tracker output object D_i^(t) in the metric definitions replaced by the object tagged by annotator #2, as illustrated in figure 4.9. The average distance between head centers as marked by both annotators was approximately 9.8% of the head size (as indicated by the average normalized distance measure). Similarly, an 82.9% spatio-temporal overlap ratio for the tagged ground-truths was achieved (STDA measure).
7 See appendix F.1 for full metric descriptions.

Figure 4.10: Inter-annotator average normalized distance metric evolution over time for the Participant #1 recording (annotator #1 ground-truth is arbitrarily used as the baseline).

Complete results for all metrics obtained from the Participant #1 and Participant #2 recordings are listed in table 4.5, and the balance of the fault modes can be seen from the confusion matrix in table 4.4.

                                 Annotator #1
                            Overlap ≥ 75%   Overlap < 75%
Annotator #2  Overlap ≥ 75%         1,505              46
              Overlap < 75%            93               0

Table 4.4: Inter-annotator confusion matrix (in # of frames). The overlap is measured as the proportion of the area tagged by both annotators versus the area tagged by just a single annotator (i.e. the intersection area divided by the area tagged by annotator #1, and by the area tagged by annotator #2, respectively).

4.2.1.6 Evaluation Procedure

All scenarios in table 4.3 were tested using the same set of HT3D head-tracker parameters, as given in table 4.6. For each recording, the raw depth and colour streams were loaded using StatisticsHandler and fed into the HT3D library core using the same data path as for live data from the Kinect sensor. The individual colour/depth/combined trackers were initialized at the first frame of the recording, and the predicted head area/head center coordinates in each frame were serialized to an XML file.

Recording        Frame count   Avg. norm. distance    STDA     MOTA     MOTP
Participant #1           813                0.1374   0.8122   0.8370   0.8122
Participant #2           831                0.0592   0.8447   0.8923   0.8447
Total:                 1,644                0.0979   0.8286   0.8650   0.8286

Table 4.5: Inter-annotator agreement for all evaluation metrics.

The serialized tracker output and the ground-truth data were then loaded into the Head Position Tagger tool, and a report containing the evaluated STDA, MOTA, MOTP and average normalized distance metrics was generated.

4.2.1.7 Evaluation Results

Colour, depth and combined tracker8 performances for the evaluation recordings with respect to the average normalized distance, STDA, MOTA and MOTP metrics are discussed below.

Average Normalized Distance from the Head Center

The results of the average normalized distance metric conclusively show that both the depth and the combined trackers perform better than the colour-only tracker on the given input recordings. In particular, both the depth and the combined trackers performed better than the colour one in 18/20 recordings. While the difference between the performances of the depth and combined trackers is much smaller, the combined tracker still outperforms the depth tracker in 14/20 recordings, and achieves a slightly better total result. Nonetheless, all trackers fell short of the gold inter-annotator agreement standard. For illustration purposes, figures 4.11 and 4.13 show the evolution of this measure over time for the Participant #5 and Illumination (high) recordings respectively. Similar analyses for the remaining recordings are given in appendix F.2.2. A summary of the measure for all the recordings is shown in figure 4.16.

8 Using default settings as given in table 4.6, unless otherwise noted.

Figure 4.11: Participant #5 recording (frames 70, 174, 240, 298, 363, 458, 531, 537, 621, 705, 751 and 822). The marked red area indicates the output of the combined head-tracker.

Figure 4.12: Average normalized distance metric evolution over time for the Participant #5 recording.

Figure 4.13: Illumination (high) recording (frames 0, 110, 174, 326, 359, 392, 721 and 767). The marked red area indicates the output of the combined head-tracker.

Figure 4.14: Illumination (high) recording depth frames (frames 14, 69, 121 and 246). Blue colour indicates the areas of the image where no depth data is present. More depth data becomes available towards the end of the recording due to the reduced amount of sunlight in the scene.

Figure 4.15: Average normalized distance metric evolution over time for the Illumination (high) recording.

Figure 4.16: Average normalized distance from the head center metric for all evaluation recordings (default settings). Lower values indicate better performance.

Figure 4.17: Average normalized distance metric for all evaluation recordings (custom settings: increased ColourTrackerSaturationThreshold and ColourTrackerValueThreshold values for the Illumination (low), Cluttered similar-hue background and Occlusions recordings).

Figure 4.18: STDA (Sequence Track Detection Accuracy) metric for all evaluation recordings (higher values indicate better performance). Average (mean) STDA metric values for the individual trackers are given in table 4.7.

STDA, MOTA and MOTP

Similar to the average normalized distance metric, the colour-based tracker is nearly always outperformed by both the depth and the combined trackers with regard to the STDA, MOTA and MOTP metrics. In particular, the depth and combined trackers perform better in 19/20 and 20/20 recordings respectively for the STDA metric (as shown in figure 4.18), 16/20 and 18/20 recordings respectively for MOTA, and 20/20 and 19/20 recordings respectively for the MOTP metric. The depth and combined trackers also consistently achieve better STDA/MOTA/MOTP results than the colour-only tracker, but, again, do not reach the performance of the gold standard (inter-annotator agreement). Interestingly, the purely depth-based tracker (as described in 2.5.2) performs better than the combined one w.r.t. the measures based on the spatio-temporally averaged ground-truth/detection overlap ratios. In particular, the depth-based tracker outperforms the combined tracker in 12/20 recordings using the STDA metric, 12/20 recordings using the MOTA metric and 16/20 recordings using the MOTP metric (see figure F.44 for MOTA/MOTP metric values per individual recording).

Figure 4.19: Kinect's colour and depth streams subsampled and rescaled by a factor of k (k = 1, 2, 4, 8).

The depth tracker also achieves slightly better total STDA/MOTA/MOTP results. These results can be partially explained by the fact that the combined tracker uses the intersection of the individual colour and depth tracker outputs to produce the final prediction. This approach can potentially reduce the number of both false positives (since both trackers have to vote for a pixel to be classified as part of the head) and false negatives (in cases where one of the trackers loses the track). While this approach increases the accuracy of the head-center localization as shown by the average normalized distance metric (which is crucial for the project's success), these benefits are outweighed by the slight increase in false negatives occurring in the majority of frames, due to the relatively poor performance of the colour tracker and the hence decreased intersection area.

4.2.1.8 Robustness to Undersampling and Noise

The effects of spatio-temporal undersampling and additive white Gaussian noise (AWGN) were also briefly investigated. The average normalized distance from the head center metric was calculated for the colour, depth and combined head trackers on the spatially and temporally undersampled Participant #2 recording (see figure 4.19). The results are summarized in figure 4.20; in brief, all trackers demonstrated good robustness to undersampling, indicating that these algorithms could potentially be applied to sensors with a lower resolution. Similarly, a varying degree of Gaussian noise was added to both the colour and depth streams of the Participant #2 recording (see figure 4.21). The results for the combined tracker are shown in figure 4.22. While all three trackers showed some degree of robustness to noise, it was observed that the depth tracker was much more error-prone to AWGN in the depth stream. This is possibly due to the head detection approach, which requires that the horizontal local minima satisfying equations 2.13 and 2.14 be found in a number of consecutive rows.
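A minimal sketch of how such zero-mean Gaussian noise could be added to a frame (a hypothetical helper using the Box-Muller transform; the deviation is given as a proportion of the maximum range value, 255 for colour and 4000 for depth, as in figure 4.22):

    using System;

    static class GaussianNoise
    {
        private static readonly Random rng = new Random();

        // Adds N(0, (sigmaFraction * maxRange)^2) noise to every sample, clamping to [0, maxRange].
        public static void AddTo(double[] samples, double sigmaFraction, double maxRange)
        {
            double sigma = sigmaFraction * maxRange;
            for (int i = 0; i < samples.Length; i++)
            {
                // Box-Muller transform: two uniform samples give one standard normal sample.
                double u1 = 1.0 - rng.NextDouble();
                double u2 = rng.NextDouble();
                double standardNormal = Math.Sqrt(-2.0 * Math.Log(u1)) * Math.Cos(2.0 * Math.PI * u2);

                double noisy = samples[i] + sigma * standardNormal;
                samples[i] = Math.Max(0.0, Math.Min(maxRange, noisy));
            }
        }
    }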

Figure 4.20: Average Normalized Distance from Head Center metric for the Participant #2 recording (using default tracker settings) under varying degrees of spatio-temporal undersampling. Lower values indicate better performance (notice the difference in vertical axis range for each of the trackers).

Figure 4.21: White Gaussian noise N ~ N(0, σ²) added to Kinect's colour and depth streams (σ = 0, 0.05, 0.1, 0.15 for each stream).

Figure 4.22: Average Normalized Distance from Head Center metric for the Participant #2 recording with added white Gaussian noise. The noise is distributed around zero, with the deviation given as a proportion of the maximum range value (255 for colour data, 4000 for depth data). Lower values indicate better performance.

Setting                              Default value
BackgroundSubtractorSensitivity      20
ColourDetectorSensitivity            0.4
ColourTrackerUseSaturation           True
ColourTrackerSaturationThreshold     32
ColourTrackerValueThreshold          64
BackgroundSubtractorType             BackgroundSubtractorType.DEPTH
DepthShadowEliminationEnabled        True
DepthSensorBlurRadius                1
ColourTrackerSensitivity             0.8
DepthTrackerSensitivity              0.8
CombinedTrackerSensitivity           0.4

Table 4.6: Default HT3D library settings.

4.2.1.9 Summary

Table 4.7 shows the average (mean) metric values for all tracker performances. While the inter-annotator agreement has not been reached, both the depth and the combined trackers have demonstrated good performance in recordings containing varying backgrounds and lighting conditions, and unconstrained viewer's head movements (with 70° roll, 160° yaw, 90° pitch, anterior/posterior translations within 40-400 cm, and horizontal/vertical translations within the FoV of the sensor).

Tracker                       Avg. norm. distance    STDA     MOTA     MOTP
Colour                                     0.8259   0.3764   0.4158   0.4438
Depth                                      0.3554   0.6024   0.5651   0.6552
Combined colour and depth                  0.3270   0.5926   0.5574   0.6066
Inter-annotator agreement                  0.0979   0.8286   0.8650   0.8286

Table 4.7: Tracker performance averaged over all evaluation recordings (obtained using default settings). Bold font indicates the best tracker values achieved for a given metric.

In particular, the combined tracker was able to predict the viewer's head center location within less than 1/3 of the head's size from the actual head center (most important for the main project's aim), and the depth tracker was able to achieve over 60% spatio-temporal overlap for the predicted head area. Regarding the relative performance of the different trackers, the main conclusion is that using depth data besides colour data significantly improves head tracking accuracy (as indicated by all metrics). This is mostly due to the very good performance of the CAMShift algorithm when applied to the head probability distribution, obtained from the depth data using the Peters and Garstka priors. The main performance losses of the combined tracking algorithm stemmed from the inaccuracies of the colour tracker in bad lighting conditions or in the presence of other similar-hue objects in the scene.

                                                  Average   Average        Minimum      Maximum
                                                      FPS   % CPU time1    % CPU time   % CPU time
Colour tracker (no background subtraction)         27.833        28.820        21.839       36.658
Colour tracker (Euclidean background subtractor)   27.795        36.230        28.079       47.038
Colour tracker (ViBe background subtractor)        27.914        48.751        42.118       56.157
Depth tracker                                      27.568        47.135        37.438       60.837
Combined tracker                                   28.243        56.806        43.678       76.436
Histogram backprojection rendering                 27.830        64.165        56.214       74.096
Background subtraction rendering                   27.460        39.578        32.726       46.018
Depth head probability rendering                   28.060        63.624        53.818       78.776
Depth image rendering                              27.966        63.759        53.818       74.802

Table 4.8: HT3D library performance when running the configuration GUI for 60 seconds on a dual-core hyperthreaded Intel Core i5-2410M CPU @ 2.30 GHz, with 8 GB RAM.

1 The percentage of time that a single CPU core (with hyperthreading enabled) was busy servicing the process.

4.2.2 Performance Evaluation

To successfully achieve the main project aim (3D display simulation) it is crucial that the HT3D DLL performance reaches real-time. In order to evaluate the run-time head-tracking costs in realistic conditions, the performance of the HT3D configuration GUI (see figure 3.21) was measured for various tracker settings. The HT3D configuration GUI was chosen as a good representative program since it introduces only minimal run-time overheads for data rendering (any project using the HT3D library would be likely to incur similar costs). Run-time performance was tested on the main development machine running 64-bit Windows 7 on a dual-core hyperthreaded Intel Core i5-2410M CPU @ 2.30 GHz, with 8 GB RAM. A 64-bit release build containing no debug information was measured using the Windows Performance Monitor and dotTrace Performance 5.0 [28] tools. Evaluation results are summarized in figures 4.23 and 4.24, and in table 4.8. In summary, all trackers achieved real-time performance: more than 27.4 frames per second were processed on a single CPU core (with raw input provided by the Kinect sensor at 30 Hz).


Figure 4.23: Performance of HT3D trackers when running the configuration GUI on a dual-core hyperthreaded Intel Core i5-2410M CPU @ 2.30 GHz.

Figure 4.24: HT3D background subtractor performance with a colour tracker when running the configuration GUI.

4.2.2.1 Hot Paths

Hot path analysis indicates where most of the work in the process has been performed (also known as the most active function call tree). Due to a rather clumsy depth and colour stream alignment implementation in Kinect SDK Beta 2, over 40% of the total head-tracking time has been spent aligning the colour and depth images (see figure 4.25). In order to perform this alignment, the SDK provides the function GetColorPixelCoordinatesFromDepthPixel. This function takes the coordinates of a depth pixel in the depth image, together with the depth pixel value, and returns the corresponding coordinates of a colour pixel in the colour image. This API design effectively means that in every single frame, for every single pixel (x_d, y_d) in the depth image, i) the function GetColorPixelCoordinatesFromDepthPixel has to be called to return the corresponding colour coordinates (x_c, y_c), ii) the colour image has to be referenced at coordinates (x_c, y_c) to obtain the colour value (r, g, b), and only then iii) the depth pixel (x_d, y_d) can be assigned the colour value (r, g, b).9,10

9 This flaw has been fixed in Kinect SDK v1 (released on 01/02/2012, after the Implementation Finish milestone of the project), where an API for full-frame conversion has been provided via the MapDepthFrameToColorFrame function. Assuming that the combined API yields a 10-fold performance improvement, Amdahl's law predicts that the overall head-tracker performance could be improved by over 38%, as shown in figure 4.25 b).
10 Incidentally, the updated image alignment API also supports 640×480 px resolution, hence the overall tracker resolution could be quadrupled from 320×240 px to 640×480 px.
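The per-pixel alignment pattern described above looks roughly as follows. This is a minimal illustrative sketch against the Kinect SDK Beta 2 managed API: the exact GetColorPixelCoordinatesFromDepthPixel parameter list is an assumption, and the surrounding names (nui, viewArea, depthPixels, colorPixels, alignedColor) are placeholders rather than the actual HT3D implementation.

// One SDK call per depth pixel, per frame - the measured hot path.
for (int yd = 0; yd < depthHeight; yd++)
{
    for (int xd = 0; xd < depthWidth; xd++)
    {
        short depthValue = depthPixels[yd * depthWidth + xd];
        int xc, yc;

        // i) map the depth pixel (xd, yd) to colour image coordinates (xc, yc):
        nui.NuiCamera.GetColorPixelCoordinatesFromDepthPixel(
            ImageResolution.Resolution640x480, viewArea, xd, yd, depthValue, out xc, out yc);

        // ii) read the colour value at (xc, yc), iii) assign it to the depth pixel (xd, yd):
        alignedColor[yd * depthWidth + xd] = colorPixels[yc * colorWidth + xc];
    }
}

With MapDepthFrameToColorFrame from SDK v1 the inner call disappears and the whole frame is mapped in one API invocation, which is what footnote 9's 38% estimate is based on.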


Figure 4.25: Hot path analysis for the HT3D library running the configuration GUI. The highlighted Kinect SDK function GetColorPixelCoordinatesFromDepthPixel performs the colour and depth image alignment. Image a) shows the unoptimized head-tracker performance (method nui_DepthFrameReady), while image b) shows the possible performance improvement obtained from a 10-fold reduction of GetColorPixelCoordinatesFromDepthPixel function calls.


4.3 3D Display Simulator (Z-Tris)11

The correctness of the Z-Tris implementation was evaluated using a combination of automated (unit, smoke, regression) and manual (white-box, functional, sanity, usability, integration) tests. Around 85.25% code coverage was achieved by automated unit tests for the core Z-Tris classes (a sample unit test run is shown in figure 4.26).

Figure 4.26: Sample Z-Tris unit test run in the Visual Studio Unit Testing Framework.

Regarding performance, Z-Tris (with the combined colour and depth head tracker enabled) achieved an average rendering speed of 29.969 frames per second, satisfying the real-time rendering requirements. Also, a single CPU core experienced an average load of 64.98%, indicating that some further processing resources were available.

11 Only the evaluation summary is presented in this section due to space limitations; see appendix G for more Z-Tris evaluation details.

Chapter 5 Conclusions
5.1 Accomplishments

The main project's aim (to simulate depth perception on a regular LCD screen through the use of the ubiquitous and affordable Microsoft Kinect sensor, without requiring the user to wear glasses or other headgear, or to modify the screen in any way) has been successfully achieved. While static images cannot do justice to the level of depth perception simulated by the system, a short video demonstration can be seen at http://zabarauskas.com/3d.
To achieve the main project's aim, the following new approaches were suggested:
• A distributed Viola-Jones face detector training framework. The framework, running on 65 CPU cores, was able to train a 22-layer detector cascade containing 1,828 decision stump classifiers in less than a day (vs. a three-week estimate using a naïve approach). The training process was limited only by the amount of data available (exhausting 32.9 million negative training examples).
• A real-time depth-based head tracker, combining the CAMShift tracking algorithm with Peters and Garstka priors. During 10 minutes of colour and depth recordings, the depth-based head tracker was able to achieve a better than 60% average spatio-temporal overlap ratio between the ground-truth objects and their predicted locations.
• A real-time combined (colour and depth) head tracker. During 10 minutes of evaluation recordings (containing unconstrained viewer's head movement in six degrees of freedom, in the presence of occlusions, changing facial expressions, different backgrounds and varying lighting conditions), the combined head tracker was able to predict the viewer's head center location within less than 1/3 of the head's size from the actual head center (on average).

To the same end, a number of published methods were implemented:


• Viola-Jones face detector (with depth cue extensions as suggested in [10]),
• Depth-based face detector, using the Peters and Garstka method,
• CAMShift face tracker (extended to use both hue and saturation data), and
• ViBe background subtractor (in itself an extension to the project).

All of these methods were combined into a robust and flexible HT3D head-tracking library. Finally, a proof-of-concept application was developed, creating depth perception on a regular LCD display by simulating continuous horizontal/vertical motion parallax (using the HT3D DLL) and a number of pictorial depth cues. Such systems could serve as potential backwards-compatibility providers during the transition from 2D to 3D displays (being able to render convincing 3D content on ubiquitous 2D displays).


5.2 Future Work

Despite the obvious improvement over the colour-based head tracker, neither the depth nor the combined tracker has reached the inter-annotator agreement (gold standard) results. The obvious next steps in increasing the tracker performance would be to i) train the Viola-Jones face detector using more data, and ii) port the HT3D library from Kinect SDK Beta 2 to Kinect SDK v1, increasing the depth resolution to 640×480 px (effectively quadrupling the amount of depth data present). A more interesting direction, however, would be to explore the applicability of well-performing colour-based methods to depth data. Possible examples include training a Viola-Jones face detector on depth images (involving the collection of a representative depth data training set), or exploring the applicability of adaptive background subtraction techniques (like ViBe) to depth image sequences. Based on the experience obtained throughout the project, it seems quite likely that these approaches could further improve the tracking accuracy. Furthermore, the head-tracking library could be extended to deal with multiple people (this would involve implementing partial/full occlusion disambiguation and object identification). Running both colour- and depth-based multiple-viewer trackers in parallel could potentially provide a significant advantage over systems based on only a single information source.

Bibliography
[1] Allen, J. G., Xu, R. Y. D., and Jin, J. S. Object Tracking Using CAMShift Algorithm and Multiple Quantized Feature Spaces. Reproduction 36 (2006), 3-7.
[2] Barnich, O., and Van Droogenbroeck, M. ViBe: A Universal Background Subtraction Algorithm for Video Sequences. IEEE Transactions on Image Processing 20, 6 (2011), 1709-1724.
[3] Beck, K., Beedle, M., van Bennekum, A., Cockburn, A., Cunningham, W., Fowler, M., Grenning, J., Highsmith, J., Hunt, A., Jeffries, R., Kern, J., Marick, B., Martin, R. C., Mellor, S., Schwaber, K., Sutherland, J., and Thomas, D. Manifesto for Agile Software Development, 2001.
[4] Benzie, P., Watson, J., Surman, P., Rakkolainen, I., Hopf, K., Urey, H., Sainov, V., and Kopylow, C. V. A Survey of 3DTV Displays: Techniques and Technologies, 2007.
[5] Bernardin, K., and Stiefelhagen, R. Evaluating Multiple Object Tracking Performance: The CLEAR MOT Metrics. EURASIP Journal on Image and Video Processing 2008 (2008), 1-10.
[6] Blinn, J. F. Models of Light Reflection for Computer Synthesized Pictures. ACM SIGGRAPH Computer Graphics 11, 2 (1977), 192-198.
[7] Boyle, M. The Effects of Capture Conditions on the CAMShift Face Tracker. Alberta, Canada: Department of Computer Science (2001).
[8] Bradski, G. Real Time Face and Object Tracking as a Component of a Perceptual User Interface. In Proceedings of the Fourth IEEE Workshop on Applications of Computer Vision (WACV '98), pp. 214-219.
[9] Burbeck, S. Applications Programming in Smalltalk-80: How to Use Model-View-Controller (MVC). http://st-www.cs.illinois.edu/users/smarch/st-docs/mvc.html. Last accessed on 07/04/2012.
[10] Burgin, W., Pantofaru, C., and Smart, W. D. Using Depth Information to Improve Face Detection. In Proceedings of the 6th International Conference on Human-Robot Interaction (New York, NY, USA, 2011), HRI '11, ACM, pp. 119-120.
[11] Carbonetto, P. Training Data for Robust Object Detection. http://www.cs.ubc.ca/~pcarbo.

[12] Crow, F. C. Shadow Algorithms for Computer Graphics. In Proceedings of the 4th Annual Conference on Computer Graphics and Interactive Techniques (1977), vol. 11, ACM Press, pp. 242-248.
[13] Cutting, J. E., and Vishton, P. M. Perceiving Layout and Knowing Distances: The Integration, Relative Potency, and Contextual Use of Different Information About Depth. Perception 5, 3 (1995), 137.
[14] Dodgson, N. A. Autostereoscopic 3D Displays. Computer 38, 8 (2005), 31-36.


[15] Freund, Y., and Schapire, R. E. A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting. Computational Learning Theory (1995), 119-139.
[16] Fukunaga, K., and Hostetler, L. The Estimation of the Gradient of a Density Function, with Applications in Pattern Recognition. IEEE Transactions on Information Theory 21, 1 (1975), 32-40.
[17] Garstka, J., and Peters, G. View-dependent 3D Projection using Depth-Image-based Head Tracking. In 8th IEEE International Workshop on Projector Camera Systems (PROCAMS) (2011), pp. 52-57.
[18] GiD. Google Image Downloader. http://googleimagedownloader.com.
[19] Goldstein, E. B. Sensation and Perception. Wadsworth Pub Co, 2009.
[20] Gouraud, H. Continuous Shading of Curved Surfaces. IEEE Transactions on Computers C-20, 6 (1971), 623-629.
[21] Heidmann, T. Real Shadows, Real Time, vol. 18. 1991.
[22] Herrera C., D., and Kannala, J. Accurate and Practical Calibration of a Depth and Color Camera Pair. Computer Analysis of Images and Patterns (2011).
[23] Holliman, N. 3D Display Systems. Handbook of Optoelectronics. IOP Press, London (2005).
[24] Holliman, N., Dodgson, N., Favalora, G., and Pockett, L. Three-Dimensional Displays: A Review and Applications Analysis. IEEE Transactions on Broadcasting 57, 99 (June 2011), 1-10.
[25] Hubona, G. S., Shirah, G. W., and Jennings, D. K. The Effects of Cast Shadows and Stereopsis on Performing Computer-Generated Spatial Tasks, 2004.
[26] ImageMagick. Mogrify Command-Line Tool. http://www.imagemagick.org/www/mogrify.html.
[27] Jensen, O. Implementing the Viola-Jones Face Detection Algorithm. M.Sc. Thesis, Informatics and Mathematical Modelling, Technical University of Denmark (2008).
[28] JetBrains. dotTrace 5.0 Performance. http://www.jetbrains.com/profiler.
[29] Jones, A., McDowall, I., Yamada, H., Bolas, M., and Debevec, P. Rendering for an Interactive 360° Light Field Display. ACM Transactions on Graphics (TOG) 26, 3 (2007), 40.
[30] Kooima, R. Generalized Perspective Projection. http://aoeu.snth.net/static/gen-perspective.pdf, 2009.

[31] Xia, L., Chen, C.-C., and Aggarwal, J. K. Human Detection Using Depth Information by Kinect. In Workshop on Human Activity Understanding from 3D Data in conjunction with CVPR (HAU3D) (Colorado Springs, USA, 2011).
[32] Manohar, V., Soundararajan, P., and Raju, H. Performance Evaluation of Object Detection and Tracking in Video. In Proceedings of the Seventh Asian Conference on Computer Vision (2006), pp. 151-161.

[33] Microsoft. Kinect for Windows SDK. http://www.microsoft.com/en-us/kinectforwindows.
[34] OpenTK. The Open Toolkit Library. http://www.opentk.com.
[35] Papageorgiou, C., and Oren, M. A General Framework for Object Detection. Computer Vision, 1998 (1998), 555-562.
[36] Rowley, H. A., Baluja, S., and Kanade, T. Neural Network-Based Face Detection. IEEE Transactions on Pattern Analysis and Machine Intelligence 20, 1 (1998), 23-38.
[37] Schapire, R. Improved Boosting Algorithms Using Confidence-Rated Predictions. Machine Learning (1999).
[38] Shoushtarian, B., and Bez, H. E. A Practical Adaptive Approach for Dynamic Background Subtraction Using an Invariant Colour Model and Object Tracking. Pattern Recognition Letters 26, 1 (2005), 5-26.
[39] Stiefelhagen, R., Bernardin, K., Bowers, R., Rose, R. T., Michel, M., and Garofolo, J. The CLEAR 2007 Evaluation. Multimodal Technologies for Perception of Humans 4625 (2008), 3-34.
[40] Urey, H., Chellappan, K. V., Erden, E., and Surman, P. State of the Art in Stereoscopic and Autostereoscopic Displays. Proceedings of the IEEE 99, 4 (Apr. 2011), 540-555.
[41] Viola, P. Rapid Object Detection Using a Boosted Cascade of Simple Features. In Proceedings of CVPR 2001 (2001).
[42] Viola, P., and Jones, M. Fast and Robust Classification using Asymmetric AdaBoost and a Detector Cascade. Advances in Neural Information Processing Systems 14 (2002), 1311-1318.
[43] Viola, P., and Jones, M. Robust Real-Time Face Detection. Int. J. Comput. Vision 57, 2 (May 2004), 137-154.
[44] Xia, L., Chen, C.-C., and Aggarwal, J. K. Human Detection Using Depth Information by Kinect. Pattern Recognition (2011), 15-22.
[45] Zhang, C., Yin, Z., and Florencio, D. Improving Depth Perception with Motion Parallax and its Application in Teleconferencing. In IEEE International Workshop on Multimedia Signal Processing (MMSP '09) (2009), IEEE, pp. 1-6.

Appendix A Depth Cue Perception


A.1 Oculomotor Cues

Oculomotor cues are created by two phenomena: convergence and accommodation. Convergence is the inward movement of the eye (created by stretching the extraocular muscles) that occurs when the object of focus moves closer to the eye (see figure A.1). The kinesthetic sensations that arise are processed in the visual cortex and serve as cues for depth perception. Accommodation is the change in the shape of the eye lens that occurs when the sight is focused on objects at different distances. Ciliary muscles stretch the lens, making it thinner and thus changing the eye's focal length (see figure A.2). Similarly to convergence, the kinesthetic sensations that arise from contracting and relaxing the ciliary muscles serve as basic cues for distance interpretation. Both of these phenomena are most effective at a range of up to 10 meters from the observer [13] and provide absolute distance information.

A.2 Monocular Cues

Monocular cues provide depth information when the scene is viewed with just one eye. They are typically split into pictorial and motion cues.

Figure A.1: Eye convergence on a) near and b) far target.


Figure A.2: Right eye accommodation on a) near and b) far target.

A.2.1 Pictorial Cues

Pictorial cues are the sources of depth information that are present purely in the image formed on the retina. They include:
• Occlusion, which occurs when one object is hiding another from view. The partially hidden object is then interpreted as being farther away.
• Relative height. An object that is below the horizon and has its base higher in the field-of-view is interpreted as being farther away.
• Relative size, which occurs when two objects that are of equal size occupy different amounts of space in the field-of-view. Typically, the object that subtends a larger visual angle on the retina than the other is interpreted as being closer. If the object's size is known, then this prior knowledge can be combined with the angle that the object subtends on the retina to provide cues about its absolute distance.
• Relative density, which occurs when a cluster of objects or texture features have a characteristic spacing on the retina, and the observer is able to infer the distance to the cluster by the perspective foreshortening effects on this characteristic spacing.
• Perspective convergence, which occurs when parallel lines extending from the observer appear to be converging at infinity. The distance between these two lines provides hints about the distances from the observer to objects on these lines.
• Atmospheric perspective, which occurs when objects in the distance appear less sharp, have lower luminance contrast, lower colour saturation, and colours slightly shifted towards the blue end of the spectrum. This happens because the light from far-away objects is scattered by small particles in the air (water droplets, dust, airborne pollution).
• Lighting and shadows. The way that light reflects off the surfaces of an object and the shadows that are cast provide cues to the visual cortex to determine both the shape and the relative position of the objects.
• Texture gradient, which manifests as a decrease in the fineness of texture details with increasing distance from the observer. This change in texture detail as the objects recede is detected in the parietal cortex and provides further depth information.


Figure A.3: A photograph exhibiting a number of pictorial depth cues: occlusion, relative height, size and density, atmospheric perspective, lighting and shadows, and texture gradient.
Most of these pictorial depth cues are demonstrated in figure A.3.

A.2.2 Motion Cues

All the cues described above are present for the stationary observer. However, if the observer is in motion, the following new cues emerge that further enhance human perception of depth:
• Motion parallax, which occurs when objects closer to the moving observer seem to move faster and in the opposite direction to the movement of the observer, whereas objects farther away move more slowly and in the same direction. This difference in motion speeds provides hints about their relative distance. Given surface markings and some knowledge about the observer's position, motion parallax can yield an absolute measure of depth at each point of the scene.
• Deletion and accretion. Deletion occurs when an object in the background gets covered by the object in front when the observer moves, and accretion occurs when the observer moves in the opposite direction and the object in the background gets uncovered. This information can then be used to infer depth order.


Figure A.4: The points on the left and right retinae with the same relative angle from the fovea are known as the corresponding retinal points (or cover points). Absolute disparity is the angle between two corresponding retinal points. The horopter is an imaginary surface that passes through the point of fixation; only images of the objects on the horopter fall on corresponding points on the two retinae; they also have an absolute disparity equal to zero (e.g. objects A1 and B in picture a). Relative disparity is the difference between two objects' absolute disparities. Notice that the absolute disparity of the object A1 changes from 0 in picture a) to a non-zero value in picture b), but the relative disparity between objects A1 and A2 remains constant.

A.3 Binocular Cues

In the average adult human, the eyes are horizontally separated by about 6 cm, hence even when looking at the same scene, the images formed on the retinae are different. The difference in the images in the left and right eyes is known as binocular disparity. Binocular disparity gives rise to two phenomena that provide information about the distances of objects, absolute disparity and relative disparity, which are illustrated in figure A.4. It has been shown that this information about depth which is present in the geometry (both absolute and relative disparity) is actually translated into depth perception in the brain, creating the stereopsis depth cue. In particular, neurons in the striate cortex respond to absolute disparity (Uka & DeAngelis, 2003), and neurons higher up in the visual system (temporal lobe and other areas) respond to relative disparity (Parker, 2007).

Appendix B 3D Display Technologies


B.1 Binocular (Two-View) Displays

Binocular displays generate two separate viewing zones, one for each eye. Various multiplexing methods (and their combinations) are used to provide the binocular separation of the views:

• Wave-length division, used in anaglyph-type (wavelength-selective) displays (e.g. red/cyan colour channel separation using anaglyph glasses, or amber/blue channel separation used in the ColorCode 3D display system, both shown in figure B.1). Most of the technologies based on wave-length division require eyewear (stereoscopic displays).
• Space/direction division, used in parallax-barrier type and lenticular-type displays. These are mainly autostereoscopic displays, i.e. they do not require glasses. Also, a number of space/direction division based displays can be combined with head tracking to provide viewing zone movement (using shifting parallax barriers/lenticulars, or a steerable backlight).
• Time division, used in active LCD-shutter glasses (e.g. the DepthQ system, figure B.2).
• Polarization division, used in systems requiring passive polariser glasses (e.g. RealD ZScreen, figure B.2).

Figure B.1: Wave-length division display technologies: a) red/cyan channel multiplexed glasses for anaglyph 3D image viewing, b) patented ColorCode 3D display system that uses amber/blue colour channel multiplexing to produce full colour 3D images.


Figure B.2: Time- and polarization-division based technologies: a) the RealD ZScreen display system, which uses a single projector equipped with an electrically controllable polarization rotator to produce orthogonally polarized frames, b) the DepthQ display system, which uses a single projector with time-multiplexed output (to be viewed with active liquid-crystal-based shutter glasses).

B.2 Multi-View Displays

Multi-view displays create a fixed set of viewing zones across the viewing field, in which different stereo pairs are presented. Typical implementation techniques for this type of display include:

• Combination of pixelated emissive displays with static parallax barriers or lenticular arrays (integral imaging displays). For the latter, hemispherical (as opposed to cylindrical) lenslets can be used to provide vertical, as well as horizontal, parallax. However, constraints on pixel size and resolution in LCD or plasma displays limit horizontal multiplexing to a small number of views [14]. Also, parallax barriers can cause a significant light loss with an increasing number of views, whereas lenticular displays magnify the underlying subpixel structure of the device, creating dark transitions between viewing zones.
• Multiprojector displays, where the image from each projector is projected onto the entire double-lenticular screen, but is visible only within the corresponding viewing regions at the optimal viewing distance. These displays require a very precise alignment of projected images, and are extremely costly since they require a single projector per view.
• Time-sequential displays, where the different views are generated by a single display device running at a very high frame rate. A secondary optical component (synchronized to the image-generation device) then directs the images at different time-slots to different viewing zones. An example implementation using a high-speed CRT monitor and liquid crystal shutters in a lens array has been developed at Cambridge (see figure B.3). However, the optical path length required by such displays reduces their commercial appeal in comparison to flat-panel displays [23].


Figure B.3: Multi-view and light-field 3D display technologies: image a) shows a 25″ diagonal, 28-view time-multiplexed autostereoscopic display system developed at Cambridge. A high-speed CRT display renders each view sequentially and the synchronised LCD shutters direct the view through a Fresnel field lens at the appropriate angle. Image b) shows the light-field display system described by Jones et al. [29], consisting of a high-speed video projector and a spinning mirror covered by a holographic diffuser.

B.3 Light-Field (Volumetric and Holographic) Displays

Light-field displays simulate light travelling in every direction through every point in the image volume. Volumetric displays generate images by rendering each point of the scene at its actual position in space through slice-stacking, solid-state processes, open-air plasma effects and so on. Sample implementations of such displays include laser projection onto a spinning helix (Lewis et al.), varifocal mirror displays (Traub) or swept-screen systems (Hirsch). Holographic displays attempt to reconstruct the light-field of a 3D scene in space by modulating coherent light (e.g. with spatial light modulators, liquid crystals on silicon, etc.). Two commercial examples of holographic displays are Holografika, which uses a sheet of holographic optical elements as its principal screen, and the QinetiQ system, which uses optically-addressed spatial light modulators. Another light-field display system is described by Jones et al. [29]; it consists of a high-speed video projector and a spinning mirror covered by a holographic diffuser (see figure B.3).

B.4 3D Display Comparison w.r.t. Depth Cues

All display types listed above can simulate all of the pictorial cues. Two-view displays without head tracking add stereopsis to the pictorial depth cues, and head-tracked two-view displays can simulate motion parallax. However, two-view displays typically require eyewear or head-tracking.


Multi-view displays create the perception of stereopsis and can simulate motion parallax without head-tracking or eyewear. However, motion parallax is typically segmented into discrete steps and is only horizontal. Building multi-view displays with a large number of views to overcome these problems remains technologically challenging. Light-field displays can provide continuous motion parallax and accommodation depth cues (besides stereopsis, convergence and pictorial depth cues). However, as described by Holliman et al. [24], volumetric displays remain a niche product, and computational holography remains experimental. In general, despite the fact that most stereoscopic binocular display systems have been manufactured for decades and some autostereoscopic systems have been available for 10-15 years, they are still mainly used in niche applications (further discussed in the following section).

B.5 3D Display Applications

B.5.1 Scientific and Medical Software

• Geospatial applications, in which 3D displays are used for terrain analysis, defence intelligence gathering, pairing of aerial and satellite imagery by photogrammetrists, and so on.
• Oil and gas applications, in which 3D displays help exploration geophysicists to visualise subterranean material density images, in order to make more accurate predictions of where petroleum reservoirs might be located.
• Molecular modelling, computational chemistry and crystallography visualisations. Since the structure of a particular molecule is determined by the spatial location of its molecular constituents, 3D displays can help to visualize spatial relationships between thousands of atoms in a given molecule, helping to determine its structure and function.
• Mechanical design, where 3D displays can help industrial designers, mechanical engineers and architects to design and showcase complex 3D models.
• Medical applications, in which magnetic resonance imaging (MRI), computed tomography (CT), ultrasound and other inherently volumetric images can be represented in 3D to help doctors make a more accurate and quicker judgement. Three-dimensional displays can also help in minimally invasive surgeries (MIS) to give surgeons a better understanding of depth and position when making critical movements.
• Training for complex operations, remote robot manipulation in dangerous environments, augmented and virtual reality applications, 3D teleconferencing and so on.

B.5.2 Gaming, Movie and Advertising Applications

In this application class, 3D displays have the advantage of novelty and increased user immersiveness over regular 2D displays.


Figure B.4: Free2C interactive kiosk (built for use at showrooms, shops, airports, etc.), which uses head-tracking to control a vertically aligned lenticular screen to overcome the fixed viewing-zone requirement.

Over the last few decades this advantage has been exploited by a large number of different 3D display systems manufactured for the purpose of advertising. An example of such a system (the Free2C interactive kiosk) is shown in figure B.4. Similarly, a number of recent developments show an increasing interest in 3D display technologies for movies and gaming. Examples given by Zhang et al. in [45] include Nvidia's release of the 3D Vision stereoscopic gaming kit (in 2008), containing liquid-crystal shutter glasses and a GeForce Stereoscopic 3D Driver (enabling 3D gaming on supported displays), an agreement between The Walt Disney Company and Pixar (made in April 2008) to make eight new 3D animated films over the next four years, and an announcement by DreamWorks Animation that it would release all its movies in 3D, starting in 2009.

Appendix C Computer Vision Methods (Additional Details)


C.1 Viola-Jones Face Detector

C.1.1 Weak Classifier Boosting using AdaBoost

AdaBoost combines a collection of simple classification functions into a stronger classifier through a number of rounds, where in each round:

• the best weak classifier (simple classification function) for the current training data is found, and
• lower/higher weights are assigned to correctly/incorrectly classified training examples.

The final strong classifier is obtained by taking a weighted linear combination of weak classifiers, where the weights assigned to individual weak hypotheses are inversely proportional to the number of classification errors that they make. These steps are illustrated in figure C.1, and precisely formalized in algorithm C.1.1.1. A number of properties of AdaBoost have been proven. Of particular interest is a generalized theorem of Freund and Schapire, by Schapire and Singer [37], which states that the training error of a strong classifier decreases exponentially in the number of rounds, i.e. the training error at round T is bounded by
\[
\frac{1}{N} \sum_{i=1}^{N} \mathbb{1}\!\left(C(x_i) \neq y_i\right) \;\leq\; \frac{1}{N} \sum_{i=1}^{N} \exp\!\left(-y_i f(x_i)\right), \tag{C.1}
\]
where N is the number of training examples and f(x) = \sum_{t=1}^{T} \alpha_t h_t(x).

AdaBoost is designed to minimize a quantity related to the overall classification error, but in the context of face detection this is not the optimal strategy. As discussed in section 2.4.1.4, it is more important to minimize the false negative rate than the false positive rate.

C.1.1.1 AsymBoost Modification

AsymBoost (asymmetric AdaBoost) is a variant of AdaBoost (presented by Viola and Jones in 2002 [42]) specifically designed to be used in classification tasks where the distribution of positive and negative training examples is highly skewed. The fix proposed in [42] is to adjust the training weights in each round by a multiplicative factor of
\[
\exp\!\left(\frac{1}{T}\, y_i \log k\right), \tag{C.2}
\]
where T is the number of rounds1, y_i is the class of the training example i and k is the penalty ratio between false negatives and false positives.


Figure C.1: A simplified illustration of the AdaBoost weak classifier boosting algorithm given in C.1.1.1. In this training sequence, three weak classifiers that minimize the classification error are selected; after selecting each classifier, the remaining training examples are reweighed (increasing/decreasing the weights of incorrectly/correctly classified examples respectively). After selecting all three classifiers, a weighted linear combination of their individual thresholds is taken, yielding a final strong classifier.


Algorithm C.1.1.1 Weak classifier boosting using AdaBoost. It requires N training examples given in the array A = ⟨(x1, y1), ..., (xN, yN)⟩ (where yi = 0 for a negative and yi = 1 for a positive training example) and uses T weak classifiers to construct a strong classifier. The result of the boosting is the final strong classifier h(x), which is a weighted linear combination of T hypotheses with the weights inversely proportional to the training errors.

AdaBoost(A, T)
  // Initialize the training weights (where m is the count of negative and l is the count
  // of positive training examples).
  for each training example (x_i, y_i) ∈ A
      if y_i = 0 then w_{1,i} ← 1/(2m) else w_{1,i} ← 1/(2l)
  for t ← 1 to T
      // 1. Normalize the weights:
      for each weight w_{t,i}
          w_{t,i} ← w_{t,i} / \sum_{j=1}^{N} w_{t,j}
      // 2. Select the best weak classifier h(x, f_t, p_t, θ_t) which minimizes the error
      //    ε_t = min_{f,p,θ} \sum_i w_{t,i} |h(x_i, f, p, θ) − y_i|:
      h_t(x) ← Find-Best-Weak-Classifier(w_t, A)
      // 3. Update the weights, with β_t = ε_t / (1 − ε_t):
      for each training example (x_i, y_i) ∈ A
          if h_t(x_i) = y_i then w_{t+1,i} ← w_{t,i} β_t else w_{t+1,i} ← w_{t,i}
  return h(x) = 1 if \sum_{t=1}^{T} α_t h_t(x) ≥ (1/2) \sum_{t=1}^{T} α_t, and 0 otherwise,
  where α_t = log((1 − ε_t)/ε_t).


C.1.2 Best Weak-Classifier Selection

The algorithm to efficiently find the best decision stump weak classifier is given in C.1.2.1. The asymptotic time cost to find the best weak classifier for a given training round is O(KN log N), where K is the number of features and N is the number of training examples.

Algorithm C.1.2.1 Selection of the best decision stump weak classifier. It requires an array of training examples A = ⟨(x1, y1), ..., (xN, yN)⟩, together with the training example weights w_t. This algorithm returns the best rectangle-feature-based decision stump classifier.

Find-Best-Weak-Classifier(w_t, A)
  Calculate T+, T− (total sums of positive/negative example weights).
  for each feature f
      for each training example (x_i, y_i) ∈ A
          v_i ← f(x_i)
      v.sort()
      for v_i ∈ v
          Maintain S_i+, S_i− (total sums of positive/negative weights below the current example).
          // Calculate the current error:
          ε_{f,i} = min{ S_i+ + (T− − S_i−), S_i− + (T+ − S_i+) }
          If ε_{f,i} is smaller than the previously known smallest error, remember the current threshold θ_f and parity p_f.
      Maintain the feature with the smallest error f_b and the associated threshold θ_b & parity p_b.
  return h(x, f_b, p_b, θ_b)
1 If the strong classifier obtained using AsymBoost is to be used in the attentional cascade (see section 2.4.1.4), the number of rounds required to train a particular strong classifier will be unknown in advance. In that case, it can be approximated using the round counts of the previous two layers: T_{i+2} = T_{i+1} + (T_{i+1} − T_i).


C.1.3 Cascade Training

The precise algorithm for building a cascaded Viola-Jones face detector is shown in listing C.1.3.1.

Algorithm C.1.3.1 Building a cascaded detector. It requires the maximum acceptable false positive rate per layer f, the minimum acceptable detection rate per layer d, the target overall false positive rate F_target, a set of positive training examples P and a set of negative training examples N. The algorithm returns a cascaded detector C(x).

Build-Cascade(f, d, F_target, P, N)
  C(x) ← ∅
  F_0 ← 1.0, D_0 ← 1.0
  i ← 0
  while F_i > F_target
      i ← i + 1
      n_i ← 0, F_i ← F_{i−1}
      while F_i > f · F_{i−1}
          n_i ← n_i + 1
          h_i(x) ← AdaBoost(N ∪ P, n_i)
          C(x) ← C(x) ∪ h_i(x)
          Evaluate the cascaded classifier on a validation set to determine F_i and D_i.
          Decrease the threshold for h_i(x) until the cascaded classifier has a detection rate of at least d · D_{i−1}.
      N ← ∅
      if F_i > F_target
          Evaluate C(x) on the set of non-face images and put any false detections into N (bootstrap negative images).
  return C(x)

C.1.3.1 Training Time Complexity

As briefly discussed in section 2.4.1.3, the asymptotic time cost to find the best decision-stump weak classifier is O(KN log N), where K is the number of features and N is the number of training examples. Then, the cost of training a single strong classifier is O(MKN log N), where M is the number of weak classifiers combined through boosting. Finally, the cost of training a detector cascade containing L strong classifiers is O(LMKN log N). To put the numbers into perspective, assume that it takes 10 milliseconds on average to evaluate a rectangle feature on 10,000 training images (1 µs/image). Then, training a cascade containing 25 strong classifiers, with a total of 4,000 decision stumps, selected from 160,000 features and trained on 10,000 training examples, would require over 74 days of continuous training (without considering the time it takes to select the best feature out of 160,000, and to bootstrap the false positive training images for each layer).
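As a sanity check on the 74-day figure, the dominant cost under these assumptions is simply (number of decision stumps) × (number of candidate features) × (time to evaluate one feature over all training images):
\[
4{,}000 \times 160{,}000 \times 10\,\text{ms} = 6.4 \times 10^{9}\,\text{ms} = 6.4 \times 10^{6}\,\text{s} \approx 74\ \text{days}.
\]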



C.2 CAMShift Face Tracker

C.2.1 Mean-Shift Technique

The CAMShift face tracker is based on the mean-shift technique, which is a non-parametric technique to climb the gradient of a given probability distribution to find the nearest dominant peak (mode). The precise details of this technique are summarized in algorithm C.2.1.1.

Algorithm C.2.1.1 Two-dimensional mean shift. It requires the probability distribution P, the initial location of the search window (x, y), the search window size s and the convergence threshold ε. It returns the location of the nearest dominant mode of the probability distribution P.

2D-Mean-Shift(P, x, y, s, ε)
  (x_c, y_c) ← (x, y)
  repeat
      (x_c', y_c') ← (x_c, y_c)
      // Find the zeroth moment of the search window
      M_00 ← \sum_{|Δx| ≤ s/2} \sum_{|Δy| ≤ s/2} P(x_c + Δx, y_c + Δy)
      // Find the first horizontal and vertical moments
      M_10 ← \sum_{|Δx| ≤ s/2} \sum_{|Δy| ≤ s/2} Δx · P(x_c + Δx, y_c + Δy)
      M_01 ← \sum_{|Δx| ≤ s/2} \sum_{|Δy| ≤ s/2} Δy · P(x_c + Δx, y_c + Δy)
      (x_c, y_c) ← (x_c + M_10/M_00, y_c + M_01/M_00)
  until Distance((x_c, y_c), (x_c', y_c')) < ε
  return (x_c, y_c)

C.2.2 Centroid and Search Window Size Calculation

Define the shorthand p(x, y) ≡ Pr(I(x, y) belongs to a face). Then the face centroid and the search window size can be calculated as follows.

1. Compute the zeroth moment:
\[
M_{00} = \sum_{x, y \in I_s} p(x, y), \tag{C.3}
\]
where I_s is the current search window.

2. Compute the first horizontal and vertical spatial moments:
\[
M_{10} = \sum_{x, y \in I_s} x\, p(x, y), \qquad
M_{01} = \sum_{x, y \in I_s} y\, p(x, y). \tag{C.4}
\]

3. The centroid location (x_c, y_c) is then given by
\[
(x_c, y_c) = \left( \frac{M_{10}}{M_{00}}, \frac{M_{01}}{M_{00}} \right). \tag{C.5}
\]

Similarly, the size s of the search window is set as
\[
s = 2\sqrt{M_{00}}. \tag{C.6}
\]

This expression is based on two observations: first of all, the zeroth moment represents the distribution area under the search window, hence, assuming a rectangular search window, its side length can be approximated as √M_00. Secondly, the goal of CAMShift is to track the whole object, hence the search window needs to be expansive. A factor of two ensures the growth of the search window so that the whole connected distribution area is spanned. Bradski also suggests that in practice the search window width and height for face tracking can be set to s and 1.2s respectively, to resemble the natural elliptical shape of the face.
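The centroid and window-size computation above amounts to three accumulations over the search window. The following is a minimal sketch under the definitions given here; the class/method names and the use of a dense probability array are illustrative assumptions, not the HT3D implementation.

using System;

static class CamShiftMoments
{
    // Computes the centroid (C.5) and search window size (C.6) for the face-probability
    // map p, restricted to the current search window [x0, x1] x [y0, y1].
    public static (double xc, double yc, double width, double height) NextWindow(
        double[,] p, int x0, int y0, int x1, int y1)
    {
        double m00 = 0, m10 = 0, m01 = 0;
        for (int x = x0; x <= x1; x++)
            for (int y = y0; y <= y1; y++)
            {
                m00 += p[x, y];         // zeroth moment (C.3)
                m10 += x * p[x, y];     // first horizontal moment (C.4)
                m01 += y * p[x, y];     // first vertical moment (C.4)
            }

        double xc = m10 / m00, yc = m01 / m00;   // centroid (C.5)
        double s = 2 * Math.Sqrt(m00);           // search window size (C.6)
        return (xc, yc, s, 1.2 * s);             // width s, height 1.2s (Bradski's suggestion)
    }
}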

C.3 ViBe Background Subtractor

C.3.1 Background Model Initialization

Background models used in the ViBe algorithm can be instantaneously initialized using only the first frame of the video sequence. Since no temporal information is present in a single frame, the main assumption made is that the neighbouring pixels share a similar temporal distribution. Under this assumption, the pixel models can be populated using the values found in the spatial neighbourhood of each pixel; based on the empirical observations by Barnich and Van Droogenbroeck, selecting samples from the 8-connected neighbourhood of each pixel has proven to be satisfactory for VGA resolution images.


This observation can be formalized in the following way. Let N_G(x) be a spatial neighbourhood of a pixel x; then
\[
\mathcal{M}^0(x) = \{\, v^0(y) \mid y \in N_G(x) \,\}, \tag{C.7}
\]
where the locations y ∈ N_G(x) are chosen randomly according to the uniform law, \mathcal{M}^t(x) is the model of pixel x at time t, and v^t(x) is the colour-space value of pixel x at time t.

C.3.2 Background Model Update

After a new pixel value v^t(x) is observed, the memoryless update policy dictates that the old, to-be-discarded sample is chosen randomly from \mathcal{M}^{t-1}(x), according to a uniform probability density function. This way, the probability that a sample which is present at time t will not be discarded at time t + 1 is (N − 1)/N, where N = |\mathcal{M}^t(x)|. Assuming time continuity and the absence of memory in the selection procedure, the probability that the sample in question will still be present after dt time units is
\[
\left(\frac{N-1}{N}\right)^{dt} = \exp\!\left(-dt \ln\frac{N}{N-1}\right), \tag{C.8}
\]
which is indeed an exponential decay. Since in many practical situations it is not necessary to update each background pixel model for each new frame, the time window covered by a pixel model of a fixed size can be extended using random time subsampling. In ViBe this is implemented by introducing a time subsampling factor φ: if a pixel x is classified as belonging to the background, its value v(x) is used to update its model \mathcal{M}(x) with probability 1/φ. Finally, based on the assumption that neighbouring background pixels share a similar temporal distribution, the neighbouring pixel models are stochastically updated when a new background sample of a pixel is taken. More precisely, given the 8-connected spatial neighbourhood N_G(x) of a pixel x, the model \mathcal{M}(y) of one of the neighbouring pixels y ∈ N_G(x) is updated (y is chosen randomly, with uniform probability). This approach allows a spatial diffusion of information using only samples classified exclusively as background, i.e. the background model is able to adapt to changing illumination or structure of the scene while retaining a conservative update2 policy.

2 A conservative update policy never includes a sample belonging to the foreground in the background model. [2]
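A compact sketch of the classification and update steps described in this section is given below, for grayscale pixel models; the class and constant names (VibeModel, Radius, MinMatches, Phi) are illustrative assumptions rather than the HT3D ViBeBackgroundSubtractor internals.

using System;

class VibeModel
{
    const int N = 20;          // samples per pixel model
    const int Radius = 20;     // sphere radius S_R in the (grayscale) colour space
    const int MinMatches = 2;  // matches required to classify a pixel as background
    const int Phi = 16;        // time subsampling factor (update probability 1/Phi)

    readonly Random rng = new Random();
    readonly byte[,,] samples; // [x, y, sampleIndex]
    readonly int width, height;

    public VibeModel(byte[,] firstFrame)
    {
        width = firstFrame.GetLength(0);
        height = firstFrame.GetLength(1);
        samples = new byte[width, height, N];
        for (int x = 0; x < width; x++)
            for (int y = 0; y < height; y++)
                for (int i = 0; i < N; i++)
                {
                    // Populate the model from the 8-connected spatial neighbourhood of (x, y).
                    int nx = Math.Min(width - 1, Math.Max(0, x + rng.Next(-1, 2)));
                    int ny = Math.Min(height - 1, Math.Max(0, y + rng.Next(-1, 2)));
                    samples[x, y, i] = firstFrame[nx, ny];
                }
    }

    public bool IsBackground(int x, int y, byte v)
    {
        int matches = 0;
        for (int i = 0; i < N && matches < MinMatches; i++)
            if (Math.Abs(samples[x, y, i] - v) < Radius) matches++;
        return matches >= MinMatches;
    }

    public void Update(int x, int y, byte v)
    {
        // Memoryless update: with probability 1/Phi, overwrite a randomly chosen sample.
        if (rng.Next(Phi) == 0)
            samples[x, y, rng.Next(N)] = v;

        // Spatial diffusion: with probability 1/Phi, also update a random 8-connected neighbour.
        if (rng.Next(Phi) == 0)
        {
            int nx = Math.Min(width - 1, Math.Max(0, x + rng.Next(-1, 2)));
            int ny = Math.Min(height - 1, Math.Max(0, y + rng.Next(-1, 2)));
            samples[nx, ny, rng.Next(N)] = v;
        }
    }
}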

Appendix D Depth-Based Methods (Additional Details)


D.1 Depth Data Preprocessing

D.1.1 Depth Shadow Elimination

In order to obtain the depth values in the frame, Kinect uses infrared light to project a reference dot pattern onto the scene, which is then captured using an infrared camera. Since these images are not equivalent, due to the horizontal distance between the projector and the camera, stereo triangulation can be used to calculate the depth after the correspondence problem is solved. However, this leads to depth shadowing (see the example in figure D.1). Since the infrared projector is placed 2.5 cm to the right of the infrared camera, depth shadows of the concave objects always appear on the left side if the sensor is placed on a flat horizontal surface. This suggests a straightforward depth shadow elimination technique for head tracking (as heads are indeed concave):

1. Process the depth images one horizontal line at a time, from left to right.
2. If an unknown depth value is reported by Kinect, replace it with the last known depth value.

An example of a depth shadow removed using this technique is shown in figure 2.11.
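A minimal sketch of this scan-line fill, assuming a row-major depth frame where 0 marks an unknown depth reading; the method name is illustrative.

// Replaces unknown (0) depth values with the last known value on the same scan line,
// processing each row from left to right as described above.
static void FillDepthShadows(short[] depth, int width, int height)
{
    for (int y = 0; y < height; y++)
    {
        short lastKnown = 0;
        for (int x = 0; x < width; x++)
        {
            int i = y * width + x;
            if (depth[i] == 0)
                depth[i] = lastKnown;
            else
                lastKnown = depth[i];
        }
    }
}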

D.1.2 Real-Time Depth Image Smoothing

The noise in the depth calculation can arise from the inaccurate measure of disparities within the correlation algorithm, the limited resolutions of the Kinect infrared camera and projector, external infrared radiation (e.g. sunlight), object surface properties (especially high specularity), and so on. In the detection method below, every local minimum on a horizontal scan line will be treated as a point which potentially lies on the vertical head axis (a hypothesis which will be confirmed or refuted using certain prior knowledge about human head sizes). Since finding a local minimum basically involves discrete differentiation, such a method is very prone to noise. A solution proposed in [17] is to smooth the depth image in real time using the integral image filter from the Viola and Jones face detection algorithm. As further described in section 2.4.1, the integral image can be calculated in linear time using a dynamic programming approach; a smoothed depth value I_r(x, y) of the pixel at coordinates (x, y) can be obtained by calculating
\[
I_r(x, y) = \frac{I(x-r, y-r) - I(x+r, y-r) - I(x-r, y+r) + I(x+r, y+r)}{(2r+1)^2}, \tag{D.1}
\]
where I is the integral image and r is half the side length of the averaging rectangle (the averaging window is (2r+1) × (2r+1) px). The result of smoothing using different averaging rectangle sizes is shown in figure 2.10.
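A minimal sketch of this smoothing step, assuming a 2D short depth array and ignoring the r-pixel border for brevity; all names are illustrative.

static class DepthSmoothing
{
    // integral[x, y] holds the sum of depth[0..x-1, 0..y-1], computed once per frame.
    static long[,] BuildIntegralImage(short[,] depth, int width, int height)
    {
        var integral = new long[width + 1, height + 1];
        for (int y = 1; y <= height; y++)
            for (int x = 1; x <= width; x++)
                integral[x, y] = depth[x - 1, y - 1]
                               + integral[x - 1, y] + integral[x, y - 1] - integral[x - 1, y - 1];
        return integral;
    }

    // Average of the (2r+1) x (2r+1) window centred at (x, y), as in equation (D.1);
    // valid for r <= x < width - r and r <= y < height - r.
    static int SmoothedDepth(long[,] integral, int x, int y, int r)
    {
        long sum = integral[x + r + 1, y + r + 1] - integral[x - r, y + r + 1]
                 - integral[x + r + 1, y - r] + integral[x - r, y - r];
        return (int)(sum / ((2 * r + 1) * (2 * r + 1)));
    }
}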


Figure D.1: Kinect depth shadowing. The light blue polygon shows the area which is visible from the IR camera point of view, the light red polygon shows the region where the IR pattern is projected. Thicker blue lines indicate the areas on the objects that are visible to the IR camera, thicker red lines indicate the areas on the objects that have the IR pattern projected on them.


D.2 Depth Cue Rendering

This subsection describes two algorithms which are used to render various depth cues, as described in the main project aims (section 1.5). More precisely, a generalized perspective projection [30] is used to simulate motion parallax when the viewer's head position is known, and the Z-Pass algorithm [21] is used to simulate the pictorial shadow depth cue. More details on both of these algorithms are given below.

D.2.1 Generalized Perspective Projection

A generalized perspective projection (as described by Kooima in [30]) is used to simulate the motion parallax, occlusion, relative height, relative size, relative density and perspective convergence depth cues. The generalized perspective projection matrix G can be derived as follows. Let p_a, p_b, p_c be three corners of the screen as shown in figure D.2. Then the screen-local axes v_r, v_u and v_n that give the orthonormal basis for describing points relative to the screen can be calculated using
\[
v_r = \frac{p_b - p_a}{\lVert p_b - p_a \rVert}, \qquad
v_u = \frac{p_c - p_a}{\lVert p_c - p_a \rVert}, \qquad
v_n = \frac{v_r \times v_u}{\lVert v_r \times v_u \rVert}. \tag{D.2}
\]


Figure D.2: Screen definition for the generalized perspective projection. Viewer-space points p_a, p_b, p_c give three corners of the screen, point p_e gives the position of the viewer's eye, screen-local axes v_r, v_u and v_n give the orthonormal basis for describing points relative to the screen, non-unit vectors v_a, v_b and v_c span from the eye position to the screen corners, and distances from the screen-space origin l, r, t, b give the left/right/top/bottom extents respectively of the perspective projection.
If the viewer's position changes such that the head is no longer positioned at the center of the screen, the frustum becomes asymmetric. The frustum extents l, r, b, t can be calculated as follows. Let
\[
v_a = p_a - p_e, \qquad v_b = p_b - p_e, \qquad v_c = p_c - p_e, \tag{D.3}
\]
where p_e is the position of the viewer in world coordinates. Then the distance from the viewer to the screen-space origin is d = -(v_n \cdot v_a). Given this information, the frustum extents can be computed using
\[
l = (v_r \cdot v_a)\,\frac{n}{d}, \qquad
r = (v_r \cdot v_b)\,\frac{n}{d}, \qquad
b = (v_u \cdot v_a)\,\frac{n}{d}, \qquad
t = (v_u \cdot v_c)\,\frac{n}{d}. \tag{D.4}
\]


Let n, f be the distances of the near and far clipping planes respectively. Then the 3D perspective projection matrix P (which maps from a truncated pyramid frustum to a cube) is
\[
P = \begin{pmatrix}
\frac{2n}{r-l} & 0 & \frac{r+l}{r-l} & 0 \\
0 & \frac{2n}{t-b} & \frac{t+b}{t-b} & 0 \\
0 & 0 & -\frac{f+n}{f-n} & -\frac{2fn}{f-n} \\
0 & 0 & -1 & 0
\end{pmatrix}. \tag{D.5}
\]

The base of the viewing frustum would always lie in the XY-plane in world coordinates. To enable arbitrary positioning of the frustum, two additional matrices are needed:
\[
M_T = \begin{pmatrix}
v_{r,x} & v_{r,y} & v_{r,z} & 0 \\
v_{u,x} & v_{u,y} & v_{u,z} & 0 \\
v_{n,x} & v_{n,y} & v_{n,z} & 0 \\
0 & 0 & 0 & 1
\end{pmatrix}, \tag{D.6}
\]
which transforms points lying in the plane of the screen to lie in the XY-plane (so that the perspective projection can be applied), and
\[
T = \begin{pmatrix}
1 & 0 & 0 & -p_{e,x} \\
0 & 1 & 0 & -p_{e,y} \\
0 & 0 & 1 & -p_{e,z} \\
0 & 0 & 0 & 1
\end{pmatrix}, \tag{D.7}
\]

which translates the tracked head location to the apex of the frustum. Finally, note that the composition of linear transformations in homogeneous coordinates corresponds to the product of the matrices that describe these transformations. This way, the overall generalized perspective projection G (which produces a correct off-axis projection given constant screen corner coordinates p_a, p_b, p_c and a varying head position p_e) can be calculated by taking a product of the three matrices described above, i.e.
\[
G = P\, M_T\, T. \tag{D.8}
\]
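To make the matrix assembly concrete, the following is a minimal sketch of equations (D.2)-(D.8) in plain C#, deliberately using simple double arrays rather than any particular graphics library's matrix type (whose storage conventions differ). The corner convention (p_a lower-left, p_b lower-right, p_c upper-left) follows Kooima [30]; all names are illustrative assumptions.

using System;

static class GeneralizedProjection
{
    static double[] Sub(double[] a, double[] b) => new[] { a[0] - b[0], a[1] - b[1], a[2] - b[2] };
    static double Dot(double[] a, double[] b) => a[0] * b[0] + a[1] * b[1] + a[2] * b[2];
    static double[] Cross(double[] a, double[] b) =>
        new[] { a[1] * b[2] - a[2] * b[1], a[2] * b[0] - a[0] * b[2], a[0] * b[1] - a[1] * b[0] };
    static double[] Normalize(double[] a)
    { var l = Math.Sqrt(Dot(a, a)); return new[] { a[0] / l, a[1] / l, a[2] / l }; }

    static double[,] Mul(double[,] a, double[,] b)
    {
        var c = new double[4, 4];
        for (int i = 0; i < 4; i++)
            for (int j = 0; j < 4; j++)
                for (int k = 0; k < 4; k++)
                    c[i, j] += a[i, k] * b[k, j];
        return c;
    }

    public static double[,] Compute(double[] pa, double[] pb, double[] pc, double[] pe, double n, double f)
    {
        // Screen-local orthonormal basis (D.2).
        var vr = Normalize(Sub(pb, pa));
        var vu = Normalize(Sub(pc, pa));
        var vn = Normalize(Cross(vr, vu));

        // Vectors from the eye to the screen corners (D.3) and frustum extents (D.4).
        var va = Sub(pa, pe); var vb = Sub(pb, pe); var vc = Sub(pc, pe);
        double d = -Dot(vn, va);
        double l = Dot(vr, va) * n / d, r = Dot(vr, vb) * n / d;
        double b = Dot(vu, va) * n / d, t = Dot(vu, vc) * n / d;

        // Off-axis perspective projection (D.5).
        var P = new double[4, 4] {
            { 2 * n / (r - l), 0, (r + l) / (r - l), 0 },
            { 0, 2 * n / (t - b), (t + b) / (t - b), 0 },
            { 0, 0, -(f + n) / (f - n), -2 * f * n / (f - n) },
            { 0, 0, -1, 0 } };

        // Rotation into screen space (D.6) and translation of the eye to the origin (D.7).
        var M = new double[4, 4] {
            { vr[0], vr[1], vr[2], 0 }, { vu[0], vu[1], vu[2], 0 },
            { vn[0], vn[1], vn[2], 0 }, { 0, 0, 0, 1 } };
        var T = new double[4, 4] {
            { 1, 0, 0, -pe[0] }, { 0, 1, 0, -pe[1] }, { 0, 0, 1, -pe[2] }, { 0, 0, 0, 1 } };

        // G = P * M_T * T (D.8).
        return Mul(Mul(P, M), T);
    }
}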

D.2.2 Real-Time Shadows using the Z-Pass Algorithm with Stencil Buffers

As discussed in section A.2.1, cast shadows are very important for the human perception of the 3D world. In particular, shadows play an important role in understanding the position, size and geometry of the light-occluding object, as well as the geometry of the objects on which the shadow is being cast. The first hardware-accelerated algorithm that uses stencil buffers and shadow volumes to render shadows in real time was described by Heidmann in 1991 [21]. His technique uses the following two-step process:


Figure D.3: Shadow volume of a triangle (white polygon) lit by a single point-light source. Any point inside this volume is in the shadow, everything outside is lit by the light.
1. The scene is rendered as if it was completely in the shadow (e.g. using only ambient lighting).
2. Shadow volumes are calculated for each face and the stencil buffer is updated to mask the areas within the shadow volumes; then, for each light source, the scene is rendered as if it was completely lit, using the stencil buffer mask.

D.2.2.1 Shadow Volumes

Shadow volumes were first proposed by Crow [12] in 1977. A shadow volume is defined by the object-space tessellations of the boundaries of the regions of space occluded from the light source [12]. To understand how a shadow volume can be constructed, without loss of generality consider a triangle lit by a single point-light source. Projecting rays from the light source through each of the vertices of the triangle to points at infinity will form a shadow volume. Any point inside that volume is hidden from the light source (i.e. it is in the shadow), everything outside is lit by the light (see figure D.3).

D.2.2.2 Z-Pass Shadow Algorithm

After calculating the shadow volumes, the locations in the scene where the shadows should be rendered can be found in the following way:

1. For every pixel, project a ray from the viewpoint to the object visible at that pixel.
2. Follow this ray, counting the number of times some shadow volume is entered and left. For every pixel, subtract the number of times a shadow volume is left from the number of times a shadow volume is entered.
3. If this count is greater than zero when the object is reached, more shadow volumes have been entered than left, therefore that pixel of the object must be in the shadow.

See figure D.4 for an illustration.


Figure D.4: Z-Pass algorithm. The blue polygon represents the viewing area from the camera point of view, grey polygons represent the shadow volumes, blue points indicate the entries to the shadow volumes, red points indicate the exits. Numbers above the blue/red points indicate the operation that is being performed on the stencil buffer; if more shadow volumes have been entered than left (i.e. the value present in the stencil buffer is greater than zero) then the pixel in question is in the shadow.

D.2.2.3 Stencil Buffer Implementation

A stencil buffer is an integer per-pixel buffer (additional to the colour and depth buffers) found in modern graphics cards; it is typically used to limit the area of rendering. An interesting application of the stencil buffer in real-time shadow rendering arises from the strong connection between the depth and stencil buffers in the rendering pipeline. Since the values in the stencil buffer can be incremented/decremented every time a pixel passes or fails the depth test, the following implementation of the Z-Pass shadow algorithm (as described in [21]) becomes feasible:

1. Initialize the stencil buffer to zero; render the scene with the lighting disabled. Amongst other things, this will load the depth buffer with the depth values of the visible objects in the scene.
2. Enable back-face culling, set the stencil operation to increment on depth-test pass and render the shadow volumes without writing the rendering result into the colour and depth buffers. This will count the number of entries into the shadow volumes, as described above.


3. Enable front-face culling and set the stencil operation to decrement on depth-test pass. Again, render the shadow volumes without storing the render in the colour and depth buffers. In this case, each pixel value in the stencil buffer will be decremented when the ray leaves some shadow volume.

As described in section D.2.2.2, only the pixels that have a stencil buffer value of zero should be lit, as they are the ones that lie outside the shadow volumes. Using the zero values as a mask in the stencil buffer and rendering the scene with the lighting enabled will correctly overwrite the previously shadowed pixels with the lit ones.
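The pass structure above maps directly onto standard OpenGL stencil state. The following sketch is written against OpenTK-style OpenGL bindings (OpenTK [34] is referenced by the project); the exact enum and overload names are assumptions from the OpenTK wrapper, and the render delegates are placeholders rather than actual Z-Tris methods.

using System;
using OpenTK.Graphics.OpenGL;

static class ZPassShadows
{
    public static void Render(Action renderSceneAmbient, Action renderShadowVolumes, Action renderSceneLit)
    {
        GL.ClearStencil(0);
        GL.Clear(ClearBufferMask.ColorBufferBit | ClearBufferMask.DepthBufferBit | ClearBufferMask.StencilBufferBit);
        GL.Enable(EnableCap.DepthTest);

        // Pass 1: render the scene as if completely in shadow; this also fills the depth buffer.
        renderSceneAmbient();

        // Passes 2-3: rasterize the shadow volumes into the stencil buffer only.
        GL.Enable(EnableCap.StencilTest);
        GL.Enable(EnableCap.CullFace);
        GL.ColorMask(false, false, false, false);
        GL.DepthMask(false);
        GL.StencilFunc(StencilFunction.Always, 0, 0xFF);

        GL.CullFace(CullFaceMode.Back);                                // front faces only:
        GL.StencilOp(StencilOp.Keep, StencilOp.Keep, StencilOp.Incr);  // count shadow-volume entries
        renderShadowVolumes();

        GL.CullFace(CullFaceMode.Front);                               // back faces only:
        GL.StencilOp(StencilOp.Keep, StencilOp.Keep, StencilOp.Decr);  // count shadow-volume exits
        renderShadowVolumes();

        // Final pass: re-render with lighting only where the stencil count is zero (outside shadow).
        GL.ColorMask(true, true, true, true);
        GL.DepthMask(true);
        GL.Disable(EnableCap.CullFace);
        GL.DepthFunc(DepthFunction.Lequal);
        GL.StencilFunc(StencilFunction.Equal, 0, 0xFF);
        GL.StencilOp(StencilOp.Keep, StencilOp.Keep, StencilOp.Keep);
        renderSceneLit();
        GL.Disable(EnableCap.StencilTest);
    }
}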

Appendix E Implementation (Additional Details)


E.1 Viola-Jones Distributed Training Framework

The main classes of the Viola-Jones distributed training framework are shown in figures E.1 and E.2 below. The main responsibilities of these classes are summarized in table E.1.

E.2 HT3D Library

E.2.1 Head Tracker Core

A UML 2.0 class diagram of the HT3D library core is shown in figure E.3, and the responsibilities of individual classes are summarized in table E.2.

E.2.2 Colour- and Depth-Based Background Subtractors

The class diagram for the colour- and depth-based background subtractors is shown in figure 3.13. The ViBeBackgroundSubtractor, EuclideanBackgroundSubtractor and DepthBackgroundSubtractor classes have the shared responsibility to distinguish the moving objects (foreground) from the static parts of the scene (background). Further implementation details of these classes are given below.

E.2.2.1 ViBe Background Subtractor

In ViBeBackgroundSubtractor, the background models of pixels obtained from the 8-bit grayscale input bitmaps are internally represented as a three-dimensional byte array, where the first two dimensions represent the pixel coordinates in the image and the third dimension serves as an index into the model of that pixel. The background models are updated over time following the theory given in section C.3.2. The background sensitivity of the ViBe background subtractor is defined as the radius of the hypersphere S_R in the colour space, as shown in figure 2.8.

E.2.2.2 Euclidean Background Subtractor

The background models in EuclideanBackgroundSubtractor are built under the assumption that, at the moment of the background subtractor initialization, only background objects are present in the frame. The subsequent frames can then be segmented into foreground and background by inspecting which individual pixels differ from the initial frame by more than the background subtractor sensitivity threshold. More precisely, if I_f is the initial 8-bit grayscale input frame, I_c is the current frame being segmented and θ is the background subtractor sensitivity threshold, then a pixel (x, y) is classified as part of the background by EuclideanBackgroundSubtractor if
\[
|I_f(x, y) - I_c(x, y)| < \theta. \tag{E.1}
\]
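A minimal sketch of this initial-frame differencing, using the definitions above; the method name and the plain byte-array frames are illustrative, not the EuclideanBackgroundSubtractor internals.

using System;

static class EuclideanSubtraction
{
    // Classifies each pixel of the current frame against the initial (background-only) frame,
    // as in equation (E.1); returns true where the pixel is foreground.
    static bool[] SegmentForeground(byte[] initialFrame, byte[] currentFrame, int threshold)
    {
        var foreground = new bool[currentFrame.Length];
        for (int i = 0; i < currentFrame.Length; i++)
            foreground[i] = Math.Abs(initialFrame[i] - currentFrame[i]) >= threshold;
        return foreground;
    }
}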


Figure E.1: UML 2.0 class diagram of the Viola-Jones distributed training framework architecture
(part #1 of 2).


Figure E.2: UML 2.0 class diagram of the Viola-Jones distributed training framework architecture
(part #2 of 2).


Class: Responsibilities

ViolaJonesTrainer: Serves as an entry point to the program (includes input parameter parsing and server/client protocol set-up); provides core shared training server and client functionality (e.g. multi-threaded best local rectangle feature search); implements training state preservation and restore.

ViolaJonesTrainerServer: Manages connections with clients; provides means of data serialization to XML (e.g. detector cascade) or to compressed binary format (e.g. integral images, training weights); handles data transfer to clients over TCP/IP and CIFS.

ViolaJonesTrainerClient: Implements connection and data exchange client-end, data deserialization and other client-specific functionality.

RectangleFeature: Provides efficient means to generate, store and evaluate rectangle features.

DecisionStumpWeakClassifier: Implements Find-Best-Weak-Classifier algorithm (given in C.1.2.1) as part of the IWeakLearner interface.

StrongLearner: Encapsulates a collection of rectangle features obtained using AsymBoost into a strong learner (representing a single layer in the cascade).

StrongLearnerCascade: Encapsulates a collection of trained strong learners into a detector cascade.

NegativeTrainingImage: Stores large resolution negative training images; implements False-Positive-Training-Image-Bootstrapping algorithm given in 3.5.3.1.

NormalizedTrainingImage: Stores normalized training images (both negative and positive) in the detector resolution scale.

Utilities: Provides helper functions (e.g. conversion between different image formats); implements various workarounds to prevent PWF machines from logging-off after a certain period of inactivity, as well as mouse and keyboard software locks, to prevent other users from accidentally shutting down training clients (see figure 3.3 for a picture of training machines in action).

SynchronizedList: Implements a thread-safe, synchronized generic item list (e.g. used in storing bootstrapped false positive training images which are simultaneously sent by a number of clients).

Log: Handles output logging to hard drive in a thread-safe manner.

Table E.1: Responsibilities of individual classes in the Viola-Jones distributed training framework.


Figure E.3: UML 2.0 class diagram of the HT3D library core.


Class: Responsibilities

HeadTracker: Sets up the tracking environment: i) deserializes Viola-Jones face detector cascade from the training framework output XML file, ii) sets up Kinect SDK (registers for DepthFrameReady and VideoFrameReady events, opens depth² and colour³ byte streams), iii) initializes face/head detection and tracking components. Orchestrates inputs and outputs from face/head detection and tracking components: i) aligns colour and depth images using a calibration procedure provided by Kinect SDK (which uses a proprietary camera model developed by the manufacturer of the Kinect sensor, PrimeSense)¹, ii) maintains the head tracking state of depth and colour trackers (using HeadTrackerState enumeration), iii) prepares input data for individual tracking components (e.g. converting colour bitmaps to grayscale, or combining input colour bitmaps with background/foreground segmentation information), iv) invokes tracking components as required and combines their outputs, and v) passes the tracking output to HeadTrackFrameReady event subscribers via instances of the event arguments class HeadTrackFrameReadyEventArgs.

StatisticsHandler: Provides means to record aligned colour and depth frames (as a stream of 320 × 240 px bitmap images), together with the output from the head/face trackers (serialized into an XML file as a list of FaceCenterFramePair objects); allows recording and playback of the raw colour and depth frame data (one-dimensional byte arrays provided by Kinect SDK); gathers statistics about the face/head detection and tracking speeds.

Utilities: Provides functionality to convert data between different formats (e.g. representing depth values as colours, or converting input bitmap to the HSV byte array) and various methods that simplify bitmap manipulation (e.g. resizing, conversion to grayscale, etc.).

Table E.2: Responsibilities of individual classes in the HT3D library core.

¹ The process of colour and depth image alignment is necessary since the IR and RGB cameras have different intrinsics and extrinsics (due to the physical separation). As proposed by Herrera et al. [22], the intrinsics can be modelled using a pinhole camera model with radial and tangential distortion corrections and the extrinsics can be modelled using a rigid transformation, consisting of a rotation and a translation. After the alignment, colour data is represented as 32-bit, 320 × 240 px bitmap images and depth data is represented as two-dimensional (320 × 240) short arrays, where each item in the array represents the distance of the depth pixel from the Kinect sensor in millimetres.
² 320 × 240 px, 30 Hz. (While the Kinect sensor supports 640 × 480 px depth output, 320 × 240 px is the highest resolution compatible with the colour and depth image alignment API.)
³ 640 × 480 px, 30 Hz.


which individual pixels differ from the initial frame by more than the background subtractor sensitivity threshold. More precisely, if I_f is the initial 8-bit grayscale input frame, I_c is the current frame being segmented and \tau is the background subtractor sensitivity threshold, then a pixel (x, y) is classified as part of the background by EuclideanBackgroundSubtractor if

    |I_f(x, y) - I_c(x, y)| < \tau.    (E.1)
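A minimal sketch of this segmentation rule is shown below (illustrative names; the real class keeps additional state such as the stored initial frame):

using System;

// Sketch of the thresholded frame-differencing rule from equation (E.1); names are illustrative.
public static class EuclideanSegmentationSketch
{
    // initialFrame and currentFrame are 8-bit grayscale images stored row-major as [y, x].
    // Returns a boolean mask where true marks background pixels.
    public static bool[,] Segment(byte[,] initialFrame, byte[,] currentFrame, int threshold)
    {
        int height = initialFrame.GetLength(0);
        int width = initialFrame.GetLength(1);
        var isBackground = new bool[height, width];

        for (int y = 0; y < height; y++)
        {
            for (int x = 0; x < width; x++)
            {
                // |I_f(x, y) - I_c(x, y)| < threshold  =>  background
                isBackground[y, x] = Math.Abs(initialFrame[y, x] - currentFrame[y, x]) < threshold;
            }
        }
        return isBackground;
    }
}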

E.2.2.3 Depth-Based Background Subtractor

While the depth-based background subtractor inherits from the same base BackgroundSubtractor class as colour-based background subtractors (see figure 3.13), it serves a slightly different purpose in the head-tracking pipeline. The main responsibility of the DepthBackgroundSubtractor class is to increase the speed and the accuracy of the colour-based face detector and tracker, using the information provided by the DepthHeadDetectorAndTracker. In particular, as long as the depth-based tracker is accurately locked onto the viewer's head (i.e. if the depth tracker state maintained in HeadTracker is equal to HeadTrackerState.TRACKING), all pixels that are further away from the Kinect sensor than the detected head center are classified as background. An illustration of this process is shown in figure 3.14.
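A sketch of this rule is shown below (illustrative names; the depth frame is assumed to be the aligned two-dimensional array of millimetre distances described earlier, and the head-center depth is assumed to come from the depth-based tracker; the optional margin is an extra parameter not present in the description above):

// Sketch of the depth-based background rule; all names are illustrative.
public static class DepthBackgroundSketch
{
    // depthMm holds per-pixel distances from the Kinect sensor in millimetres ([y, x]);
    // headCenterDepthMm is the estimated distance of the tracked head center;
    // marginMm (optional, assumption) can push the cut-off plane slightly behind the head.
    public static bool[,] BackgroundMask(short[,] depthMm, short headCenterDepthMm, int marginMm = 0)
    {
        int height = depthMm.GetLength(0);
        int width = depthMm.GetLength(1);
        var isBackground = new bool[height, width];

        for (int y = 0; y < height; y++)
        {
            for (int x = 0; x < width; x++)
            {
                // Everything further away than the head center is classified as background.
                isBackground[y, x] = depthMm[y, x] > headCenterDepthMm + marginMm;
            }
        }
        return isBackground;
    }
}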

E.3 3D Display Simulator Components

As described in section 3.7 and illustrated in figure 3.17, the 3D display simulator consists of two small UI modules (3D Simulation Entry Point and Head Tracker Configuration), and a larger model-view-controller-based module (Z-Tris). Both UI module implementations and Z-Tris M-V-C architectural units are briefly described below.

E.3.1 Application Entry Point

When the application is initialized, the Program class launches the MainForm (shown in figure 3.20). The main form is responsible for launching the configuration form, showing help or launching the game form.

E.3.2 Head Tracker Configuration GUI

The ConfigurationForm class handles the communication with the HT3D library DLL. Through the user interface (as shown in figure 3.21), all available head-tracking tweaking options are exposed.


All user preferences are saved by the PreferencesHandler when the configuration form is closed, and restored when the form is reopened. PreferencesHandler achieves this functionality by recursively walking through the configuration form's component tree and storing/reading the values of checkboxes, sliders and combo-boxes to/from a special XML file. Finally, a DoubleBufferedPanel class is implemented to remove the flicker-on-repaint artifacts when rendering the output from the head-tracking library (it extends the WinForms Panel component to enable the double-buffering functionality).
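The recursive walk could be sketched as follows (a simplified illustration rather than the actual PreferencesHandler code; it assumes the WinForms controls are identified by their Name property and that the collected values are later serialized to the XML file):

using System.Collections.Generic;
using System.Windows.Forms;

// Simplified sketch of a recursive component-tree walk that collects control values;
// the real PreferencesHandler also restores them and serializes the dictionary to XML.
public static class PreferencesWalkSketch
{
    public static void CollectValues(Control root, IDictionary<string, string> values)
    {
        foreach (Control child in root.Controls)
        {
            CheckBox checkBox = child as CheckBox;
            TrackBar slider = child as TrackBar;
            ComboBox comboBox = child as ComboBox;

            if (checkBox != null) values[checkBox.Name] = checkBox.Checked.ToString();
            else if (slider != null) values[slider.Name] = slider.Value.ToString();
            else if (comboBox != null) values[comboBox.Name] = comboBox.SelectedIndex.ToString();

            // Recurse into nested containers (panels, group boxes, etc.).
            CollectValues(child, values);
        }
    }
}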

E.3.3 3D Game (Z-Tris)

Figure E.4 shows the model-view-controller architectural grouping of the Z-Tris game classes. Each of the M-V-C architectural units is discussed in more detail below.

E.3.3.1 Model

The main responsibility of the LogicHandler class is to maintain and update:
- the status of the pit (represented as a three-dimensional byte array),
- the status of the active (falling) polycube,
- the scores/line count/current level.

The status of the pit/active polycube is updated either on the user's key press (notified by the KeyboardHandler controller), or when the time for the current move expires (notified by the internal timer). At the end of the move, LogicHandler updates the score s using the following formula:

    s \leftarrow s + (\text{line count}) \cdot (\text{line score}) \cdot f_{\text{line count}} \cdot f_{\text{level}} + b_{\text{empty pit}} \cdot (\text{line score}) \cdot f_{\text{level}},    (E.2)

where f_{line count} and f_{level} are the multiplicative factors which increase with the number of layers and the number of levels (since the time allowance for each move decreases with increasing levels), and b_{empty pit} is equal to 1 if the pit is empty and 0 otherwise. After a move is finished, a random polycube (represented as a 3 × 3 byte array in the Polycube.Shapes dictionary) is added to the pit if it is not already full; otherwise, the game status is changed to LogicHandler.Status.GAME_OVER. Both the model and the view are highly customizable (i.e. they can correctly process and render different pit sizes, polycube shape sets, timing constraints, scoring systems and so on).
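As an illustration, the score update of equation (E.2) can be written as the following sketch (identifier names are illustrative; the factors and the base line score are configuration values in the actual LogicHandler):

// Sketch of the score update from equation (E.2); all identifiers are illustrative.
public static class ScoringSketch
{
    public static long UpdateScore(
        long score,          // current score s
        int lineCount,       // number of completed layers in this move
        int lineScore,       // base score awarded per completed layer
        double lineFactor,   // f_{line count}: grows with the number of layers cleared at once
        double levelFactor,  // f_{level}: grows with the current level
        bool pitIsEmpty)     // b_{empty pit}: bonus flag, true if the pit is empty after the move
    {
        score += (long)(lineCount * lineScore * lineFactor * levelFactor);
        if (pitIsEmpty)
        {
            score += (long)(lineScore * levelFactor);
        }
        return score;
    }
}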

E.3.3.2 Controller

The KeyboardHandler class (full code listing given in appendix H.1) is responsible for interfacing between the user and the game logic. It operates using the following protocol:


Figure E.4: UML 2.0 class diagram of the Z-Tris game, grouped into the model-view-controller
architectural units.


1. A keyboard key code is registered through a call to KeyboardHandler.RegisterKey(...) and an event handler (callback function) is registered with the KeyboardHandler.OnKeyPress event.
2. KeyboardHandler monitors the state of the keyboard and, given that one of the registered keys was pressed, it notifies the appropriate OnKeyPress event subscriber(s).
3. If a key is not released, it repeatedly triggers OnKeyPress events according to the timing diagram shown in figure E.5.

When the user presses and holds a key on the keyboard, the first OnKeyPress event is triggered immediately, the second is triggered after INITIAL_KEY_HOLD_DELAY_MS milliseconds and all the following events are triggered after REPEATED_KEY_HOLD_DELAY_MS milliseconds. This class is capable of handling multiple key events simultaneously (as required for the control of the game), multiple keyboard event subscribers and customization of timing constraints.

Figure E.5: KeyboardHandler class event timing diagram.
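A typical use of this protocol, based on the KeyboardHandler interface shown in listing H.1, could look as follows (the surrounding game-loop context is only a sketch):

using OpenTK.Input;

// Usage sketch for KeyboardHandler (see listing H.1); the game-loop context is illustrative.
public class KeyboardHandlerUsageSketch
{
    private readonly KeyboardHandler _keyboardHandler;

    public KeyboardHandlerUsageSketch(IKeyboardDevice keyboard)
    {
        _keyboardHandler = new KeyboardHandler(keyboard);

        // Register the keys this component is interested in.
        _keyboardHandler.RegisterKeys(new[] { Key.Left, Key.Right, Key.Space }, this);

        // Subscribe to key-press notifications.
        _keyboardHandler.OnKeyPress += this.HandleKeyPress;
    }

    // Called once per frame from the game loop.
    public void Update()
    {
        _keyboardHandler.UpdateStatus();
    }

    private void HandleKeyPress(Key key)
    {
        if (key == Key.Left)  { /* move the active polycube left */ }
        if (key == Key.Right) { /* move the active polycube right */ }
        if (key == Key.Space) { /* drop the active polycube */ }
    }
}

Note that the handler is an instance method of the registered subscriber: KeyboardHandler dispatches an event only to those delegates whose Target matches an object registered for that key.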

E.3.3.3 View

View component (and in particular RenderHandler class) is responsible for
- rendering the static game state (pit and the active cube),
- rendering the active cube animations (rotations and translations).

The animation of simultaneous rotations and translations of the active cube is achieved by keeping two vectors r and t which indicate the amount of rotation/translation animations remaining. At each frame, the active polycube is

1. translated to the coordinate origin,
2. rotated in all three directions simultaneously by the fraction r · timeFromPreviousRender / KeyboardHandler.REPEATED_KEY_HOLD_DELAY_MS,
3. translated by t · timeFromPreviousRender / KeyboardHandler.REPEATED_KEY_HOLD_DELAY_MS, and
4. translated back to its original location.

A screenshot of simultaneous translations and rotations is shown in figure E.6.


Figure E.6: Screenshot of Z-Tris game showing a simultaneous translation and rotation of the active
polycube around X- and Y-axes.

The active polycube is also rendered as being semi-transparent, so as not to occlude the playing field. It is achieved by

1. rendering the active polycube as the last element of the scene with blending enabled,
2. hiding the internal faces of individual cubes that make up the active polycube,
3. culling the front faces and blending the remaining back faces of the polycube onto the scene,
4. culling the back faces and blending the remaining front faces of the polycube onto the scene.

The rest of the rendering details are described in section 3.7.1.
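The two culling/blending passes can be sketched with OpenTK roughly as follows (the exact enumeration names differ between OpenTK versions, and drawActivePolycube stands in for the existing geometry-submission code):

using System;
using OpenTK.Graphics.OpenGL;

// Sketch of two-pass transparent rendering for the active polycube (fixed-function OpenGL via OpenTK);
// drawActivePolycube() is a placeholder for the existing geometry-drawing code.
public static class TransparentPolycubeSketch
{
    public static void Render(Action drawActivePolycube)
    {
        GL.Enable(EnableCap.Blend);
        GL.BlendFunc(BlendingFactorSrc.SrcAlpha, BlendingFactorDest.OneMinusSrcAlpha);
        GL.Enable(EnableCap.CullFace);

        // Pass 1: cull the front faces, so only the back faces are blended onto the scene.
        GL.CullFace(CullFaceMode.Front);
        drawActivePolycube();

        // Pass 2: cull the back faces, so the front faces are blended over the back faces.
        GL.CullFace(CullFaceMode.Back);
        drawActivePolycube();

        GL.Disable(EnableCap.CullFace);
        GL.Disable(EnableCap.Blend);
    }
}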

Appendix F HT3D Library Evaluation (Additional Details)


F.1 Evaluation Metrics

F.1.1 Sequence Track Detection Accuracy

STDA measure (introduced by Manohar et al. [32]) evaluates the performance of the object tracker in terms of the overall detection (number of objects detected, false alarms and missed detections), spatiotemporal accuracy of the detection (the proportion of the ground truth detected both in individual frames and in the whole tracking sequence) and the spatio-temporal fragmentation. The following notation (following the original paper) is used:
- $G_i^{(t)}$ denotes the ith ground truth object in the tth frame,
- $D_i^{(t)}$ denotes the ith detected object in the tth frame,
- $N_G^{(t)}$ and $N_D^{(t)}$ denote the number of ground truth/detected objects in the tth frame respectively,
- $N_{frames}$ is the total number of ground truth frames in the sequence,
- $N_{mapped}$ is the number of mapped ground truth and detected objects in the whole sequence, and
- $N_{(G_i \cup D_i \neq \emptyset)}$ is the total number of frames in which either the ground truth object i, or the detected object i (or both), are present.

Then the Track Detection Accuracy (TDA) measure for the ith object can be calculated as the spatial overlap (i.e. the ratio of the spatial intersection and union) between the ground truth and the tracking output of object i. More precisely, TDA can be defined as

    TDA_i = \sum_{t=1}^{N_{frames}} \frac{|G_i^{(t)} \cap D_i^{(t)}|}{|G_i^{(t)} \cup D_i^{(t)}|}.    (F.1)

Observe that the TDA measure penalizes for both false negatives (undetected ground truth area) and false positives (detections that do not overlay any ground truth area). To obtain the STDA measure, TDA is averaged for the best mapping of all objects in the sequence, i.e.

    STDA = \sum_{i=1}^{N_{mapped}} \frac{TDA_i}{N_{(G_i \cup D_i \neq \emptyset)}}
         = \sum_{i=1}^{N_{mapped}} \frac{\sum_{t=1}^{N_{frames}} |G_i^{(t)} \cap D_i^{(t)}| \, / \, |G_i^{(t)} \cup D_i^{(t)}|}{N_{(G_i \cup D_i \neq \emptyset)}}.    (F.2)
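For illustration, the spatial overlap that TDA accumulates frame by frame can be computed as in the following sketch (axis-aligned rectangles stand in for the tagged and detected head regions; all names are illustrative):

using System.Collections.Generic;
using System.Drawing;

// Sketch of the TDA computation from equation (F.1) for one object, using rectangles
// as stand-ins for the ground truth (G) and detected (D) regions; names are illustrative.
public static class TdaSketch
{
    // Each list has one (possibly null) entry per frame; null means the object is absent in that frame.
    public static double ComputeTda(IList<Rectangle?> groundTruth, IList<Rectangle?> detections)
    {
        double tda = 0.0;
        for (int t = 0; t < groundTruth.Count; t++)
        {
            if (groundTruth[t] == null || detections[t] == null) continue;

            Rectangle g = groundTruth[t].Value;
            Rectangle d = detections[t].Value;
            Rectangle intersection = Rectangle.Intersect(g, d); // empty rectangle if no overlap

            double intersectionArea = (double)intersection.Width * intersection.Height;
            double unionArea = (double)g.Width * g.Height + (double)d.Width * d.Height - intersectionArea;

            if (unionArea > 0)
            {
                tda += intersectionArea / unionArea; // |G ∩ D| / |G ∪ D| for frame t
            }
        }
        return tda;
    }
}

STDA is then obtained by dividing each object's TDA by N_{(G_i ∪ D_i ≠ ∅)} and summing over the mapped objects.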


F.1.2 Multiple Object Tracking Accuracy/Precision

CLEAR (Classification of Events, Activities and Relationships) was an international effort to evaluate systems for the perception of people, their activities and interactions. CLEAR evaluation workshops [39] held in 2006 and 2007 introduced Multiple Object Tracking Precision (MOTP) and Multiple Object Tracking Accuracy (MOTA) metrics for 2D face tracking task evaluation [5].

MOTP metric attempts to evaluate the total error in estimated positions of ground truth/detection pairs for the whole sequence, averaged over the total number of matches made. More precisely, MOTP is defined as

    MOTP = \frac{\sum_{j=1}^{N_{frames}} \sum_{i=1}^{N_{mapped}^{(j)}} TDA_i}{\sum_{j=1}^{N_{frames}} N_{mapped}^{(j)}},    (F.3)

where $N_{mapped}^{(j)}$ is the number of mapped objects in the jth frame.

MOTA metric is derived from three error ratios (ratio of misses, false alarms and mismatches in the sequence, computed over the number of objects present in all frames) and attempts to assess the accuracy aspect of the system's performance. MOTA is defined as

    MOTA = 1 - \frac{\sum_{i=1}^{N_{frames}} \left( c_M(FN_i) + c_{FP}(FP_i) + \ln S \right)}{\sum_{i=1}^{N_{frames}} N_G^{(i)}},    (F.4)

where $c_M(x)$ and $c_{FP}(x)$ are the cost functions for missed detection and false alarm penalties, $FN_i$ and $FP_i$ are the numbers of false negatives/false positives in the ith frame respectively and S is the total number of object ID switches for all objects. In turn, false negative and false positive counts are defined as

    FN_i = \sum_{j=1}^{N_{mapped}} \mathbb{1}\!\left[ \frac{|G_j^{(i)} \setminus D_j^{(i)}|}{|G_j^{(i)}|} > \tau_{FN} \right],    (F.5)

    FP_i = \sum_{j=1}^{N_{mapped}} \mathbb{1}\!\left[ \frac{|D_j^{(i)} \setminus G_j^{(i)}|}{|D_j^{(i)}|} > \tau_{FP} \right],    (F.6)

where $\tau_{FN}$ and $\tau_{FP}$ are the false negative/false positive ratio thresholds (illustrated in figure F.1).
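As an illustration, the per-frame counting behind equations (F.5) and (F.6) reduces to thresholding the missed and spurious area fractions of every mapped ground truth/detection pair, as in the following sketch (illustrative names; the per-pixel set operations are summarized by three area values, all assumed to be positive):

// Sketch of the per-frame FN/FP counting behind equations (F.5) and (F.6); names are illustrative.
public struct MappedPairOverlap
{
    public double GroundTruthArea;   // |G_j^(i)|
    public double DetectionArea;     // |D_j^(i)|
    public double IntersectionArea;  // |G_j^(i) ∩ D_j^(i)|
}

public static class MotaCountingSketch
{
    public static void CountFrame(
        MappedPairOverlap[] pairs, double fnThreshold, double fpThreshold,
        out int falseNegatives, out int falsePositives)
    {
        falseNegatives = 0;
        falsePositives = 0;
        foreach (MappedPairOverlap pair in pairs)
        {
            // Fraction of the ground truth area left undetected: |G \ D| / |G|.
            double missedFraction = (pair.GroundTruthArea - pair.IntersectionArea) / pair.GroundTruthArea;
            // Fraction of the detection area not covering any ground truth: |D \ G| / |D|.
            double spuriousFraction = (pair.DetectionArea - pair.IntersectionArea) / pair.DetectionArea;

            if (missedFraction > fnThreshold) falseNegatives++;
            if (spuriousFraction > fpThreshold) falsePositives++;
        }
    }
}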

F.1.3 Average Normalized Distance from the Head Center

For motion parallax simulation, accurately localizing the face center is more important than achieving a higher spatio-temporal overlap between the detected and tagged objects. To measure HT3D colour, depth and combined head-trackers in this regard, an Average Normalized Distance from Head Center (δ) metric is constructed. The ground truth head ellipse in frame i is described by its center location c_i and the locations of the semi-major and semi-minor axes (points a_i and b_i respectively).


Figure F.1: False positive (false alarm) and false negative (miss) definitions for MOTA metric. Blue ellipse indicates the detected head D_i, red ellipse indicates tagged ground truth G_i.

Let h_i be the head center location in frame i, as predicted by the head tracker. Then the normalized distance between the detected and tagged head centres δ_i can be calculated by transforming the ellipse into a unit circle centred around the origin, and measuring the length of the transformed head center vector (as shown in figure F.2). Let θ_j be the angle between the major-axis of the ellipse and the x-axis in the jth frame. Observe that

    \theta_j = \cos^{-1}\!\left( \frac{(a_j - c_j) \cdot \hat{\imath}}{|a_j - c_j|} \right).

Then the average normalized distance from the tagged head center can be calculated as

    \bar{\delta} = \frac{1}{N_{frames}} \sum_{i=1}^{N_{frames}} \delta_i
                 = \frac{1}{N_{frames}} \sum_{i=1}^{N_{frames}} \left\| \begin{pmatrix} \frac{\cos\theta_i}{|a_i - c_i|} & \frac{\sin\theta_i}{|a_i - c_i|} \\ -\frac{\sin\theta_i}{|b_i - c_i|} & \frac{\cos\theta_i}{|b_i - c_i|} \end{pmatrix} (h_i - c_i)^T \right\|.    (F.7)


Figure F.2: δ-metric computation: a) the ith input frame is transformed to image b) so that the ground truth ellipse shown in red would be mapped to a unit circle centred at the origin. Then the normalized distance metric δ_i is the length of the position vector given by the transformed head center prediction (shown in blue).
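The per-frame computation behind equation (F.7) can be sketched as follows (illustrative names; the ellipse is assumed to be given by its center and the endpoints of its semi-major and semi-minor axes in pixel coordinates):

using System;

// Sketch of the per-frame normalized head-center distance from equation (F.7); names are illustrative.
public static class NormalizedDistanceSketch
{
    // center, majorAxisEnd and minorAxisEnd describe the tagged ground truth ellipse;
    // predictedCenter is the head center reported by the tracker. All points are (x, y) pairs.
    public static double Compute(
        double[] center, double[] majorAxisEnd, double[] minorAxisEnd, double[] predictedCenter)
    {
        double ax = majorAxisEnd[0] - center[0], ay = majorAxisEnd[1] - center[1];
        double bx = minorAxisEnd[0] - center[0], by = minorAxisEnd[1] - center[1];
        double majorLength = Math.Sqrt(ax * ax + ay * ay);
        double minorLength = Math.Sqrt(bx * bx + by * by);

        // Angle between the major axis and the x-axis.
        double theta = Math.Acos(ax / majorLength);

        // Head-center prediction relative to the ellipse center.
        double hx = predictedCenter[0] - center[0];
        double hy = predictedCenter[1] - center[1];

        // Rotate by -theta and scale each axis so that the ellipse maps to a unit circle.
        double u = (Math.Cos(theta) * hx + Math.Sin(theta) * hy) / majorLength;
        double v = (-Math.Sin(theta) * hx + Math.Cos(theta) * hy) / minorLength;

        return Math.Sqrt(u * u + v * v); // delta_i: distance from the origin in the unit-circle frame
    }
}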


F.2 Evaluation Set

F.2.1 Viola-Jones Face Detector Output

Figure F.3: Output of Viola-Jones face detector for all HT3D library evaluation recordings. False
positive face detections are marked with red crosses.


F.2.2 δ Metric for Individual Recordings

Head tracking accuracy (δ metric evolution) for all evaluation set recordings is shown below.

Figure F.4: Head rotation (roll) recording (frames 74, 136, 219, 228, 232, 239, 244, 252, 258, 655, 731 and 761 shown). Marked red area indicates the output of the combined (depth and colour) head tracker.

Figure F.5: Head-tracking accuracy (δ metric) for the Head rotation (roll) recording.

Figure F.6: Head rotation (yaw) recording (frames 63, 144, 315, 431, 453, 466, 488, 507, 531, 553, 659 and 827 shown).

Figure F.7: δ metric for the Head rotation (yaw) recording.

Figure F.8: Head rotation (pitch) recording (frames 63, 81, 144, 167, 204, 228, 259, 264, 290, 353, 492 and 657 shown).

Figure F.9: δ metric for the Head rotation (pitch) recording.

Figure F.10: Head rotation (all) recording (frames 48, 77, 98, 114, 183, 282, 308, 343, 550, 564, 612 and 636 shown).

Figure F.11: δ metric for the Head rotation (all) recording.

Figure F.12: Head translation (horizontal and vertical) recording (frames 54, 71, 139, 166, 248, 346, 465, 531, 549, 591, 718 and 735 shown).

Figure F.13: δ metric for the Head translation (horizontal and vertical) recording.

Figure F.14: Head translation (anterior/posterior) recording (frames 38, 56, 75, 91, 140, 212, 368, 395, 551, 659, 696 and 729 shown).

Figure F.15: δ metric for the Head translation (anterior/posterior) recording.

Figure F.16: Head translation (all) recording (frames 57, 153, 199, 237, 299, 334, 374, 479, 563, 619, 743 and 783 shown).

Figure F.17: δ metric for the Head translation (all) recording.

Figure F.18: Head rotation and translation (all) recording (frames 41, 80, 124, 144, 188, 327, 350, 407, 420, 482, 670 and 727 shown).

Figure F.19: δ metric for the Head rotation and translation (all) recording.

Figure F.20: Participant #1 recording (frames 64, 104, 152, 237, 354, 386, 591, 642, 668, 687, 743 and 812 shown).

Figure F.21: δ metric for the Participant #1 recording.

Figure F.22: Participant #2 recording (frames 150, 180, 185, 218, 354, 449, 484, 498, 528, 542, 580 and 767 shown).

Figure F.23: δ metric for the Participant #2 recording.

Figure F.24: Participant #3 recording (frames 50, 136, 219, 313, 382, 409, 496, 538, 618, 623, 654 and 776 shown).

Figure F.25: δ metric for the Participant #3 recording.

Figure F.26: Participant #4 recording (frames 0, 199, 232, 259, 272, 416, 549, 603, 651, 722, 738 and 847 shown).

Figure F.27: δ metric for the Participant #4 recording.

Figure F.28: Participant #5 recording (frames 70, 174, 240, 298, 363, 458, 531, 537, 621, 705, 751 and 822 shown).

Figure F.29: δ metric for the Participant #5 recording.

Figure F.30: Illumination (low) recording (frames 62, 126, 142, 253, 268, 377, 439, 466, 511, 529, 671 and 782 shown).

Figure F.31: δ metric for the Illumination (low) recording.

Figure F.32: Illumination (changing) recording (frames 11, 70, 156, 242, 275, 361, 534, 551, 653, 728, 770 and 840 shown).

Figure F.33: δ metric for the Illumination (changing) recording.

Figure F.34: Illumination (high) recording (frames 0, 110, 174, 326, 359, 392, 464, 607, 721, 767, 823 and 842 shown).

Figure F.35: δ metric for the Illumination (high) recording.

Figure F.36: Facial expressions recording (frames 22, 129, 175, 196, 215, 247, 266, 402, 510, 690, 700 and 818 shown).

Figure F.37: δ metric for the Facial expressions recording.

Figure F.38: Cluttered background recording (frames 11, 69, 103, 142, 199, 278, 403, 480, 617, 658, 705 and 741 shown).

Figure F.39: δ metric for the Cluttered background recording.

Figure F.40: Occlusions recording (frames 50, 106, 163, 367, 379, 410, 416, 433, 564, 616, 620 and 731 shown).

Figure F.41: δ metric for the Occlusions recording.

Figure F.42: Multiple viewers recording (frames 95, 182, 249, 315, 390, 426, 515, 567, 594, 618, 694 and 767 shown).

Figure F.43: δ metric for the Multiple viewers recording.


F.2.3 MOTA/MOTP Evaluation Results

MOTA/MOTP metrics for all evaluation recordings are summarized in figure F.44. Akin to the STDA metric, the depth and combined trackers outperform the colour-based head tracker, but fall short of the inter-annotator agreement.

Figure F.44: MOTA/MOTP metrics for all evaluation recordings, evaluated using the default tracker settings given in table 4.6. Higher values indicate better performance.

Appendix G 3D Display Simulator (Z-Tris) Evaluation


The main intention behind the Z-Tris implementation was to provide a proof-of-concept 3D application that strengthens depth perception using continuous motion parallax (obtained by changing the perspective projection based on the viewer's head position). To verify the operation of this proof-of-concept, a combination of automated and manual tests was used. The performance of the 3D display simulator was also measured, to ensure that real-time rendering rates can be achieved while simulating all depth cues mentioned earlier.

G.1 Automated Testing

The Unit Testing Framework provided by the Microsoft Visual Studio IDE was used to author/run unit and smoke tests. A sample run is shown in figure 4.26. As summarized in table G.1, most of the code (around 85.25%) in the Z-Tris core (main classes from figure E.4) was covered by automated testing. Automated smoke tests were also used as part of the regression testing to always maintain the application in a working state when progressing through development iterations.

Class name            Coverage (% blocks)   Covered (blocks)   Not covered (blocks)
RenderHandler         85.38%                543                93
LogicHandler          84.25%                385                72
SpriteHandler         67.27%                74                 36
PreferencesHandler    93.94%                93                 6
KeyboardHandler       100.00%               58                 0
Polycube              100.00%               45                 0
DisplayUtilities      83.33%                10                 2
Total:                85.25%                1,208              209

Table G.1: Z-Tris core unit test code coverage.


Figure G.1: Z-Tris off-axis projection and head-tracking manual integration testing: the same scene is rendered with the viewer's head positioned at the a) left, b) right, c) top and d) bottom bevels of the display.

G.2 Manual Testing

Manual tests included hours of functional testing, both to evaluate the requirements given in section 2.3 and to perform basic usability and sanity tests. A significant amount of time was also spent performing manual system integration testing. Figure G.1 shows one example integration test scenario, where the same scene is rendered from four different viewpoints based on the viewer's head position, to test the integration of head-tracking and off-axis projection rendering.

G.3 Performance

After integrating the 3D display rendering subsystem and HT3D head-tracking library, the overall system run-time performance has been measured. The system was built in a 64-bit Release mode,


with no debug information, and with optimizations enabled. The final setup of the integrated system is shown in figure 3.19. The overall system's performance was measured using the main development machine, running a 64-bit Windows 7 OS on a dual-core hyperthreaded Intel Core i5-2410M CPU @ 2.30 GHz. As expected, the real-time average Z-Tris game rendering speed (with the combined colour and depth head tracker enabled) was 29.969 frames-per-second (with the standard deviation of 5.167 frames), i.e. real-time requirements were satisfied. A single CPU core experienced an average load of 64.98% (minimum 17.13%, maximum 88.01%) per 5 minutes of game play, indicating that further processing resources were still available.

Appendix H Sample Code Listings


Listing H.1: KeyboardHandler code.
using System;
using System.Collections.Generic;
using System.Text;
using OpenTK.Input;

namespace ZTris
{
    /// <summary>
    /// Keyboard handler, responsible for interfacing between the user and the game logic.
    /// This class is capable of handling multiple key events simultaneously (as required for
    /// the control of the game), multiple keyboard event subscribers and customization of
    /// timing constraints. It operates using the following protocol: a keyboard key code and
    /// the caller are registered through a call to RegisterKey(Key key, object sender), and an
    /// event handler (callback function) is registered through the OnKeyPress event.
    /// KeyboardHandler monitors the state of the keyboard and, when one of the registered keys
    /// was pressed, it notifies all OnKeyPress event subscribers. If a key is not released, it
    /// repeatedly triggers OnKeyPress events according to INITIAL_KEY_HOLD_DELAY_MS and
    /// REPEATED_KEY_HOLD_DELAY_MS timing.
    /// </summary>
    public class KeyboardHandler
    {
        #region Internal classes
        /// <summary>
        /// Internal mutable key state representation.
        /// </summary>
        private class KeyState
        {
            public DateTime LastPressTime;
            public bool IsRepeated;
            public bool IsFirst;
        }

        #endregion

        #region Constants
        /// <summary>Represents the initial key hold delay until the second key event is
        /// triggered.</summary>
        public const int INITIAL_KEY_HOLD_DELAY_MS = 400;
        /// <summary>Represents the key hold delay until the third (and all subsequent)
        /// key events are triggered.</summary>
        public const int REPEATED_KEY_HOLD_DELAY_MS = 180;
        #endregion

        #region Private fields
        /// <summary>Interfaced keyboard device.</summary>
        private IKeyboardDevice _keyboard = null;
        /// <summary>Maps keyboard keys to their states.</summary>
        private Dictionary<Key, KeyState> _pressedKeys = new Dictionary<Key, KeyState>();
        /// <summary>Maps registered keyboard keys to their subscribers.</summary>
        private Dictionary<Key, List<object>> _registeredKeys = new Dictionary<Key, List<object>>();
        #endregion

        #region Public fields
        /// <summary>Key press event handler type.</summary>
        /// <param name="key">Keyboard key that triggered the event.</param>
        public delegate void KeyEventHandler(Key key);
        /// <summary>Key press event handler.</summary>
        public event KeyEventHandler OnKeyPress;
        #endregion

        #region Constructors
        /// <summary>Default keyboard handler constructor.</summary>
        /// <param name="keyboard">Keyboard interface.</param>
        public KeyboardHandler(IKeyboardDevice keyboard)
        {
            _keyboard = keyboard;
        }
        #endregion

        #region Public methods
        /// <summary>
        /// A method to register a subscriber's interest in a particular key press.
        /// Typically this method would be called as <c>RegisterKey(..., this)</c>.
        /// </summary>
        /// <param name="key">Keyboard key to register.</param>
        /// <param name="subscriber">Handle to the subscriber.</param>
        public void RegisterKey(Key key, object subscriber)
        {
            _pressedKeys.Add(key, new KeyState()
            {

                LastPressTime = DateTime.Now,
                IsRepeated = false,
                IsFirst = true
            });

            if (!_registeredKeys.ContainsKey(key))
            {
                _registeredKeys.Add(key, new List<object>());
            }
            _registeredKeys[key].Add(subscriber);
        }

        /// <summary>
        /// A method to register a subscriber's interest in particular key presses.
        /// Typically this method would be called as <c>RegisterKeys(..., this)</c>.
        /// </summary>
        /// <param name="keys">Keyboard keys to register.</param>
        /// <param name="subscriber">Handle to the subscriber.</param>
        public void RegisterKeys(Key[] keys, object subscriber)
        {
            foreach (Key key in keys)
            {
                this.RegisterKey(key, subscriber);
            }
        }

        /// <summary>
        /// Main event processing loop where the subscribed key events are triggered.
        /// </summary>
        public void UpdateStatus()
        {
            // Record key press time before processing
            DateTime keyPressTime = DateTime.Now;

            // Check the status of each registered key
            foreach (Key key in _registeredKeys.Keys)
            {
                KeyState pressedKeyState = _pressedKeys[key];

                if (_keyboard[key])
                {
                    bool triggerKeyPressEvent = false;

                    // If the key is pressed for the first time, trigger the event immediately
                    if (pressedKeyState.IsFirst)
                    {
                        triggerKeyPressEvent = true;
                        pressedKeyState.IsFirst = false;
                    }
                    // If the key was held, trigger the events according to timing constraints
                    else
                    {
                        double timeSinceLastPressInMs =
                            keyPressTime.Subtract(pressedKeyState.LastPressTime).TotalMilliseconds;

                        triggerKeyPressEvent = (pressedKeyState.IsRepeated
                            ? (timeSinceLastPressInMs > REPEATED_KEY_HOLD_DELAY_MS)
                            : (timeSinceLastPressInMs > INITIAL_KEY_HOLD_DELAY_MS));

                        // Update key state
                        if (triggerKeyPressEvent)
                        {

                            pressedKeyState.IsRepeated = true;
                        }
                    }

                    if (triggerKeyPressEvent)
                    {
                        // Record last press time
                        pressedKeyState.LastPressTime = keyPressTime;

                        // Trigger subscriber event handlers
                        this.CallbackSubscribers(key);
                    }
                }
                else
                {
                    pressedKeyState.IsRepeated = false;
                    pressedKeyState.IsFirst = true;
                }
            }
        }
        #endregion


        #region Private methods
        /// <summary>
        /// Calls back the event handlers of subscribers to a particular key press.
        /// </summary>
        /// <param name="key">Keyboard key that was pressed.</param>
        private void CallbackSubscribers(Key key)
        {
            foreach (Delegate eventCallback in OnKeyPress.GetInvocationList())
            {
                if (_registeredKeys[key].Contains(eventCallback.Target))
                {
                    eventCallback.DynamicInvoke(key);
                }
            }
        }
        #endregion
    }
}

Listing H.2: IKeyboardDevice interface.


using System;
using OpenTK.Input;

namespace ZTris
{
    /// <summary>
    /// Keyboard device interface, responsible for providing the keyboard status.
    /// </summary>
    public interface IKeyboardDevice
    {
        /// <summary>
        /// An indexer returning the status of a particular key.
        /// </summary>
        /// <param name="key">Keyboard key of interest.</param>
        /// <returns>
        /// true if <see cref="key"/> is pressed, false if <see cref="key"/> is released.
        /// </returns>

        bool this[Key key] { get; }
    }
}


MEASURING HEAD DETECTION AND TRACKING SYSTEM ACCURACY


EXPERIMENT CONSENT FORM
EXPERIMENT PURPOSE
This experiment is part of the Computer Science Tripos Part II project evaluation. The project in question involves using a Microsoft Kinect sensor to track the viewer's head position in space. The main purpose of the experiment is to ensure that the face detector/tracker is robust and works for different viewers.

EXPERIMENT PROCEDURE
The experiment consists of recording two colour and depth videos (each 30 seconds long) of the participant moving his/her head in a free-form manner. A possible range of head/face muscle motions that can be performed includes (but is not limited to):
- Head rotation (yaw/pitch/roll),
- Head translation (horizontal/vertical, anterior/posterior),
- Facial expressions, e.g. joy, surprise, fear, anger, disgust, sadness, etc.

CONFIDENTIALITY
The following data will be stored: two (2) colour and depth recordings (each 30 seconds long). No other personal data will be retained. Recorded videos will be kept in accordance with the Data Protection Act and destroyed after the submission of the dissertation.

FINDING OUT ABOUT RESULT


If interested, you can find out the result of the study by contacting Manfredas Zabarauskas, after 18/05/2012. His phone number is 0754 195 8411 and his email address is mz297@cam.ac.uk.

PLEASE NOTE THAT: - YOU HAVE THE RIGHT TO STOP PARTICIPATING IN THE EXPERIMENT, POSSIBLY WITHOUT GIVING A REASON. - YOU HAVE THE RIGHT TO OBTAIN FURTHER INFORMATION ABOUT THE PURPOSE AND THE OUTCOMES OF THE EXPERIMENT. - NONE OF THE TASKS IS A TEST OF YOUR PERSONAL ABILITY. THE OBJECTIVE IS TO TEST THE ACCURACY OF THE IMPLEMENTED HEAD TRACKING SYSTEM.

RECORD OF CONSENT
Your signature below indicates that you have understood the information about the Measuring Head Detection and Tracking System Accuracy experiment and consent to your participation. The participation is voluntary and you may refuse to answer certain questions on the questionnaire and withdraw from the study at any time with no penalty. This does not waive your legal rights. You should have received a copy of the consent form for your own record. If you have further questions related to this research, please contact the researcher.

Participant (Name, Signature): __________________________________________________________________________ Researcher (Name, Signature): __________________________________________________________________________

Date (dd/mm/yy): __________________________________ Date (dd/mm/yy): __________________________________

MEASURING HEAD DETECTION AND TRACKING SYSTEM ACCURACY


VIDEO AND DEPTH RECORDING RELEASE FORM

RELEASE STATEMENT

I HEREBY ASSIGN AND GRANT TO MANFREDAS ZABARAUSKAS THE RIGHT AND PERMISSION TO USE AND PUBLISH (PARTIALLY OR IN FULL) THE VIDEO AND/OR DEPTH RECORDINGS MADE DURING THE MEASURING HEAD DETECTION AND TRACKING SYSTEM ACCURACY EXPERIMENT, AND I HEREBY RELEASE MANFREDAS ZABARAUSKAS FROM ANY AND ALL LIABILITY FROM SUCH USE AND PUBLICATION.

Participant (Name, Signature): __________________________________________________________________________ Researcher (Name, Signature): __________________________________________________________________________

Date (dd/mm/yy): _________________________ Date (dd/mm/yy): _________________________

Appendix I Project Proposal

Computer Science Tripos Part II Project Proposal 3D Display Simulation Using Head Tracking with Microsoft Kinect
M. Zabarauskas, Wolfson College (mz297) Originator: M. Zabarauskas 20 October 2011

Project Supervisor: Prof N. Dodgson    Signature:
Director of Studies: Dr C. Town    Signature:
Project Overseers: Dr S. Clark & Prof J. Crowcroft    Signatures:


Introduction
Reliable real-time human face detection and tracking has been one of the most interesting problems in the field of computer vision in the past few decades. The emergence of the cheap and ubiquitous Microsoft Kinect sensor containing an IR-depth camera provides new opportunities to enhance the reliability and speed of face detection and tracking. Moreover, the ability to use the depth information to track the user's head in 3D space opens up a lot of potential for new immersive user interfaces.

In my project I want to implement widely recognized, industry-standard face detection and tracking methods: a Viola-Jones object detection framework and a CAMShift (Continuously Adaptive Mean Shift) face tracker, based on the ideas presented by the authors in their original papers. Having achieved that, I want to explore the opportunities of using the current state-of-the-art methods to integrate the depth information into face detection and tracking algorithms, in order to increase their speed and accuracy.

As a next and final part of the project, I want to employ the depth information provided by Kinect to obtain an accurate 3D location of the viewer with respect to the display. The knowledge of the viewer's head coordinates in 3D will allow me to simulate the parallax motion that occurs between the visually overlapping near and far objects in a 3D scene when the user's viewpoint changes, mimicking a three-dimensional display viewing experience.

Method Descriptions
The Viola-Jones face detector mentioned above is a breakthrough method for face detection, proposed by Viola and Jones [43] in 2001. They described a family of extremely simple classifiers (called rectangle features, reminiscent of Haar wavelets) and a representation of a grayscale image (called integral image), using which these Haar-like features can be calculated in constant time. Then, using a classifier boosting algorithm based on AdaBoost, a number of most effective features can be extracted and combined to yield an efficient strong classifier with an extremely low false negative rate, but a high false positive rate. Finally, they proposed a method to arrange strong classifiers into a linear cascade, which can quickly discard non-face regions, focusing on likely face regions in the image to decrease the false positive rate.

After the face has been localized in the image, it can be efficiently tracked using the face colour distribution information. CAMShift (Continuously Adaptive Mean Shift) was first proposed by Gary Bradski [8] at Intel in 1998. In this method, a hue histogram of a face that is being tracked is used to derive the face probability distribution, where the most frequently occurring colour is assigned the probability 1.0 and the probabilities of other colours are computed based on their relative frequency to the most frequent colour. Then, given a new search window, a mean shift algorithm is used (with a simple step-function as a kernel) to converge to a probability centroid of the face colour probability distribution. The size of the search window is then adjusted as a function of the zeroth moment, and the repositioning/resizing is repeated until the result changes less than a fixed threshold.

However, these colour information based face detection and tracking methods encounter difficulties in situations when the face orientation does not match the training ones (e.g. when the user is facing away from the camera), when the background is visually cluttered and so on. Burgin et al. [10] suggested a few simple ways how the depth information could be used to improve face detection. For example, given a certain distance from the camera, the realistic range of human


head sizes in pixels can be calculated. This can then be used to reject certain window sizes and to improve on the exhaustive search for faces in an entire image in the Viola-Jones algorithm. Similarly, they suggested that the distance thresholding could also be used to improve the face detection efficiency, since far-away points are likely to be blurry or to contain too few pixels for reliable face detection.

On a similar note, Xia et al. [31] described a 3D model fitting algorithm for the head tracking. Their algorithm scales a hemisphere model to the head size estimated from the depth values of the location possibly containing a head (using an equation regressed from the empirical head size/depth measurements). Then it attempts to minimize the square error between the possible head region and the hemisphere template. Since this approach uses the generalized head depth characteristics (front, side, back views, as well as higher and lower views of the head approximate a hemisphere), it is view invariant. When combined with CAMShift face tracker, the 3D model fitting approach should enhance the reliability of the overall tracking even when the person turns to look away for a few seconds.

These improved ways of face detection and tracking, combined with the depth information provided by Kinect, can be employed to obtain the accurate 3D location of the viewer with respect to the display. The location can then be used to simulate the parallax motion (near objects moving faster in relation to far objects), evoking a visual sense of depth as perceived in real three-dimensional environments.
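As a concrete illustration of the first idea, a detection window could be rejected when its implied physical head size is implausible for the measured depth; the sketch below uses a simple pinhole approximation, and the head-size bounds and focal length parameter are illustrative assumptions rather than values taken from the paper:

// Sketch of depth-based detection window rejection (in the spirit of Burgin et al. [10]);
// the constants below are illustrative assumptions, not values from the paper.
public static class DepthWindowRejectionSketch
{
    private const double MinHeadWidthMetres = 0.12;
    private const double MaxHeadWidthMetres = 0.25;

    // Returns true if a square detection window of windowSizePx pixels is physically plausible
    // for a head at the given depth, using a pinhole approximation with the (assumed)
    // horizontal focal length of the camera in pixels.
    public static bool IsPlausibleWindow(int windowSizePx, double depthMetres, double focalLengthPx)
    {
        double windowWidthMetres = windowSizePx * depthMetres / focalLengthPx;
        return windowWidthMetres >= MinHeadWidthMetres && windowWidthMetres <= MaxHeadWidthMetres;
    }
}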

Resources Required
Hardware

- Microsoft Kinect sensor. Acquired.
- Development PC. Acquired: hyperthreaded dual-core Intel i5-2410M running at 2.90 GHz, 8 GB RAM, 250 GB HDD.
- Primary back-up storage: 0.5 GB space on PWF for program and dissertation sources only. Acquired.
- Secondary back-up storage: 16 GB USB flash drive for source code/dissertation/built snapshots. Acquired.

Software

- Development: Microsoft Visual Studio 2010, Kinect SDK, OpenTK (OpenGL wrapper for C#), Math.NET (mathematical open source library for C#). All installed.
- Back-up: Subversion CVS. Installed both on local machine and on PWF.

Training data

- Face/non-face training images for Viola-Jones. Acquired 4916 face images and 7960 non-face images from Robert Pless website [11].

Starting Point
- Basic knowledge of C#,
- Minimal familiarity with OpenGL,
- Nearly no knowledge of computer vision.

Substance and Structure of the Project


As discussed in the introduction, the substance of the project can be split into the following stages:
- to implement the industry standard colour-based face detection and tracking algorithms (viz. Viola-Jones and CAMShift),
- to extend these algorithms using the depth information provided by Microsoft Kinect's IR-depth sensing camera,
- to simulate the parallax motion effect using the calculated head movements in 3D, creating a 3D display effect.

Viola-Jones Face Detector


As described in the introduction, the main task will be to implement the AdaBoost algorithm which will combine Haar-like weak classifiers into a strong classifier. These classifiers will be connected into a classifier cascade, such that early stages reject the image locations that are not likely to contain the faces. It is crucial to get this stage implemented early, since classifier training can take days/weeks.

CAMShift Face Tracker


The main task for the face tracker will be to implement the histogram backprojection method and the mean-shift algorithm.

Depth Cue Integration for Viola-Jones Detector


Since the suggestions in Burgin et al. [10] paper are relatively straightforward (e.g. distance thresholding), the main task will be simply to code the unnecessary image region elimination before launching the Viola-Jones detector.

Tracker Extension Using Depth Cues


Based on the Xia et al. [31] approach, 3D hemisphere fitting will have to be implemented. However, additional work will be required to ensure that when the CAMShift colour-based tracker loses the face, the depth-based tracker reliably takes over, and vice versa.


3D Display Simulation Using Parallax Motion


Having obtained the head location in pixel (and depth) coordinates during the stages above, the head's location in 3D can be calculated using publicly available conversion equations (derived by measuring the focal distances, distortion coefficients and other parameters of both depth and RGB cameras). To simulate the effect of parallax motion, a simple OpenGL scene will be created and the scene's viewpoint will be set to follow the head's motions in 3D.
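As a rough sketch of this final stage, the viewpoint can be coupled to the tracked head position through an off-axis perspective frustum (the physical screen dimensions and the coordinate conventions below are illustrative assumptions):

using OpenTK.Graphics.OpenGL;

// Sketch of an off-axis projection driven by the viewer's head position (in metres, relative to
// the screen center, with +z pointing from the screen towards the viewer and headZ > 0);
// the default screen dimensions are illustrative.
public static class ParallaxProjectionSketch
{
    public static void Apply(double headX, double headY, double headZ,
                             double screenWidth = 0.51, double screenHeight = 0.29,
                             double near = 0.1, double far = 100.0)
    {
        // Project the physical screen edges onto the near plane, as seen from the head position.
        double left   = (-screenWidth  / 2 - headX) * near / headZ;
        double right  = ( screenWidth  / 2 - headX) * near / headZ;
        double bottom = (-screenHeight / 2 - headY) * near / headZ;
        double top    = ( screenHeight / 2 - headY) * near / headZ;

        GL.MatrixMode(MatrixMode.Projection);
        GL.LoadIdentity();
        GL.Frustum(left, right, bottom, top, near, far);

        // Shift the scene so that the world origin stays anchored to the physical screen.
        GL.MatrixMode(MatrixMode.Modelview);
        GL.LoadIdentity();
        GL.Translate(-headX, -headY, -headZ);
    }
}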

Success Criteria
For the project to be deemed a success, the following items have to be completed:
1. Viola-Jones face detector,
2. CAMShift tracker,
3. Viola-Jones detector extensions using depth cues,
4. 3D hemisphere-fitting tracker,
5. OpenGL program simulating the parallax motion effect.
Furthermore, the implemented items should have a comparable performance to the one in the papers that describe these methods.

Evaluation Criteria
The face detector and trackers can be quantitatively evaluated on their speed and ROC (receiver operating characteristic), i.e. the rate of correct detections versus the false positive rate, as well as on the precision (TP/(TP + FP)), recall (TP/(TP + FN)), accuracy ((TP + TN)/(TP + TN + FP + FN)) and other metrics. Similarly, their robustness against different head orientations (tilt, rotation), distance to camera, speed of movement, global illumination conditions, etc. can be quantitatively measured. Then the relative performance and accuracy gain/loss, obtained by adding the depth cues to the face detector/tracker, can be obtained. Finally, the accuracy of head location tracking in 3D (with respect to head translation along the X-, Y- and Z-axes) can be assessed.


Possible Extensions
Given enough time, the system could be extended to deal with multiple people. This would involve only minor changes to the Viola-Jones detector, but should be more challenging for the trackers. Both colour and depth information based trackers should now deal with partial/full occlusion and object tagging (i.e. if person A passes behind person B, it should not treat person A as a new person in the image, and should not get confused between A and B); depth information tracker should have more potential in disambiguating these situations. After implementing the extension above, OpenGL scene could be trivially segmented so that each viewer would see her own 3D segment of a display.

Work Plan
The work will be split into 16 two-week packages, as detailed below:

07/10/11 - 20/10/11
Gain a better understanding in face detection and tracking methods described above. Set up the development and backing-up environment. Obtain the colour and depth input streams from Kinect. Write the project proposal. Milestones: SVN set up on PWF. Written a small C# test project for Microsoft Visual Studio 2010 and Kinect SDK, that fetches the input colour and depth streams from the device, and renders them on a screen. Project proposal written and handed in.

21/10/11 - 03/11/11
Fully understand the Viola-Jones face detector and start implementing it. Add additional face images to the training set if required. Milestones: clear understanding of Viola-Jones face detector. Pieces of working implementation.

04/11/11 - 17/11/11
Finish implementing the Viola-Jones algorithm and start the training. Start reading about the CAMShift algorithm. Milestones: implementation of Viola-Jones face detector.


18/11/11 - 01/12/11
Fully understand and implement the CAMShift face tracker. Integrate it to the Viola-Jones face detector as a next stage (when the face is detected). Milestones: implementation of CAMShift tracker, integrated into the system.

02/12/11 - 15/12/11
Add depth cues to the Viola-Jones face detector, start reading about the 3D hemisphere-fitting tracker. Milestones: depth cues added to the Viola-Jones detector. Clear understanding of 3D hemisphere-fitting tracker.

16/12/11 - 29/12/11
Implement the 3D hemisphere-fitting tracker and integrate it to the system to start tracking in parallel with CAMShift algorithm when Viola-Jones detects a face in the image. Start reading about the parallax motion simulation. Milestones: implementation of 3D hemisphere-fitting tracker, integrated into the system. Clear understanding of how the parallax motion could be simulated knowing the head's position.

30/12/11 - 12/01/12
Prepare the presentation for the progress meeting in January. Write progress report. Slack time in case any of the face detector/face trackers/progress report/progress presentation are not finished. Milestones: presentation for the progress meeting and a progress report.

13/01/12 - 26/01/12
Fully understand how the head's pixel and depth coordinates can be converted into its location in 3D space. Further research on how the parallax motion can be simulated from the head location in 3D. Start implementing an OpenGL scene which could be used to display the parallax motion effect. Milestones: basic implementation of an OpenGL scene.

27/01/12 - 09/02/12
Finish implementing an OpenGL scene. Slack time for any unfinished implementation details.


Milestones: finished implementation of an OpenGL scene. At this stage the overall system should be functional, i.e. it should combine the output from the face detector and face trackers to obtain the head's location in 3D and use it to simulate parallax motion on the display.

10/02/12 - 23/02/12
Start writing a dissertation. Come up with a structure, including sections, subheadings and short bullet points to be covered in each section. Milestones: basic structure of the dissertation.

24/02/12 - 09/03/12
Write the Introduction and Preparation sections. Get feedback from the supervisor/DoS. Milestones: complete Introduction and Preparation sections.

10/03/12 - 23/03/12
Milestones: Incorporate the feedback from the supervisor/DoS regarding the Introduction and Preparation sections. Write the Implementation section and send for feedback to supervisor/DoS.

24/03/12 - 06/04/12
Incorporate the feedback from the supervisor/DoS regarding the Implementation section. Gather the numerical data for Evaluation section. Slack time for finishing Introduction, Preparation and Implementation sections. Milestones: finished Introduction, Preparation and Implementation sections. Gathered data for Evaluation section.

07/04/12 - 20/04/12
Write the Evaluation section and send it for feedback to DoS/supervisor. Milestones: finished Evaluation section.

21/04/12 - 04/05/12
Incorporate the feedback for the Evaluation section and finish a draft dissertation. Send it for final feedback to supervisor/DoS. Milestones: finished draft dissertation.


05/05/12 - 18/05/12
Incorporate final feedback from supervisor/DoS and get the final version approved. Milestones: dissertation is finished, approved, bound and handed in before 18/05/2012.
