Body Recognition and Tracking:

Kinect RGB-D Camera
How the Kinect RGB-D Camera Works
Microsoft Kinect for Xbox 360
aka Kinect 1 (2010)
Color video camera + laser-projected IR dot pattern + IR camera
[Figure labels: IR laser projector, color camera, IR camera]

Many slides by D. Hoiem

640 x 480, 30 fps

What the Kinect Does
Compute Depth Image
Estimate body parts and joint poses
Application (e.g., game)

"2016 will be the year that we see interesting new applications of depth camera technology on mobile phones."
-- Chris Bishop, Director of Microsoft Research, Cambridge (2015)

How Kinect Works: Overview
[Diagram: IR Projector -> Projected Light Pattern -> IR Sensor -> Stereo Algorithm -> Depth Image -> Segmentation, Part Prediction -> Body parts and joint positions]

Stereo from Projected Dots
[Diagram: same pipeline as in the overview]

Stereo from Projected Dots
1. Overview of depth from stereo
2. How it works for a projector/sensor pair
3. Stereo algorithm used

Depth from Stereo Images
[Figure: image 1, image 2, dense depth map]
Some of the following slides adapted from Steve Seitz and Lana Lazebnik
Depth from Stereo Images
Goal: recover depth by finding image coordinate x' in Image 2 that corresponds to x in Image 1
[Figure: image points x and x', focal length f, camera centers C and C' separated by baseline B]

Basic Stereo Matching Algorithm
For each pixel in the first image
  Find corresponding epipolar line in the right image
  Examine all pixels on the epipolar line and pick the best match
  Triangulate the matches to get depth information

Depth from Disparity
[Figure: image points x and x', focal length f, depth z, camera centers O and O' separated by baseline B]
$\frac{x - x'}{O - O'} = \frac{f}{z}$
disparity $= x - x' = \frac{B \cdot f}{z}$, so $z = \frac{B \cdot f}{x - x'}$
Disparity is inversely proportional to depth, z

Basic Stereo Matching Algorithm
If necessary, rectify the two stereo images to transform epipolar lines into scanlines
For each pixel x in the first image
  Find corresponding epipolar scanline in the right image
  Examine all pixels on the scanline and pick the best match x'
  Compute disparity x-x' and set depth(x) = fB/(x-x')
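The relation depth(x) = fB/(x-x') can be checked with a small numeric sketch. The function name and the calibration numbers (focal length in pixels, baseline in metres) are hypothetical values chosen only for illustration, not Kinect specifications.

```python
def depth_from_disparity(x_left, x_right, focal_px, baseline_m):
    """Depth z = f * B / (x - x') for one matched pair on a rectified scanline."""
    disparity = x_left - x_right
    if disparity <= 0:
        raise ValueError("expected positive disparity (x > x')")
    return focal_px * baseline_m / disparity


# Hypothetical calibration values, for illustration only.
FOCAL_PX = 580.0
BASELINE_M = 0.075

z_near = depth_from_disparity(320.0, 291.0, FOCAL_PX, BASELINE_M)  # disparity 29 px
z_far = depth_from_disparity(320.0, 305.5, FOCAL_PX, BASELINE_M)   # disparity 14.5 px
print(z_near, z_far)
```

Halving the disparity (29 px to 14.5 px) doubles the computed depth, which is the inverse proportionality stated above.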
Correspondence Search
[Figure: left and right images, with matching cost plotted along the right scanline]
Slide a window along the right scanline and compare contents of that window with the reference window in the left image
Matching cost: SSD or normalized cross-correlation

Results of Window Search
[Figure panels: Data, Window-based matching, Ground truth]
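The window-based correspondence search with an SSD matching cost can be sketched for a single scanline as follows. This is a minimal illustration in plain Python, not the stereo algorithm any particular device uses; the function names, the toy rows, and the convention that the match for left pixel x lies at x - d in the right row are assumptions.

```python
def ssd(a, b):
    """Sum of squared differences between two equal-length windows."""
    return sum((p - q) ** 2 for p, q in zip(a, b))


def match_scanline(left_row, right_row, x, half_window, max_disparity):
    """Return (disparity, cost) minimizing SSD for the window centered at x."""
    ref = left_row[x - half_window : x + half_window + 1]
    best_d, best_cost = None, float("inf")
    for d in range(max_disparity + 1):
        xr = x - d  # candidate position in the right scanline
        if xr - half_window < 0:
            break  # window would fall off the left edge
        cand = right_row[xr - half_window : xr + half_window + 1]
        cost = ssd(ref, cand)
        if cost < best_cost:
            best_d, best_cost = d, cost
    return best_d, best_cost


# Toy data: the right row is the left row shifted by 3 pixels.
left_row = [0, 0, 0, 0, 0, 9, 4, 7, 1, 0, 0, 0]
right_row = left_row[3:] + [0, 0, 0]

result = match_scanline(left_row, right_row, x=6, half_window=1, max_disparity=5)
print(result)
```

On textureless rows every candidate window gives nearly the same cost, so the minimum is ambiguous; that is one of the failure cases listed for correspondence search.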

Improve by Adding Constraints and Solve with Graph Cuts
[Figure panels: Graph cuts, Ground truth]
Y. Boykov, O. Veksler, and R. Zabih, Fast Approximate Energy Minimization via Graph Cuts, PAMI 2001
For the latest and greatest:

Failures of Correspondence Search
Textureless surfaces
Occlusions, repeated structures
Non-Lambertian surfaces, specularities

Structured Light
Basic Principle
Use a projector to create known features in the 3D scene (e.g., points, lines)
Light projection
If we project distinctive points, matching is easy

Example: Book vs. No Book

Kinect's Projected Dot Pattern

Same Stereo Algorithms Apply
[Figure labels: Projector, Sensor]

Kinect RGB-D Camera

Implementation
In-camera ASIC computes 11-bit 640 x 480 depth map at 30 Hz
Range limit for tracking: 0.7 to 6 m (2.3 to 20 ft)
Practical range limit: 1.2 to 3.5 m

Kinect for Xbox One
aka Kinect 2 (2013)
Replaced Structured-Light Camera by Time-of-Flight Camera
Higher resolution (1080p), larger field of view, 30 fps camera
Depth resolution 2.5 cm at 4 m
Time-of-Flight Depth Sensing
[Diagram: a light source emits a pulse toward the scene at distance d; a receiver/sensor detects the returned pulse; a stop-watch measures the time delay t]
depth $= c\,t / 2$, where c = speed of light and t = round-trip time delay

Kinect 2's Time of Flight Sensor
Kinect 2 uses multiple measurements (3 pulse frequencies x 3 amplitudes) to compute at each pixel:
The amount of reflected light originating from the active light source (called the "active image")
The depth of the scene from the phase shifts for the multiple measurements (which disambiguate the depth)
The amount of ambient light

Impulse Time-of-Flight Imaging
[Figure: emitted pulse, received pulse, time delay t]
[Koechner, 1968]
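For impulse time-of-flight, the pulse travels to the scene and back, so depth is half the round-trip distance. The sketch below only illustrates that arithmetic; the function names are hypothetical, and it does not model the phase-based measurement the Kinect 2 uses.

```python
SPEED_OF_LIGHT_M_PER_S = 299_792_458.0


def depth_from_round_trip(delay_s):
    """Depth in metres from a round-trip time delay: depth = c * t / 2."""
    return SPEED_OF_LIGHT_M_PER_S * delay_s / 2.0


def round_trip_delay(depth_m):
    """Inverse relation: time for light to reach depth_m and return."""
    return 2.0 * depth_m / SPEED_OF_LIGHT_M_PER_S


delay_at_4m = round_trip_delay(4.0)
print(delay_at_4m * 1e9, "ns")  # a few tens of nanoseconds
```

A surface 4 m away returns the pulse after roughly 27 ns, which is why the "stop-watch" has to resolve very small time differences to separate depths a few centimetres apart.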

Part 2: Pose from Depth
[Diagram: IR Projector -> Projected Light Pattern -> IR Sensor -> Depth Image -> Part Prediction -> Body parts and joint positions]

Goal: Estimate Pose from Depth Image
Real-Time Human Pose Recognition in Parts from a Single Depth Image, J. Shotton, A. Fitzgibbon, M. Cook, T. Sharp, M. Finocchio, R. Moore, A. Kipman, and A. Blake, Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2011
Goal: Estimate Pose from Depth Image
Step 1. Find body parts
Step 2. Compute joint positions

Challenges
Lots of variation in bodies, orientations, poses
Needs to be very fast (their algorithm runs at 200 fps on the Xbox 360 GPU)

Pose Examples

RGB Depth Part Label Map Joint Positions

Examples of
one part

Finding Body Parts
What should we use for a feature?
Difference in depth
What should we use for a classifier?
Random Forest / Decision Forest

Extract Body Pixels by Thresholding Depth
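Extracting body pixels by thresholding depth can be sketched as a simple range test per pixel. This is a minimal illustration, not the Kinect's actual segmentation: the function name `body_mask` is hypothetical, and the near/far limits reuse the 1.2 to 3.5 m practical range quoted earlier purely as example values.

```python
def body_mask(depth, near_m, far_m):
    """Mark pixels whose depth lies inside [near_m, far_m] as candidate body pixels."""
    return [[near_m <= d <= far_m for d in row] for row in depth]


# Toy 2 x 3 depth image in metres (0.0 stands for "no reading", an assumption).
depth = [
    [0.0, 2.0, 2.1],
    [5.0, 1.9, 0.5],
]

mask = body_mask(depth, near_m=1.2, far_m=3.5)
print(mask)
```

A fixed range keeps anything at the right distance, including furniture, so a real system would need more than this single test.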
Features
Difference of depth at two pixels
Offset is scaled by depth at reference pixel
$d_I(x)$ is depth image, $\theta = (u, v)$ is offset to second pixel

Part Classification with Random Forests
Random Forest: collection of independently-trained binary decision trees
Each tree is a classifier that predicts the likelihood of a pixel x belonging to body part class c
Non-leaf node corresponds to a thresholded feature
Leaf node corresponds to a conjunction of several features
At leaf node store learned distribution P(c|I, x)
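The depth-difference feature described under Features can be sketched as follows. The sketch assumes the two-offset form from the Shotton et al. paper cited above, f = d_I(x + u/d_I(x)) - d_I(x + v/d_I(x)); the function names, the (row, column) pixel convention, and the large constant returned for probes that leave the image are assumptions made for illustration.

```python
BACKGROUND_DEPTH = 1e6  # assumed large value for probes outside the image


def depth_at(depth, px):
    """Depth at pixel (row, col), or BACKGROUND_DEPTH when out of bounds."""
    r, c = px
    if 0 <= r < len(depth) and 0 <= c < len(depth[0]):
        return depth[r][c]
    return BACKGROUND_DEPTH


def depth_feature(depth, x, u, v):
    """Difference of depth at two probes whose offsets are scaled by 1 / d_I(x)."""
    d = depth[x[0]][x[1]]  # depth at the reference pixel, assumed > 0
    p1 = (x[0] + round(u[0] / d), x[1] + round(u[1] / d))
    p2 = (x[0] + round(v[0] / d), x[1] + round(v[1] / d))
    return depth_at(depth, p1) - depth_at(depth, p2)


# Toy 5 x 5 depth image: a flat surface at 2 m with one farther pixel.
depth = [[2.0] * 5 for _ in range(5)]
depth[2][4] = 4.0

inside = depth_feature(depth, (2, 2), (0.0, 0.0), (0.0, 4.0))   # probe lands on (2, 4)
outside = depth_feature(depth, (2, 2), (0.0, 0.0), (0.0, 8.0))  # probe leaves the image
print(inside, outside)
```

Dividing the offsets by the reference depth makes the probe pattern shrink for distant bodies and grow for near ones, so the same feature responds similarly regardless of how far the person stands from the camera.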

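Classification
Testing Phase:
1. Classify each pixel x in image I using all decision trees and average the results at the leaves:
$P(c \mid I, x) = \frac{1}{T} \sum_{t=1}^{T} P_t(c \mid I, x)$, where T is the number of trees and $P_t$ is the distribution stored at the leaf reached in tree t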
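The testing phase can be sketched as: walk each tree to a leaf, read the stored distribution, and average over trees. The nested-dictionary tree layout, the function names, and the toy probabilities below are hypothetical and serve only to illustrate the averaging step.

```python
def leaf_distribution(tree, feature_values):
    """Follow thresholded features from the root down to a leaf distribution."""
    node = tree
    while "dist" not in node:
        go_left = feature_values[node["feature"]] < node["threshold"]
        node = node["left"] if go_left else node["right"]
    return node["dist"]


def forest_posterior(trees, feature_values):
    """Average the per-tree leaf distributions over body-part classes."""
    totals = {}
    for tree in trees:
        for label, p in leaf_distribution(tree, feature_values).items():
            totals[label] = totals.get(label, 0.0) + p
    return {label: total / len(trees) for label, total in totals.items()}


# Two hypothetical depth-1 trees over two precomputed feature values.
tree_a = {"feature": 0, "threshold": 0.0,
          "left": {"dist": {"hand": 0.8, "head": 0.2}},
          "right": {"dist": {"hand": 0.1, "head": 0.9}}}
tree_b = {"feature": 1, "threshold": 1.0,
          "left": {"dist": {"hand": 0.6, "head": 0.4}},
          "right": {"dist": {"hand": 0.2, "head": 0.8}}}

posterior = forest_posterior([tree_a, tree_b], [-0.5, 0.5])
best_label = max(posterior, key=posterior.get)
print(posterior, best_label)
```

Each pixel is classified independently, which is what makes this step easy to run in parallel.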

Learning Phase:
1. For each tree, pick a randomly sampled subset of training data
2. Randomly choose a set of features and thresholds at each node
3. Pick the feature and threshold that give the largest information gain
4. Recurse until a certain accuracy is reached or the maximum tree depth is reached
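Step 3 of the learning phase, choosing the candidate with the largest information gain, can be sketched for one node as below. This is a minimal illustration with Shannon entropy over class labels; the function names, the representation of each training pixel as a list of precomputed feature values, and the toy data are assumptions.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (bits) of a list of class labels."""
    n = len(labels)
    return -sum((k / n) * math.log2(k / n) for k in Counter(labels).values())


def information_gain(labels, goes_left):
    """Entropy reduction from splitting labels by the boolean list goes_left."""
    left = [lab for lab, g in zip(labels, goes_left) if g]
    right = [lab for lab, g in zip(labels, goes_left) if not g]
    if not left or not right:
        return 0.0  # a split that separates nothing gains nothing
    n = len(labels)
    return (entropy(labels)
            - len(left) / n * entropy(left)
            - len(right) / n * entropy(right))


def best_split(samples, labels, candidates):
    """Pick the (feature_index, threshold) candidate with the largest gain."""
    def gain(candidate):
        feature, threshold = candidate
        return information_gain(labels, [s[feature] < threshold for s in samples])
    return max(candidates, key=gain)


# Toy node: four training pixels, two feature values each.
samples = [[-1.0, 5.0], [-2.0, 1.0], [3.0, 4.0], [2.0, 0.0]]
labels = ["hand", "hand", "head", "head"]
candidates = [(0, 0.0), (1, 2.0), (0, 2.5)]  # randomly drawn in practice

chosen = best_split(samples, labels, candidates)
print(chosen)
```

Recursing on the left and right subsets with fresh random candidates, until a depth or purity limit is hit, gives one tree of the forest.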
Implementation
31 body parts
3 trees (depth 20)
300,000 training images per tree randomly selected from 1M training images
2,000 training example pixels per image
2,000 candidate features
50 candidate thresholds per feature
Decision forest constructed in 1 day on 1,000-core cluster

Get Lots of Training Data
Capture and sample 500K motion capture frames of people kicking, driving, dancing, etc.
Get 3D models for 15 bodies with a variety of weights, heights, etc.
Synthesize motion capture data for all 15 body types

Step 2: Joint Position Estimation
Joints are estimated using the mean-shift clustering algorithm applied to the labeled pixels
Gaussian-weighted density estimator for each body part to find its mode 3D position
"Push back in depth" each cluster mode to lie at approx. center of the body part

Results
73% joint prediction accuracy (on head, shoulders, elbows, hands)
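The mean-shift step in Step 2 can be sketched as repeatedly moving an estimate to the Gaussian-weighted mean of the 3D points labeled with one body part, then pushing the resulting mode back in depth. This is a generic mean-shift sketch under stated assumptions, not the paper's exact estimator: the function name, `bandwidth`, the per-point weights, and the `push_back_m` offset are hypothetical.

```python
import math


def mean_shift_mode(points, weights, start, bandwidth, iterations=50, tol=1e-6):
    """Find a density mode of weighted 3D points with a Gaussian kernel."""
    mode = list(start)
    for _ in range(iterations):
        num = [0.0, 0.0, 0.0]
        den = 0.0
        for p, w in zip(points, weights):
            sq_dist = sum((pi - mi) ** 2 for pi, mi in zip(p, mode))
            k = w * math.exp(-sq_dist / bandwidth ** 2)
            den += k
            for i in range(3):
                num[i] += k * p[i]
        if den == 0.0:
            break  # start is too far from every point for this bandwidth
        new_mode = [n / den for n in num]
        shift = math.sqrt(sum((a - b) ** 2 for a, b in zip(new_mode, mode)))
        mode = new_mode
        if shift < tol:
            break
    return mode


# Toy data (metres): four surface points near (0, 0, 2) plus one outlier.
points = [(0.01, 0.0, 2.0), (-0.01, 0.0, 2.0),
          (0.0, 0.01, 2.0), (0.0, -0.01, 2.0),
          (1.0, 1.0, 3.0)]
weights = [1.0] * len(points)

mode = mean_shift_mode(points, weights, start=(0.02, 0.0, 2.0), bandwidth=0.1)

push_back_m = 0.04  # hypothetical offset from the body surface to the part center
joint = (mode[0], mode[1], mode[2] + push_back_m)
print(joint)
```

Because the kernel weight falls off quickly with distance, the outlier has almost no influence on the mode, which is the practical advantage over taking a plain average of the labeled pixels.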

Cameras for Tracking

Leap Motion
2 x 2 x 2 volume
2015, $80

