
CS223A

Vision in Robotics

Professor Silvio Savarese


Computational Vision and Geometry Lab

Silvio Savarese

21-Jan-15

Sensing is the future

Everything is a sensor

Modern vision sensors

[Image panels: w/ gravity, night, thermal, Kinect]

Sensing is not the hard problem

Intelligent understanding of the sensing data is the challenge!
What does intelligent understanding of the sensing data mean?

Computer vision
Computer vision studies the tools and theories that enable the design of machines
that can extract useful information from image data
(images and videos), toward the goal of interpreting the world

Sensing device → Computational device → Extract information → Interpretation

Information: features, 3D structure, motion flows, etc.

Interpretation: recognize objects, scenes, actions, events

Computer vision and Applications

[Timeline: 1990 – 2000 – 2010; EosSystems]

Fingerprint biometrics

Augmentation with 3D computer graphics

3D object prototyping (EosSystems Photomodeler)

Computer vision and Applications

New feature detectors/descriptors
CV leverages machine learning

[Timeline: 1990 – 2000 – 2010; EosSystems, Autostich]

Face detection

Web applications (Photometria)

Panoramic Photography (kolor)

3D modeling of landmarks

Computer vision and Applications

Large scale image matching

Efficient SLAM/SFM
Better clouds
More bandwidth
Increased computational power

[Timeline: 1990 – 2000 – 2010; EosSystems, Autostich, Kinect, Google Goggles, A9, Kooaba]

Image search engines: movies, news, sports

Visual search and landmarks recognition (Google Goggles)

Augmented reality

Motion sensing and gesture recognition

Automotive safety

Mobileye: Vision systems in high-end BMW, GM, Volvo models

Source: A. Shashua, S. Seitz

Computer vision and Applications

Factory inspection

Assistive technologies

Surveillance

Vision for robotics, space exploration

Autonomous driving, robot navigation

Security

Sources: K. Grauman, L. Fei-Fei, S. Lazebnik

Computer vision and Applications

[Timeline: 1990 – 2000 – 2010; EosSystems, Autostich, Kinect, Google Goggles, A9, Kooaba]

Computer vision and Applications

[Timeline: 1990 – 2000 – 2010; from 2D (EosSystems) toward 3D (Google Goggles)]

Computer vision

2D Recognition

3D Reconstruction

3D shape recovery
3D scene reconstruction
Camera localization
Pose estimation

Object detection
Texture classification
Target tracking
Activity recognition


Camera systems
Establish a mapping from 3D to 2D

Pinhole camera
Pinhole perspective projection

f = focal length
c = center of the camera

(x, y, z) → (f x/z, f y/z)
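The projection equation above can be sketched in a few lines of NumPy (a hypothetical helper, not part of the lecture material):

```python
import numpy as np

def project_pinhole(points, f):
    """Pinhole perspective projection: (x, y, z) -> (f*x/z, f*y/z)."""
    points = np.asarray(points, dtype=float)
    return f * points[..., :2] / points[..., 2:3]

# A point at depth z = 2 with f = 1 projects to half its x, y coordinates.
p = project_pinhole([4.0, 2.0, 2.0], f=1.0)
```

Doubling the depth of a point halves its image coordinates, which is exactly the "distant objects look smaller" property discussed below.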

Projective camera

P = M Pw = K [R T] Pw

Internal parameters K:
f = focal length
u0, v0 = offset
α, β = scale factors (non-square pixels)
θ = skew angle

External parameters:
R, T = rotation, translation

Properties of Projection
Points project to points
Lines project to lines
Distant objects look smaller

Properties of Projection
Angles are not preserved
Parallel lines meet!

Parallel lines in the world intersect in the image at a vanishing point

One-point perspective
Masaccio, Trinity,
Santa Maria
Novella, Florence,
1425-28

How to calibrate a camera

Estimate camera parameters such as pose or focal length

Calibration Problem
Calibration rig

P1 … Pn with known positions in [Ow, iw, jw, kw]
p1 … pn known positions in the image
Goal: compute intrinsic and extrinsic parameters

How many correspondences do we need?
M has 11 unknowns → we need 11 equations → 6 correspondences would do it (each correspondence gives 2 equations)
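The counting argument above is what the Direct Linear Transform implements: each correspondence contributes two linear equations in the entries of M, and the stacked system is solved by SVD. A minimal sketch, with a hypothetical K, pose, and rig (not the lecture's toolbox):

```python
import numpy as np

def calibrate_dlt(Pw, p):
    """Direct Linear Transform: estimate the 3x4 projection matrix M
    (up to scale) from n >= 6 world-to-image correspondences."""
    A = []
    for (X, Y, Z), (u, v) in zip(Pw, p):
        A.append([X, Y, Z, 1, 0, 0, 0, 0, -u*X, -u*Y, -u*Z, -u])
        A.append([0, 0, 0, 0, X, Y, Z, 1, -v*X, -v*Y, -v*Z, -v])
    # Null-space solution: right singular vector of the smallest singular value.
    _, _, Vt = np.linalg.svd(np.asarray(A, dtype=float))
    return Vt[-1].reshape(3, 4)

def project(M, X):
    ph = M @ np.append(X, 1.0)
    return ph[:2] / ph[2]

# Hypothetical ground truth: f = 800, principal point (320, 240), rig at z = 5.
K = np.array([[800, 0, 320], [0, 800, 240], [0, 0, 1.0]])
M_true = K @ np.hstack([np.eye(3), np.array([[0], [0], [5.0]])])
rig = np.array([[0, 0, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1],
                [1, 1, 0], [1, 0, 1], [0, 1, 1], [1, 1, 1]], dtype=float)
obs = [project(M_true, X) for X in rig]
M_est = calibrate_dlt(rig, obs)
```

Note the rig points must be non-coplanar (here: cube corners), otherwise the system is degenerate.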

Calibration Procedure
Camera Calibration Toolbox for Matlab
J. Bouguet [1998-2000]
http://www.vision.caltech.edu/bouguetj/calib_doc/index.html#examples

Once the camera is calibrated...

Pinhole perspective projection

M = K [R T]

- Internal parameters K are known
- R, T are known, but these only relate C to the calibration rig

Can I estimate P from the measurement p from a single image?
No - in general, P can be anywhere along the line defined by C and p

Recovering structure from a single view

Pinhole perspective projection
[Figure: scene point P, camera C with known K, calibration rig]

Why is it so difficult?
Intrinsic ambiguity of the mapping from 3D to image (2D)

Recovering structure from a single view

Courtesy slide S. Lazebnik

Two eyes help!

[Figure: point X triangulated from image points x1, x2 in two views with known K, related by R, T; camera centers O1, O2]

This is called triangulation

Triangulation
Find X that minimizes

d(x1, M1 X)^2 + d(x2, M2 X)^2
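A common way to get a (good) approximate X is linear DLT triangulation, which minimizes an algebraic error as a surrogate for the geometric distance above. A sketch with two hypothetical calibrated cameras:

```python
import numpy as np

def triangulate(M1, M2, x1, x2):
    """Linear (DLT) triangulation of a 3D point from two views.
    Minimizes an algebraic error, a standard surrogate for the
    geometric objective d(x1, M1 X)^2 + d(x2, M2 X)^2."""
    A = np.vstack([x1[0] * M1[2] - M1[0],
                   x1[1] * M1[2] - M1[1],
                   x2[0] * M2[2] - M2[0],
                   x2[1] * M2[2] - M2[1]])
    _, _, Vt = np.linalg.svd(A)
    Xh = Vt[-1]                      # null vector, homogeneous X
    return Xh[:3] / Xh[3]

# Two cameras separated by a unit baseline along x (K = I for simplicity).
M1 = np.hstack([np.eye(3), np.zeros((3, 1))])
M2 = np.hstack([np.eye(3), np.array([[-1.0], [0.0], [0.0]])])
X_true = np.array([0.2, 0.1, 4.0])
h1 = M1 @ np.append(X_true, 1.0); x1 = h1[:2] / h1[2]
h2 = M2 @ np.append(X_true, 1.0); x2 = h2[:2] / h2[2]
X_rec = triangulate(M1, M2, x1, x2)
```

With noisy measurements the nonlinear objective on the slide would be minimized starting from this linear estimate.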

Stereo-view geometry
Correspondence: Given a point in one image, how can I find the corresponding point in the other one?
Camera geometry: Given corresponding points in two images, find the camera matrices, position and pose.
Scene geometry: Find the coordinates of a 3D point from its projections into 2 or multiple images.

Epipolar geometry
X

x1

x2
e1

e2
O2

O1
Epipolar Plane
Baseline
Epipolar Lines

Epipoles e1, e2
= intersections of baseline with image planes
= projections of the other camera center

Example: Converging image planes

Example: Parallel image planes


X

e1

x1

O1

x2

e2

O2
Baseline intersects the image plane at infinity
Epipoles are at infinity
Epipolar lines are parallel to x axis

Example: Parallel Image Planes

e1, e2 at infinity

Epipolar Constraint

p1^T F p2 = 0

F p2 is the epipolar line associated with p2 (l1 = F p2)
F^T p1 is the epipolar line associated with p1 (l2 = F^T p1)
F e2 = 0 and F^T e1 = 0
F is a 3x3 matrix; 7 DOF
F is singular (rank two)
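The constraint can be checked numerically on a toy two-view setup. The sketch below is hypothetical (calibrated cameras with K = I, so F reduces to the essential matrix E = [t]× R; beware that the p1/p2 ordering convention varies across texts):

```python
import numpy as np

def skew(t):
    """Cross-product matrix [t]x such that skew(t) @ v == np.cross(t, v)."""
    return np.array([[0, -t[2], t[1]],
                     [t[2], 0, -t[0]],
                     [-t[1], t[0], 0]])

R = np.eye(3)                          # no relative rotation
t = np.array([-1.0, 0.0, 0.0])         # unit baseline along x
E = skew(t) @ R                        # essential matrix (F with K = I)

X = np.array([0.3, -0.2, 4.0])         # scene point in camera-1 frame
x1 = np.append(X[:2] / X[2], 1.0)      # homogeneous image point, view 1
X2 = R @ X + t
x2 = np.append(X2[:2] / X2[2], 1.0)    # homogeneous image point, view 2

residual = x2 @ E @ x1                 # epipolar constraint: should be ~0
l2 = E @ x1                            # epipolar line of x1 in image 2
on_line = l2 @ x2                      # x2 lies on l2: dot product ~0
```

This is exactly the point of the slide: the line l2 restricts the correspondence search for x1 to a 1D set in the other image.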

Why is F useful?

l = F^T x

- Suppose F is known
- No additional information about the scene and camera is given
- Given a point in the left image, how can I find the corresponding point in the right image?

Why is F useful?
F captures information about the epipolar geometry of 2 views + camera parameters
MORE IMPORTANTLY: F gives constraints on how the scene changes under viewpoint transformation (without reconstructing the scene!)
Powerful tool in:
- 3D reconstruction
- Multi-view object/scene matching

Multiple view geometry

Structure from motion problem

[Figure: n points Xj viewed by m cameras M1 … Mm, with image measurements xij]

Given m images of n fixed 3D points:

xij = Mi Xj ,  i = 1, …, m,  j = 1, …, n

Structure from motion problem

From the m×n correspondences xij, estimate:
- m projection matrices Mi (motion)
- n 3D points Xj (structure)

Structure from motion ambiguity

SFM can be solved only up to an N-degree-of-freedom ambiguity.
In the general case (nothing is known) the ambiguity is expressed by an arbitrary affine or projective transformation:

Mi = Ki [Ri  Ti]

xj = Mi Xj = (Mi H^-1)(H Xj)
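The equation above says the images cannot distinguish the true solution from a transformed one. A two-line numerical check (random camera, point, and transform; purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

M = rng.standard_normal((3, 4))            # some projection matrix
X = np.append(rng.standard_normal(3), 1.0) # homogeneous 3D point
H = rng.standard_normal((4, 4)) + 5 * np.eye(4)  # arbitrary invertible 4x4

x  = M @ X                                  # original projection
x2 = (M @ np.linalg.inv(H)) @ (H @ X)       # transformed cameras + structure
# x and x2 are identical: the ambiguity is invisible in the measurements.
```

Priors on the cameras or scene (next slides) are what break this ambiguity.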

Affine ambiguity

Projective ambiguity

Self-calibration
Prior knowledge on cameras or scene can be used to add
constraints and remove ambiguities
Obtain metric reconstruction (up to scale)

Condition → N. views required

- Constant internal parameters
- Aspect ratio and skew known; focal length and offset vary
- Skew = 0, all other parameters vary

Bundle adjustment
Non-linear method for refining structure and motion
Minimizing re-projection error
It can be used before or after metric upgrade

E(M, X) = Σ_{i=1..m} Σ_{j=1..n} D(x_ij, M_i X_j)^2

Bundle adjustment
Non-linear method for refining structure and motion
Minimizing re-projection error
It can be used before or after metric upgrade

E(M, X) = Σ_{i=1..m} Σ_{j=1..n} D(x_ij, M_i X_j)^2

Advantages
- Handles a large number of views
- Handles missing data

Limitations
- Large minimization problem (parameters grow with the number of views)
- Requires a good initial condition
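The objective E(M, X) is straightforward to write down; a real bundle adjuster would minimize it with a sparse Levenberg-Marquardt solver. A toy sketch (hypothetical two-camera, three-point scene) showing that the true geometry is the minimizer:

```python
import numpy as np

def reprojection_error(Ms, Xs, xs):
    """E(M, X) = sum_ij || x_ij - proj(M_i X_j) ||^2, the slide's objective.
    Ms: list of 3x4 matrices; Xs: (n, 3) points; xs: (m, n, 2) measurements."""
    E = 0.0
    for i, M in enumerate(Ms):
        for j, X in enumerate(Xs):
            ph = M @ np.append(X, 1.0)
            E += np.sum((xs[i, j] - ph[:2] / ph[2]) ** 2)
    return E

Ms = [np.hstack([np.eye(3), np.zeros((3, 1))]),
      np.hstack([np.eye(3), np.array([[-1.0], [0.0], [0.0]])])]
Xs = np.array([[0.1, 0.2, 4.0], [-0.3, 0.1, 5.0], [0.2, -0.1, 6.0]])
xs = np.array([[(M @ np.append(X, 1.0))[:2] / (M @ np.append(X, 1.0))[2]
                for X in Xs] for M in Ms])

E0 = reprojection_error(Ms, Xs, xs)          # zero at the true solution
E1 = reprojection_error(Ms, Xs + 0.05, xs)   # perturbed structure: larger
```

The "requires a good initial condition" limitation follows directly: this objective is non-convex in (M, X), so the solver only refines a nearby initialization.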


Results and applications

Courtesy of Oxford Visual Geometry Group

Lucas & Kanade, 81
Chen & Medioni, 92
Debevec et al., 96
Levoy & Hanrahan, 96
Fitzgibbon & Zisserman, 98
Triggs et al., 99
Pollefeys et al., 99
Kutulakos & Seitz, 99
Levoy et al., 00
Hartley & Zisserman, 00
Dellaert et al., 00
Rusinkiewicz et al., 02
Nistér, 04
Brown & Lowe, 04
Schindler et al., 04
Lourakis & Argyros, 04
Colombo et al., 05
Golparvar-Fard et al., JAEI 10
Pandey et al., IFAC 2010
Pandey et al., ICRA 2011
Microsoft's Photosynth
Snavely et al., 06-08
Schindler et al., 08
Agarwal et al., 09
Frahm et al., 10

Results and applications


M. Pollefeys et al 98---

Results and applications

Noah Snavely, Steven M. Seitz, Richard Szeliski, "Photo tourism: Exploring photo collections in 3D," ACM Transactions on Graphics (SIGGRAPH Proceedings), 2006

Computer vision

2D Recognition

3D Reconstruction

3D shape recovery
3D scene reconstruction
Camera localization
Pose estimation

Object detection
Texture classification
Target tracking
Activity recognition


Classification:
Does this image contain a building? [yes/no]

Yes!

Classification:
Is this a beach?

Image Search

Organizing photo collections

Detection:
Does this image contain a car? [where?]

car

Detection:
Which object does this image contain? [where?]
Building

clock

person

car

Detection:
Accurate localization (segmentation)

clock

Object detection is useful

Assistive technologies

Surveillance

Computational photography

Security

Assistive driving

Categorization vs Single instance recognition

Which building is this? The Marshall Field building in Chicago

Categorization vs Single instance recognition

Where is the crunchy nut?

Applications of computer vision


Recognizing landmarks in
mobile platforms

+ GPS

Detection: Estimating object semantic & geometric attributes

Object: Building, 45° pose, 8-10 meters away; it has bricks

Object: Person, back; 1-2 meters away

Object: Police car, side view, 4-5 m away

Activity or Event recognition


What are these people doing?

Visual Recognition
Design algorithms that are capable of:
- Classifying images or videos
- Detecting and localizing objects
- Estimating semantic and geometric attributes
- Classifying human activities and events

Why is this challenging?

How many object categories are there?

Challenges: viewpoint variation

Michelangelo 1475-1564

slide credit: Fei-Fei, Fergus & Torralba

Challenges: illumination

image credit: J. Koenderink

Challenges: scale

slide credit: Fei-Fei, Fergus & Torralba

Challenges: deformation

Challenges:
occlusion

Magritte, 1957

slide credit: Fei-Fei, Fergus & Torralba

Challenges: background clutter

Kilmeny Niland. 1995

Challenges: intra-class variation

Basic properties
Representation
How to represent an object category; which
classification scheme?

Learning
How to learn the classifier, given training data

Recognition
How the classifier is to be used on novel data

Representation
- Building blocks: Sampling strategies

Interest operators
Multiple interest operators
Dense, uniformly
Randomly

Image credits: F-F. Li, E. Nowak, J. Sivic

Representation
- Building blocks: Choice of descriptors [SIFT, HOG, codewords, …]

Representation
Appearance only or location and appearance

Representation
Invariances
View point
Illumination
Occlusion
Scale
Deformation
Clutter
etc.

Representation
To handle intra-class variability, it is convenient to describe an object category using probabilistic models
Object models: Generative vs Discriminative vs hybrid

Object categorization:
the statistical viewpoint

p(zebra | image)  vs.  p(no zebra | image)

Bayes rule:

p(A|B) = p(B|A) p(A) / p(B)

p(zebra | image)      p(image | zebra)     p(zebra)
------------------ = ------------------ × -----------
p(no zebra | image)   p(image | no zebra)  p(no zebra)

posterior ratio = likelihood ratio × prior ratio
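Plugging in numbers makes the decomposition concrete. The probabilities below are invented purely for illustration:

```python
# Hypothetical likelihoods and priors for the zebra example.
p_img_given_zebra = 0.20     # p(image | zebra)
p_img_given_no    = 0.01     # p(image | no zebra)
p_zebra, p_no     = 0.05, 0.95

likelihood_ratio = p_img_given_zebra / p_img_given_no   # 20
prior_ratio      = p_zebra / p_no                       # ~0.053
posterior_ratio  = likelihood_ratio * prior_ratio
# posterior_ratio > 1 -> decide "zebra", despite the small prior.
```

A strong likelihood ratio can overcome a skewed prior; with a weaker likelihood ratio (say 10) the same prior would flip the decision.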

Object categorization:
the statistical viewpoint

Discriminative methods model the posterior
Generative methods model the likelihood and prior

Bayes rule:

p(zebra | image)      p(image | zebra)     p(zebra)
------------------ = ------------------ × -----------
p(no zebra | image)   p(image | no zebra)  p(no zebra)

posterior ratio = likelihood ratio × prior ratio

Discriminative models

Neural networks
- LeCun, Bottou, Bengio, Haffner 1998
- Rowley, Baluja, Kanade 1998

Nearest neighbor (10^6 examples)
- Shakhnarovich, Viola, Darrell 2003
- Berg, Berg, Malik 2005...

Support Vector Machines
- Guyon, Vapnik; Heisele, Serre, Poggio

Latent SVM / Structural SVM
- Felzenszwalb 00
- Ramanan 03

Boosting
- Viola, Jones 2001
- Torralba et al. 2004
- Opelt et al. 2006

Courtesy of Vittorio Ferrari
Slide credit: Kristen Grauman
Slide adapted from Antonio Torralba

Generative models

Naïve Bayes classifier
- Csurka, Bray, Dance & Fan, 2004

Hierarchical Bayesian topic models (e.g. pLSA and LDA)
- Object categorization: Sivic et al. 2005, Sudderth et al. 2005
- Natural scene categorization: Fei-Fei et al. 2005

2D part-based models
- Constellation models: Weber et al 2000; Fergus et al 200
- Star models: ISM (Leibe et al 05)

3D part-based models
- multi-aspects: Sun et al, 2009

Basic properties
Representation
How to represent an object category; which
classification scheme?

Learning
How to learn the classifier, given training data

Recognition
How the classifier is to be used on novel data


Learning
Learning parameters: What are you maximizing? Likelihood (Gen.) or performance on a train/validation set (Disc.)
Level of supervision
- Manual segmentation; bounding box; image labels; noisy labels
Batch/incremental
Priors
Training images:
- Issue of overfitting
- Negative images for discriminative methods

Basic properties
Representation
How to represent an object category; which
classification scheme?

Learning
How to learn the classifier, given training data

Recognition
How the classifier is to be used on novel data

Recognition
Recognition task: classification, detection, etc.

Recognition
Recognition task
Search strategy: Sliding Windows (Viola, Jones 2001)
+ Simple
- Computational complexity (x, y, S, …, N of classes)
  (BSW by Lampert et al 08; also Alexe et al 10)


Localization
Objects are not boxes

Segmentation
Bottom up segmentation

Malik et al. 01
Maire et al. 08
Felzenszwalb and Huttenlocher, 2004

Semantic segmentation

Duygulu et al. 02


Localization
- Objects are not boxes
- Prone to false positives
Non-max suppression:
- Canny 86
- Desai et al, 2009
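The greedy variant of non-max suppression used in most sliding-window detectors is short enough to sketch (hypothetical helper; boxes as [x1, y1, x2, y2]):

```python
import numpy as np

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy non-max suppression: keep the highest-scoring box, drop
    neighbors overlapping it by more than iou_thresh, repeat."""
    area = lambda b: (b[:, 2] - b[:, 0]) * (b[:, 3] - b[:, 1])
    order = np.argsort(scores)[::-1]
    keep = []
    while order.size:
        i = order[0]
        keep.append(int(i))
        # Intersection of box i with the remaining boxes
        x1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        y1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        x2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        y2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        iou = inter / (area(boxes[i:i+1])[0] + area(boxes[order[1:]]) - inter)
        order = order[1:][iou <= iou_thresh]
    return keep

boxes = np.array([[0, 0, 10, 10], [1, 1, 11, 11], [20, 20, 30, 30]], float)
scores = np.array([0.9, 0.8, 0.7])
kept = nms(boxes, scores)   # the two overlapping detections collapse to one
```

This is exactly the fix for the "prone to false positives" issue above: a sliding window fires on many nearby positions, and NMS keeps one detection per object.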

Successful methods using sliding windows

- Subdivide scanning window
- In each cell compute histogram of gradient orientations
Code available: http://pascal.inrialpes.fr/soft/olt/
[Dalal & Triggs, CVPR 2005]

- Subdivide scanning window
- In each cell compute histogram of codewords of adjacent segments
Code available: http://www.vision.ee.ethz.ch/~calvin
[Ferrari et al, PAMI 2008]
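The per-cell histogram at the core of the Dalal & Triggs descriptor can be sketched in NumPy (a simplified, hypothetical version: unsigned orientations, no block normalization or interpolation as in the full HOG pipeline):

```python
import numpy as np

def cell_orientation_histogram(patch, n_bins=9):
    """Magnitude-weighted histogram of unsigned gradient orientations
    for one cell - the core of a HOG-style descriptor."""
    gy, gx = np.gradient(patch.astype(float))
    mag = np.hypot(gx, gy)
    ang = np.mod(np.arctan2(gy, gx), np.pi)            # unsigned, [0, pi)
    bins = np.minimum((ang / np.pi * n_bins).astype(int), n_bins - 1)
    hist = np.bincount(bins.ravel(), weights=mag.ravel(), minlength=n_bins)
    return hist / (np.linalg.norm(hist) + 1e-9)        # L2-normalize

# A vertical step edge has purely horizontal gradients (orientation ~ 0),
# so all the histogram mass falls in the first bin.
patch = np.zeros((8, 8)); patch[:, 4:] = 1.0
h = cell_orientation_histogram(patch)
```

The full detector concatenates such histograms over all cells of the scanning window and feeds the vector to a linear SVM.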

Recognition
Recognition task
Search strategy: Probabilistic heat maps
- Fergus et al 03
- Leibe et al 04

Recognition
Recognition task
Search strategy :
Hypothesis generation + verification

Recognition
Recognition task
Search strategy
Attributes

- It has metal
- it is glossy
- has wheels
Farhadi et al 09
Lampert et al 09
Wang & Forsyth 09

Savarese, 2007
Sun et al 2009
Liebelt et al., 08, 10
Farhadi et al 09

Category: car
Azimuth = 225°
Zenith = 30°

Recognizing 3D objects
Xiang & Savarese, 2012-2014

[Detected categories: BED, CHAIR, CAR, TABLE]

Recognition
Recognition task
Search strategy
Attributes
Context
Semantic:
Torralba et al 03
Rabinovich et al 07
Gupta & Davis 08
Heitz & Koller 08
L-J Li et al 08
Bang & Fei-Fei 10

Geometric
Hoiem, et al 06
Gould et al 09
Bao, Sun, Savarese 10

Recognition in context

Labelme dataset [Russell et al., 08]

Bao, Sun, Savarese CVPR 2010; BMVC 2010; CIVC 2011 (editor choice); IJCV 2012


Recognition
Recognition task
Search strategy
Attributes
Context
Tracking

Xiang & Savarese, 2012-2014

State-of-the-art Object Tracking

Object tracking from Lidar
Held, Thrun & Savarese, RSS 2014

Current state of computer vision

2D Recognition

3D Reconstruction

3D shape recovery
3D scene reconstruction
Camera localization
Pose estimation

Object detection
Texture classification
Target tracking
Activity recognition

Perceiving the World in 3D!



Sensibility as human perception

Biederman, Mezzanotte and Rabinowitz, 1982

[Brain diagram: "where" pathway (dorsal stream); "what" pathway (ventral stream); V1; pre-frontal cortex]

From images to the 3D scenes

Choi & Savarese, 2013

A 3DGP encodes geometric and semantic relationships between groups of objects and space elements which frequently co-occur in spatially consistent configurations.

Training Dataset → 3DGPs

Sofa, Coffee Table, Chair, Bed, Dining Table, Side Table
Estimated Layout
3D Geometric Phrases


From images to 3D scenes

Bao & Savarese, 2011-2013

[Scene labeling: Car, Person, Tree, Sky, Street, Building, Else]

From videos to 3D dynamic scenes

Choi & Savarese, 2011-2014

- Monocular cameras
- Un-calibrated cameras
- Arbitrary motion
- Highly cluttered scenes (occlusion, background clutter)
- Almost in real time!


Summary
3D physical
environment
Sensors

Objects

