
CS223A

Vision in Robotics

Professor Silvio Savarese


Computational Vision and Geometry Lab

Silvio Savarese

21-Jan-15

Sensing is the future

Everything is a sensor

Modern vision sensors

[Image panels: w/ gravity, night, thermal, Kinect]

Sensing is not the hard problem

Intelligent understanding of the sensing data is the challenge!
What does intelligent understanding of the sensing data mean?

Computer vision
Computer vision studies the tools and theories that enable the design of machines
that can extract useful information from image data
(images and videos), toward the goal of interpreting the world

Sensing device → Computational device → Extract information → Interpretation

Information: features, 3D structure, motion flows, etc.

Interpretation: recognize objects, scenes, actions, events

Computer vision and Applications

[Timeline: 1990 – 2000 – 2010; EosSystems]

Fingerprint biometrics

Augmentation with 3D computer graphics

3D object prototyping (EosSystems Photomodeler)

Computer vision and Applications

New feature detectors/descriptors
CV leverages machine learning

[Timeline: 1990 – 2000 – 2010; EosSystems, Autostich]

Face detection

Web applications (Photometria)

Panoramic Photography (kolor)

3D modeling of landmarks

Computer vision and Applications

Large scale image matching

Efficient SLAM/SFM
Better clouds
More bandwidth
Increased computational power

[Timeline: 1990 – 2000 – 2010; EosSystems, Autostich, Kinect, Google Goggles, A9, Kooaba]

Image search engines: movies, news, sports

Visual search and landmarks recognition (Google Goggles)

Augmented reality

Motion sensing and gesture recognition

Automotive safety

Mobileye: Vision systems in high-end BMW, GM, Volvo models

Source: A. Shashua, S. Seitz

Computer vision and Applications

Factory inspection

Assistive technologies

Surveillance

Vision for robotics, space exploration

Autonomous driving, robot navigation

Security

Sources: K. Grauman, L. Fei-Fei, S. Lazebnik

Computer vision and Applications

[Timeline: 1990 – 2000 – 2010; EosSystems, Autostich, Kinect, Google Goggles, A9, Kooaba]

Computer vision and Applications

[Timeline: 1990 – 2000 – 2010; from 2D (EosSystems) toward 3D (Google Goggles)]

Computer vision

2D Recognition

3D Reconstruction

3D shape recovery
3D scene reconstruction
Camera localization
Pose estimation

Object detection
Texture classification
Target tracking
Activity recognition


Camera systems
Establish a mapping from 3D to 2D

Pinhole camera
Pinhole perspective projection

f = focal length
c = center of the camera

(x, y, z) → (f x/z, f y/z)
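The projection equation above can be sketched in a few lines of NumPy (a hypothetical helper, not part of the lecture material):

```python
import numpy as np

def project_pinhole(points, f):
    """Pinhole perspective projection: (x, y, z) -> (f*x/z, f*y/z)."""
    points = np.asarray(points, dtype=float)
    return f * points[..., :2] / points[..., 2:3]

# A point at depth z = 2 with f = 1 projects to half its x, y coordinates.
p = project_pinhole([4.0, 2.0, 2.0], f=1.0)
```

Doubling the depth of a point halves its image coordinates, which is exactly the "distant objects look smaller" property discussed below.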

Projective camera

P = M Pw = K [R T] Pw

Internal parameters K:
f = focal length
u0, v0 = offset
α, β = scale factors (non-square pixels)
θ = skew angle

External parameters:
R, T = rotation, translation

Properties of Projection
Points project to points
Lines project to lines
Distant objects look smaller

Properties of Projection
Angles are not preserved
Parallel lines meet!

Parallel lines in the world intersect in the image at a vanishing point

One-point perspective
Masaccio, Trinity,
Santa Maria
Novella, Florence,
1425-28

How to calibrate a camera

Estimate camera parameters such as pose or focal length

Calibration Problem
Calibration rig

P1 … Pn with known positions in [Ow, iw, jw, kw]
p1 … pn known positions in the image
Goal: compute intrinsic and extrinsic parameters

How many correspondences do we need?
M has 11 unknowns → we need 11 equations → 6 correspondences would do it (each correspondence gives 2 equations)
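The counting argument above is what the Direct Linear Transform implements: each correspondence contributes two linear equations in the entries of M, and the stacked system is solved by SVD. A minimal sketch, with a hypothetical K, pose, and rig (not the lecture's toolbox):

```python
import numpy as np

def calibrate_dlt(Pw, p):
    """Direct Linear Transform: estimate the 3x4 projection matrix M
    (up to scale) from n >= 6 world-to-image correspondences."""
    A = []
    for (X, Y, Z), (u, v) in zip(Pw, p):
        A.append([X, Y, Z, 1, 0, 0, 0, 0, -u*X, -u*Y, -u*Z, -u])
        A.append([0, 0, 0, 0, X, Y, Z, 1, -v*X, -v*Y, -v*Z, -v])
    # Null-space solution: right singular vector of the smallest singular value.
    _, _, Vt = np.linalg.svd(np.asarray(A, dtype=float))
    return Vt[-1].reshape(3, 4)

def project(M, X):
    ph = M @ np.append(X, 1.0)
    return ph[:2] / ph[2]

# Hypothetical ground truth: f = 800, principal point (320, 240), rig at z = 5.
K = np.array([[800, 0, 320], [0, 800, 240], [0, 0, 1.0]])
M_true = K @ np.hstack([np.eye(3), np.array([[0], [0], [5.0]])])
rig = np.array([[0, 0, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1],
                [1, 1, 0], [1, 0, 1], [0, 1, 1], [1, 1, 1]], dtype=float)
obs = [project(M_true, X) for X in rig]
M_est = calibrate_dlt(rig, obs)
```

Note the rig points must be non-coplanar (here: cube corners), otherwise the system is degenerate.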

Calibration Procedure
Camera Calibration Toolbox for Matlab
J. Bouguet [1998-2000]
http://www.vision.caltech.edu/bouguetj/calib_doc/index.html#examples

Once the camera is calibrated...

Pinhole perspective projection

M = K [R T]

- Internal parameters K are known
- R, T are known, but these only relate C to the calibration rig

Can I estimate P from the measurement p from a single image?
No - in general, P can be anywhere along the line defined by C and p

Recovering structure from a single view

Pinhole perspective projection
[Figure: scene point P, camera C with known K, calibration rig]

Why is it so difficult?
Intrinsic ambiguity of the mapping from 3D to image (2D)

Recovering structure from a single view

Courtesy slide S. Lazebnik

Two eyes help!

[Figure: point X triangulated from image points x1, x2 in two views with known K, related by R, T; camera centers O1, O2]

This is called triangulation

Triangulation
Find X that minimizes

d(x1, M1 X)^2 + d(x2, M2 X)^2
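A common way to get a (good) approximate X is linear DLT triangulation, which minimizes an algebraic error as a surrogate for the geometric distance above. A sketch with two hypothetical calibrated cameras:

```python
import numpy as np

def triangulate(M1, M2, x1, x2):
    """Linear (DLT) triangulation of a 3D point from two views.
    Minimizes an algebraic error, a standard surrogate for the
    geometric objective d(x1, M1 X)^2 + d(x2, M2 X)^2."""
    A = np.vstack([x1[0] * M1[2] - M1[0],
                   x1[1] * M1[2] - M1[1],
                   x2[0] * M2[2] - M2[0],
                   x2[1] * M2[2] - M2[1]])
    _, _, Vt = np.linalg.svd(A)
    Xh = Vt[-1]                      # null vector, homogeneous X
    return Xh[:3] / Xh[3]

# Two cameras separated by a unit baseline along x (K = I for simplicity).
M1 = np.hstack([np.eye(3), np.zeros((3, 1))])
M2 = np.hstack([np.eye(3), np.array([[-1.0], [0.0], [0.0]])])
X_true = np.array([0.2, 0.1, 4.0])
h1 = M1 @ np.append(X_true, 1.0); x1 = h1[:2] / h1[2]
h2 = M2 @ np.append(X_true, 1.0); x2 = h2[:2] / h2[2]
X_rec = triangulate(M1, M2, x1, x2)
```

With noisy measurements the nonlinear objective on the slide would be minimized starting from this linear estimate.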

Stereo-view geometry
Correspondence: Given a point in one image, how can I find the corresponding point in the other one?
Camera geometry: Given corresponding points in two images, find the camera matrices, position and pose.
Scene geometry: Find the coordinates of a 3D point from its projections into 2 or multiple images.

Epipolar geometry
X

x1

x2
e1

e2
O2

O1
Epipolar Plane
Baseline
Epipolar Lines

Epipoles e1, e2
= intersections of baseline with image planes
= projections of the other camera center

Example: Converging image planes

Example: Parallel image planes


X

e1

x1

O1

x2

e2

O2
Baseline intersects the image plane at infinity
Epipoles are at infinity
Epipolar lines are parallel to x axis

Example: Parallel Image Planes

e1, e2 at infinity

Epipolar Constraint

p1^T F p2 = 0

F p2 is the epipolar line associated with p2 (l1 = F p2)
F^T p1 is the epipolar line associated with p1 (l2 = F^T p1)
F e2 = 0 and F^T e1 = 0
F is a 3x3 matrix; 7 DOF
F is singular (rank two)
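The constraint can be checked numerically on a toy two-view setup. The sketch below is hypothetical (calibrated cameras with K = I, so F reduces to the essential matrix E = [t]× R; beware that the p1/p2 ordering convention varies across texts):

```python
import numpy as np

def skew(t):
    """Cross-product matrix [t]x such that skew(t) @ v == np.cross(t, v)."""
    return np.array([[0, -t[2], t[1]],
                     [t[2], 0, -t[0]],
                     [-t[1], t[0], 0]])

R = np.eye(3)                          # no relative rotation
t = np.array([-1.0, 0.0, 0.0])         # unit baseline along x
E = skew(t) @ R                        # essential matrix (F with K = I)

X = np.array([0.3, -0.2, 4.0])         # scene point in camera-1 frame
x1 = np.append(X[:2] / X[2], 1.0)      # homogeneous image point, view 1
X2 = R @ X + t
x2 = np.append(X2[:2] / X2[2], 1.0)    # homogeneous image point, view 2

residual = x2 @ E @ x1                 # epipolar constraint: should be ~0
l2 = E @ x1                            # epipolar line of x1 in image 2
on_line = l2 @ x2                      # x2 lies on l2: dot product ~0
```

This is exactly the point of the slide: the line l2 restricts the correspondence search for x1 to a 1D set in the other image.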

Why is F useful?

l = F^T x

- Suppose F is known
- No additional information about the scene and camera is given
- Given a point in the left image, how can I find the corresponding point in the right image?

Why is F useful?
F captures information about the epipolar geometry of 2 views + camera parameters
MORE IMPORTANTLY: F gives constraints on how the scene changes under viewpoint transformation (without reconstructing the scene!)
Powerful tool in:
- 3D reconstruction
- Multi-view object/scene matching

Multiple view geometry

Structure from motion problem

[Figure: n points Xj viewed by m cameras M1 … Mm, with image measurements xij]

Given m images of n fixed 3D points:

xij = Mi Xj ,  i = 1, …, m,  j = 1, …, n

Structure from motion problem

From the m×n correspondences xij, estimate:
- m projection matrices Mi (motion)
- n 3D points Xj (structure)

Structure from motion ambiguity

SFM can be solved only up to an N-degree-of-freedom ambiguity.
In the general case (nothing is known) the ambiguity is expressed by an arbitrary affine or projective transformation:

Mi = Ki [Ri  Ti]

xj = Mi Xj = (Mi H^-1)(H Xj)
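The equation above says the images cannot distinguish the true solution from a transformed one. A two-line numerical check (random camera, point, and transform; purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

M = rng.standard_normal((3, 4))            # some projection matrix
X = np.append(rng.standard_normal(3), 1.0) # homogeneous 3D point
H = rng.standard_normal((4, 4)) + 5 * np.eye(4)  # arbitrary invertible 4x4

x  = M @ X                                  # original projection
x2 = (M @ np.linalg.inv(H)) @ (H @ X)       # transformed cameras + structure
# x and x2 are identical: the ambiguity is invisible in the measurements.
```

Priors on the cameras or scene (next slides) are what break this ambiguity.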

Affine ambiguity

Projective ambiguity

Self-calibration
Prior knowledge on cameras or scene can be used to add
constraints and remove ambiguities
Obtain metric reconstruction (up to scale)

Condition → N. views required

- Constant internal parameters
- Aspect ratio and skew known; focal length and offset vary
- Skew = 0, all other parameters vary

Bundle adjustment
Non-linear method for refining structure and motion
Minimizing re-projection error
It can be used before or after metric upgrade

E(M, X) = Σ_{i=1..m} Σ_{j=1..n} D(x_ij, M_i X_j)^2

Bundle adjustment
Non-linear method for refining structure and motion
Minimizing re-projection error
It can be used before or after metric upgrade

E(M, X) = Σ_{i=1..m} Σ_{j=1..n} D(x_ij, M_i X_j)^2

Advantages
- Handles a large number of views
- Handles missing data

Limitations
- Large minimization problem (parameters grow with the number of views)
- Requires a good initial condition
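The objective E(M, X) is straightforward to write down; a real bundle adjuster would minimize it with a sparse Levenberg-Marquardt solver. A toy sketch (hypothetical two-camera, three-point scene) showing that the true geometry is the minimizer:

```python
import numpy as np

def reprojection_error(Ms, Xs, xs):
    """E(M, X) = sum_ij || x_ij - proj(M_i X_j) ||^2, the slide's objective.
    Ms: list of 3x4 matrices; Xs: (n, 3) points; xs: (m, n, 2) measurements."""
    E = 0.0
    for i, M in enumerate(Ms):
        for j, X in enumerate(Xs):
            ph = M @ np.append(X, 1.0)
            E += np.sum((xs[i, j] - ph[:2] / ph[2]) ** 2)
    return E

Ms = [np.hstack([np.eye(3), np.zeros((3, 1))]),
      np.hstack([np.eye(3), np.array([[-1.0], [0.0], [0.0]])])]
Xs = np.array([[0.1, 0.2, 4.0], [-0.3, 0.1, 5.0], [0.2, -0.1, 6.0]])
xs = np.array([[(M @ np.append(X, 1.0))[:2] / (M @ np.append(X, 1.0))[2]
                for X in Xs] for M in Ms])

E0 = reprojection_error(Ms, Xs, xs)          # zero at the true solution
E1 = reprojection_error(Ms, Xs + 0.05, xs)   # perturbed structure: larger
```

The "requires a good initial condition" limitation follows directly: this objective is non-convex in (M, X), so the solver only refines a nearby initialization.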


Results and applications

Courtesy of Oxford Visual Geometry Group

Lucas & Kanade, 81
Chen & Medioni, 92
Debevec et al., 96
Levoy & Hanrahan, 96
Fitzgibbon & Zisserman, 98
Triggs et al., 99
Pollefeys et al., 99
Kutulakos & Seitz, 99
Levoy et al., 00
Hartley & Zisserman, 00
Dellaert et al., 00
Rusinkiewicz et al., 02
Nistér, 04
Brown & Lowe, 04
Schindler et al., 04
Lourakis & Argyros, 04
Colombo et al., 05
Golparvar-Fard et al., JAEI 10
Pandey et al., IFAC 2010
Pandey et al., ICRA 2011
Microsoft's Photosynth
Snavely et al., 06-08
Schindler et al., 08
Agarwal et al., 09
Frahm et al., 10

Results and applications


M. Pollefeys et al 98---

Results and applications

Noah Snavely, Steven M. Seitz, Richard Szeliski, "Photo tourism: Exploring photo collections in 3D," ACM Transactions on Graphics (SIGGRAPH Proceedings), 2006

Computer vision

2D Recognition

3D Reconstruction

3D shape recovery
3D scene reconstruction
Camera localization
Pose estimation

Object detection
Texture classification
Target tracking
Activity recognition


Classification:
Does this image contain a building? [yes/no]

Yes!

Classification:
Is this a beach?

Image Search

Organizing photo collections

Detection:
Does this image contain a car? [where?]

car

Detection:
Which object does this image contain? [where?]
Building

clock

person

car

Detection:
Accurate localization (segmentation)

clock

Object detection is useful

Assistive technologies

Surveillance

Computational photography

Security

Assistive driving

Categorization vs Single instance recognition

Which building is this? The Marshall Field building in Chicago

Categorization vs Single instance recognition

Where is the crunchy nut?

Applications of computer vision


Recognizing landmarks in
mobile platforms

+ GPS

Detection: Estimating object semantic & geometric attributes

Object: Building, 45° pose, 8-10 meters away; it has bricks

Object: Person, back; 1-2 meters away

Object: Police car, side view, 4-5 m away

Activity or Event recognition


What are these people doing?

Visual Recognition
Design algorithms that are capable of:
- Classifying images or videos
- Detecting and localizing objects
- Estimating semantic and geometric attributes
- Classifying human activities and events

Why is this challenging?

How many object categories are there?

Challenges: viewpoint variation

Michelangelo 1475-1564

slide credit: Fei-Fei, Fergus & Torralba

Challenges: illumination

image credit: J. Koenderink

Challenges: scale

slide credit: Fei-Fei, Fergus & Torralba

Challenges: deformation

Challenges:
occlusion

Magritte, 1957

slide credit: Fei-Fei, Fergus & Torralba

Challenges: background clutter

Kilmeny Niland. 1995

Challenges: intra-class variation

Basic properties
Representation
How to represent an object category; which
classification scheme?

Learning
How to learn the classifier, given training data

Recognition
How the classifier is to be used on novel data

Representation
- Building blocks: Sampling strategies

Interest operators
Multiple interest operators
Dense, uniformly
Randomly

Image credits: F-F. Li, E. Nowak, J. Sivic

Representation
- Building blocks: Choice of descriptors [SIFT, HOG, codewords, …]

Representation
Appearance only or location and appearance

Representation
Invariances
View point
Illumination
Occlusion
Scale
Deformation
Clutter
etc.

Representation
To handle intra-class variability, it is convenient to describe an object category using probabilistic models
Object models: Generative vs Discriminative vs hybrid

Object categorization:
the statistical viewpoint

p(zebra | image)  vs.  p(no zebra | image)

Bayes rule:

p(A|B) = p(B|A) p(A) / p(B)

p(zebra | image)      p(image | zebra)     p(zebra)
------------------ = ------------------ × -----------
p(no zebra | image)   p(image | no zebra)  p(no zebra)

posterior ratio = likelihood ratio × prior ratio
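Plugging in numbers makes the decomposition concrete. The probabilities below are invented purely for illustration:

```python
# Hypothetical likelihoods and priors for the zebra example.
p_img_given_zebra = 0.20     # p(image | zebra)
p_img_given_no    = 0.01     # p(image | no zebra)
p_zebra, p_no     = 0.05, 0.95

likelihood_ratio = p_img_given_zebra / p_img_given_no   # 20
prior_ratio      = p_zebra / p_no                       # ~0.053
posterior_ratio  = likelihood_ratio * prior_ratio
# posterior_ratio > 1 -> decide "zebra", despite the small prior.
```

A strong likelihood ratio can overcome a skewed prior; with a weaker likelihood ratio (say 10) the same prior would flip the decision.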

Object categorization:
the statistical viewpoint

Discriminative methods model the posterior
Generative methods model the likelihood and prior

Bayes rule:

p(zebra | image)      p(image | zebra)     p(zebra)
------------------ = ------------------ × -----------
p(no zebra | image)   p(image | no zebra)  p(no zebra)

posterior ratio = likelihood ratio × prior ratio

Discriminative models

Neural networks
- LeCun, Bottou, Bengio, Haffner 1998
- Rowley, Baluja, Kanade 1998

Nearest neighbor (10^6 examples)
- Shakhnarovich, Viola, Darrell 2003
- Berg, Berg, Malik 2005...

Support Vector Machines
- Guyon, Vapnik; Heisele, Serre, Poggio

Latent SVM / Structural SVM
- Felzenszwalb 00
- Ramanan 03

Boosting
- Viola, Jones 2001
- Torralba et al. 2004
- Opelt et al. 2006

Courtesy of Vittorio Ferrari
Slide credit: Kristen Grauman
Slide adapted from Antonio Torralba

Generative models

Naïve Bayes classifier
- Csurka, Bray, Dance & Fan, 2004

Hierarchical Bayesian topic models (e.g. pLSA and LDA)
- Object categorization: Sivic et al. 2005, Sudderth et al. 2005
- Natural scene categorization: Fei-Fei et al. 2005

2D part-based models
- Constellation models: Weber et al 2000; Fergus et al 200
- Star models: ISM (Leibe et al 05)

3D part-based models
- multi-aspects: Sun et al, 2009

Basic properties
Representation
How to represent an object category; which
classification scheme?

Learning
How to learn the classifier, given training data

Recognition
How the classifier is to be used on novel data


Learning
Learning parameters: What are you maximizing? Likelihood (Gen.) or performance on a train/validation set (Disc.)
Level of supervision
- Manual segmentation; bounding box; image labels; noisy labels
Batch/incremental
Priors
Training images:
- Issue of overfitting
- Negative images for discriminative methods

Basic properties
Representation
How to represent an object category; which
classification scheme?

Learning
How to learn the classifier, given training data

Recognition
How the classifier is to be used on novel data

Recognition
Recognition task: classification, detection, etc.

Recognition
Recognition task
Search strategy: Sliding Windows (Viola, Jones 2001)
+ Simple
- Computational complexity (x, y, S, …, N of classes)
  (BSW by Lampert et al 08; also Alexe et al 10)


Localization
Objects are not boxes

Segmentation
Bottom up segmentation

Malik et al. 01
Maire et al. 08
Felzenszwalb and Huttenlocher, 2004

Semantic segmentation

Duygulu et al. 02


Localization
- Objects are not boxes
- Prone to false positives
Non-max suppression:
- Canny 86
- Desai et al, 2009
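The greedy variant of non-max suppression used in most sliding-window detectors is short enough to sketch (hypothetical helper; boxes as [x1, y1, x2, y2]):

```python
import numpy as np

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy non-max suppression: keep the highest-scoring box, drop
    neighbors overlapping it by more than iou_thresh, repeat."""
    area = lambda b: (b[:, 2] - b[:, 0]) * (b[:, 3] - b[:, 1])
    order = np.argsort(scores)[::-1]
    keep = []
    while order.size:
        i = order[0]
        keep.append(int(i))
        # Intersection of box i with the remaining boxes
        x1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        y1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        x2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        y2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        iou = inter / (area(boxes[i:i+1])[0] + area(boxes[order[1:]]) - inter)
        order = order[1:][iou <= iou_thresh]
    return keep

boxes = np.array([[0, 0, 10, 10], [1, 1, 11, 11], [20, 20, 30, 30]], float)
scores = np.array([0.9, 0.8, 0.7])
kept = nms(boxes, scores)   # the two overlapping detections collapse to one
```

This is exactly the fix for the "prone to false positives" issue above: a sliding window fires on many nearby positions, and NMS keeps one detection per object.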

Successful methods using sliding windows

- Subdivide scanning window
- In each cell compute histogram of gradient orientations
Code available: http://pascal.inrialpes.fr/soft/olt/
[Dalal & Triggs, CVPR 2005]

- Subdivide scanning window
- In each cell compute histogram of codewords of adjacent segments
Code available: http://www.vision.ee.ethz.ch/~calvin
[Ferrari et al, PAMI 2008]
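The per-cell histogram at the core of the Dalal & Triggs descriptor can be sketched in NumPy (a simplified, hypothetical version: unsigned orientations, no block normalization or interpolation as in the full HOG pipeline):

```python
import numpy as np

def cell_orientation_histogram(patch, n_bins=9):
    """Magnitude-weighted histogram of unsigned gradient orientations
    for one cell - the core of a HOG-style descriptor."""
    gy, gx = np.gradient(patch.astype(float))
    mag = np.hypot(gx, gy)
    ang = np.mod(np.arctan2(gy, gx), np.pi)            # unsigned, [0, pi)
    bins = np.minimum((ang / np.pi * n_bins).astype(int), n_bins - 1)
    hist = np.bincount(bins.ravel(), weights=mag.ravel(), minlength=n_bins)
    return hist / (np.linalg.norm(hist) + 1e-9)        # L2-normalize

# A vertical step edge has purely horizontal gradients (orientation ~ 0),
# so all the histogram mass falls in the first bin.
patch = np.zeros((8, 8)); patch[:, 4:] = 1.0
h = cell_orientation_histogram(patch)
```

The full detector concatenates such histograms over all cells of the scanning window and feeds the vector to a linear SVM.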

Recognition
Recognition task
Search strategy: Probabilistic heat maps
- Fergus et al 03
- Leibe et al 04

Recognition
Recognition task
Search strategy :
Hypothesis generation + verification

Recognition
Recognition task
Search strategy
Attributes

- It has metal
- it is glossy
- has wheels
Farhadi et al 09
Lampert et al 09
Wang & Forsyth 09

Savarese, 2007
Sun et al 2009
Liebelt et al., 08, 10
Farhadi et al 09

Category: car
Azimuth = 225°
Zenith = 30°

Recognizing 3D objects
Xiang & Savarese, 2012-2014

[Detected categories: BED, CHAIR, CAR, TABLE]

Recognition
Recognition task
Search strategy
Attributes
Context
Semantic:
Torralba et al 03
Rabinovich et al 07
Gupta & Davis 08
Heitz & Koller 08
L-J Li et al 08
Bang & Fei-Fei 10

Geometric
Hoiem, et al 06
Gould et al 09
Bao, Sun, Savarese 10

Recognition in context

Labelme dataset [Russell et al., 08]

Bao, Sun, Savarese CVPR 2010; BMVC 2010; CIVC 2011 (editor choice); IJCV 2012


Recognition
Recognition task
Search strategy
Attributes
Context
Tracking

Xiang & Savarese, 2012-2014

State-of-the-art Object Tracking

Object tracking from Lidar
Held, Thrun & Savarese, RSS 2014

Current state of computer vision

2D Recognition

3D Reconstruction

3D shape recovery
3D scene reconstruction
Camera localization
Pose estimation

Object detection
Texture classification
Target tracking
Activity recognition

Perceiving the World in 3D!



Sensibility as human perception

Biederman, Mezzanotte and Rabinowitz, 1982

[Brain diagram: "where" pathway (dorsal stream); "what" pathway (ventral stream); V1; pre-frontal cortex]

From images to the 3D scenes

Choi & Savarese, 2013

A 3DGP encodes geometric and semantic relationships between groups of objects and space elements which frequently co-occur in spatially consistent configurations.

Training Dataset → 3DGPs

Sofa, Coffee Table, Chair, Bed, Dining Table, Side Table
Estimated Layout
3D Geometric Phrases


From images to 3D scenes

Bao & Savarese, 2011-2013

[Scene labeling: Car, Person, Tree, Sky, Street, Building, Else]

From videos to 3D dynamic scenes

Choi & Savarese, 2011-2014

- Monocular cameras
- Un-calibrated cameras
- Arbitrary motion
- Highly cluttered scenes (occlusion, background clutter)
- Almost in real time!


Summary
3D physical
environment
Sensors

Objects

