
Robot Learning with Implicit Representations

Perception, Action, and Simulation

Animesh Garg
RSS 2022 Workshop
What is an Implicit Neural Representation?

3D Representations in Visual Computing

✓ Discrete Representations
✓ Intuitive Spatial Map

✗ Memory
✗ Arbitrary Topologies
✗ Connectivity Structures

Voxels | Point Clouds | Mesh

Occupancy Networks, CVPR 2019

What is an Implicit Neural Representation?

✓ Continuous Representations
✓ "Infinite" Spatial Resolution
✓ Memory depends on signal complexity

✗ Not Analytically Tractable

Voxels | Point Clouds | Mesh → Implicit Representation

Occupancy Networks, CVPR 2019

What is an Implicit Neural Representation?

Coordinates → Values

f : ℝᵐ → ℝⁿ,  V = f(r)

• Images: r = (x, y), V = (r, g, b)
• 3D scenes and shapes (as in NeRFs): r = (x, y, z, θ, φ), V = (r, g, b, σ)
• Trajectories: r = (q_t), generalized coordinates; V = utility function
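As a concrete sketch (assuming PyTorch; the architecture below is an illustrative choice, not the one used in any of the cited works), an implicit representation of an image is just an MLP fit to map pixel coordinates to colors — the trained weights are the representation:

```python
import torch
import torch.nn as nn

class ImplicitImage(nn.Module):
    """Coordinate network f: R^2 -> R^3, mapping (x, y) -> (r, g, b)."""
    def __init__(self, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 3), nn.Sigmoid(),  # RGB in [0, 1]
        )

    def forward(self, coords):  # coords: (..., 2), normalized to [-1, 1]
        return self.net(coords)

model = ImplicitImage()
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
coords = torch.rand(4096, 2) * 2 - 1   # sampled pixel coordinates
colors = torch.rand(4096, 3)           # their observed RGB values (placeholder)
loss = ((model(coords) - colors) ** 2).mean()  # simple reconstruction loss
loss.backward()
opt.step()
```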
Implicit Representations in Visual Computing

Shape reconstruction | Rendering | Novel view synthesis

Occupancy Networks: Learning 3D Reconstruction in Function Space. In CVPR, 2019.
Neural Geometric Level of Detail: Real-time Rendering with Implicit 3D Shapes. In CVPR, 2021.
NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis. In ECCV, 2020.
Implicit Neural Representations in Robotics

Grasp detection | Visuomotor control | Generalization in Manipulation

Synergies Between Affordance and Geometry: 6-DOF Grasp Detection via Implicit Representations. In RSS, 2021.
3D Neural Scene Representations for Visuomotor Control. In CoRL, 2021.
Neural Descriptor Fields: SE(3)-Equivariant Object Representations for Manipulation. In ICRA, 2022.
Robot Learning with Implicit Representations
Algorithmic Development (perception and control)
+ Improved Simulation for Contact-rich Manipulation

• Perception: Objects & Poses
• Action: Trajectories & Value Functions
• Simulation: Differentiable contact sim

Work in progress
NERF2NERF
Registering Partially Overlapping NeRFs

Lily Goli, Daniel Rebain, Animesh Garg, Andrea Tagliasacchi
Motivation

[Pipeline] 1. NeRF Acquisition → 2. NeRF-2-NeRF Registration of collected NeRFs in memory, yielding estimated object poses → 3. Combination & Rendering, giving an interactive & interactable simulation.
What are Neural Radiance Fields (NeRFs)?
Training an MLP | Composition & Rendering
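For reference, a NeRF is an MLP $F_\Theta$ mapping a 5D coordinate (position and viewing direction) to color and density, with images formed by volume rendering along camera rays (equations from the original NeRF paper, ECCV 2020):

$$F_\Theta : (x, y, z, \theta, \phi) \mapsto (r, g, b, \sigma)$$

$$\hat{C}(\mathbf{r}) = \int_{t_n}^{t_f} T(t)\,\sigma(\mathbf{r}(t))\,\mathbf{c}(\mathbf{r}(t), \mathbf{d})\,dt, \qquad T(t) = \exp\!\left(-\int_{t_n}^{t} \sigma(\mathbf{r}(s))\,ds\right)$$

Training fits $\Theta$ by minimizing the squared error between rendered and observed pixel colors.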
Registration Problem in NeRFs

Unsupervised training to find T. What objective function? Two terms: a view/RGB difference and a correspondence difference.

Loss = Error(NeRF₁(T ∘ R), NeRF₂(R)) + distance between the positions of corresponding points after applying T

Challenges:
• Even if the learned T is optimal, the error between rendered images is NOT zero, because the scenes are only partially overlapping. ⇒ We apply a robust function to the MSE to make it more robust.
• Corresponding points lie in the 2D space of rendered images, while the transformation T lives in 3D space. ⇒ We derive equivalent 3D points using triangulation.
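A schematic of this objective in PyTorch-style pseudocode; `transform_rays`, `transform_points`, and the correspondence inputs are illustrative stand-ins, not the paper's actual API:

```python
def registration_loss(nerf1, nerf2, T, rays, pts1, pts2, robust):
    """Sketch of the two-term registration objective.
    nerf1, nerf2 : callables rendering a batch of rays to RGB
    T            : candidate SE(3) transform aligning NeRF 1 to NeRF 2
    rays         : shared query rays R in the overlap region
    pts1, pts2   : corresponding 3D points, triangulated from 2D matches
    robust       : robust penalty applied to the photometric error
    """
    # View/RGB difference: NeRF 1 rendered through T vs. NeRF 2.
    rgb1 = nerf1(T.transform_rays(rays))
    rgb2 = nerf2(rays)
    view_term = robust(((rgb1 - rgb2) ** 2).mean(dim=-1)).mean()

    # Correspondence difference: matched 3D points should align under T.
    corr_term = (T.transform_points(pts1) - pts2).norm(dim=-1).mean()
    return view_term + corr_term
```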
Focusing on the First Loss Term (View Difference)

Robust registration of 2D views; modeling the problem in a 2D setting:

• The delta will not be zero at some query points, even if T_L = T_G.
• To focus only on the object of interest, many loss functions exist, mostly relying on manual thresholding.

Barron, CVPR 2019
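The referenced robust function is Barron's general adaptive loss (CVPR 2019), which subsumes the common robust penalties; a direct transcription:

```python
def barron_robust_loss(x, alpha, c):
    """General robust loss rho(x, alpha, c) of Barron, CVPR 2019.
    alpha selects the penalty family (2: L2, 1: Charbonnier/pseudo-Huber,
    0: Cauchy, -2: Geman-McClure, as limits); c sets the scale.
    This form assumes alpha not in {0, 2}, where the limits apply instead.
    Works elementwise on floats, NumPy arrays, or torch tensors.
    """
    b = abs(alpha - 2.0)
    return (b / alpha) * (((x / c) ** 2 / b + 1.0) ** (alpha / 2.0) - 1.0)
```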


Registration via Radiance Matching

[Figure, random views 1 & 2: NeRF 1 without transform (with sample points) | NeRF 1 with transform (with sample points) | overlap of fixed NeRF 2 and moving NeRF 1 | fixed NeRF 2 (target)]
Different Lighting (Failure Case)

If we use only radiance for registration, then different lighting models on the object cause failure.
• Fix: use geometry features rather than radiance.

[Figure: sampling in the moving NeRF target view (uniformly lighter)]


Geometry Network via Distillation
We train a 3-layer network supervised by the pre-trained NeRF, distilling its geometry (rather than its radiance) into a field used for registration.
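A minimal sketch of such a distillation step, assuming access to the trained NeRF's density query; the 3-layer `GeometryNet` and the sampling scheme are illustrative, not the paper's exact setup:

```python
import torch
import torch.nn as nn

class GeometryNet(nn.Module):
    """Small 3-layer MLP distilling geometry from a trained NeRF."""
    def __init__(self, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(3, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, x):
        return self.net(x)

def distill_step(geom, nerf_density, opt, n=8192):
    """One step: regress the teacher NeRF's geometry signal (stop-gradient)."""
    x = torch.rand(n, 3) * 2 - 1          # sample points in the scene volume
    with torch.no_grad():
        target = nerf_density(x)          # teacher supervision
    loss = ((geom(x) - target) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```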
Results

[Figure, random views 1 & 2: (moving) NeRF 1 initial pose | NeRF 1 over registration iterations | overlay of fixed NeRF 2 and moving NeRF 1 | (fixed) NeRF 2 target]
Robot Learning with Implicit Representations
Algorithmic Development (perception and control)
+ Improved Simulation for Contact-rich Manipulation

• Perception: Objects & Poses
• Action: Trajectories & Value Functions
• Simulation: Differentiable contact sim
NEURAL MOTION FIELDS
Encoding Grasp Trajectories as Implicit Value Functions

Yun-Chun Chen, Adithya Murali, Bala Sundaralingam, Wei Yang, Animesh Garg, Dieter Fox
Existing Grasping Methods

1. Grasp pose detection
2. Find inverse kinematic solutions
3. Plan a collision-free trajectory
4. Execute the open-loop trajectory

6-DOF GraspNet: Variational Grasp Generation for Object Manipulation. In ICCV, 2019.
Existing Grasping Methods

+ Table-top object grasping
+ Grasping in clutter
+ Bin-picking

− Infer a finite, discrete number of grasps, while grasp affordances form a continuous manifold

Contact-GraspNet: Efficient 6-DOF Grasp Generation in Cluttered Scenes. In ICRA, 2021.
6-DOF Grasping for Target-driven Object Manipulation in Clutter. In ICRA, 2020.
6-DOF GraspNet: Variational Grasp Generation for Object Manipulation. In ICCV, 2019.
Neural Motion Fields

Goal: learn a value function that can be used to plan a trajectory for grasping.

Value function: map a gripper pose to its path length to a grasp.

We propose Neural Motion Fields, which consists of two modules: a path-length module and a collision module. The path-length module first takes as input the object point cloud P and uses a point-cloud encoder E_path-length to encode a feature embedding f_path-length = E_path-length(P) ∈ ℝᵈ, where d is the feature dimension. The point-cloud feature f_path-length and the gripper pose g are then concatenated and passed to the path-length prediction network F_path-length to predict the path length V(g, P) for the gripper pose g. To train the path-length module, we adopt an ℓ₁ loss:

L_path-length = ‖V_pred(g, P) − V_gt(g, P)‖₁

where V_pred(g, P) denotes the predicted path length of the gripper pose g and V_gt(g, P) denotes the ground truth. The collision module is trained with a cross-entropy loss:

L_collision = −[ p_gt(g, P) log p_pred(g, P) + (1 − p_gt(g, P)) log(1 − p_pred(g, P)) ]

where p_pred(g, P) is the predicted collision probability and p_gt(g, P) is the ground truth.

We visualize the learned value function as cost maps: for a selected grasp pose, we keep the orientation, vary the x and y positions to compose poses, and query the path lengths of the composed poses from the learned model. We show that this learned object-centric representation allows reactive grasp manipulation using MPC [1].
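In code, the two training losses are a standard ℓ1 regression and a binary cross-entropy (a sketch; tensor shapes are assumptions):

```python
import torch.nn.functional as F

def nmf_losses(v_pred, v_gt, p_pred, collide_gt):
    """Neural Motion Fields training losses (sketch).
    v_pred, v_gt : predicted / ground-truth path lengths, shape (B,)
    p_pred       : predicted collision probabilities in (0, 1), shape (B,)
    collide_gt   : ground-truth collision labels in {0., 1.}, shape (B,)
    """
    loss_path = F.l1_loss(v_pred, v_gt)                     # L_path-length
    loss_coll = F.binary_cross_entropy(p_pred, collide_gt)  # L_collision
    return loss_path, loss_coll
```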
Neural Motion Fields

Problem statement: value-function learning for grasping. We are interested in learning a model that can be used to plan a kinematically feasible trajectory for the robot to execute to grasp an object. Specifically, we cast this task as a value-function learning problem. We assume we are given a segmented object point cloud P ∈ ℝ^{N×3}, where N is the number of points, and a gripper pose g ∈ SE(3). The value function V(g, P) describes how far the gripper pose g is from a grasp on the object; we use the path length of a gripper pose to a grasp to represent the value function.

Gripper pose path length. Given a trajectory {g_i}_{i=0}^{t}, where g₀ denotes the end pose (the grasp pose) and g_t denotes the start pose, the path length of the start pose, denoted V(g_t), is defined as the cumulative sum of the average distance between adjacent gripper poses [16]:

V(g_t) = Σ_{i=0}^{t−1} (1/m) Σ_{x∈M} ‖(R_i x + T_i) − (R_{i+1} x + T_{i+1})‖

where R_i and T_i are the rotation and translation of the gripper pose g_i, and M is a set of m reference points on the gripper.

[Figure: a trajectory from the start pose g_t to the grasp pose g₀.]
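This definition transcribes directly (a sketch; the gripper reference points `M` are assumed given, e.g., keypoints on the gripper model):

```python
import numpy as np

def path_length(Rs, Ts, M):
    """V(g_t): cumulative average keypoint displacement along a trajectory.
    Rs : (t+1, 3, 3) rotations, ordered from grasp pose g_0 to start pose g_t
    Ts : (t+1, 3) translations
    M  : (m, 3) reference points on the gripper
    """
    total = 0.0
    for i in range(len(Rs) - 1):
        a = M @ Rs[i].T + Ts[i]          # keypoints under pose g_i
        b = M @ Rs[i + 1].T + Ts[i + 1]  # keypoints under pose g_{i+1}
        total += np.linalg.norm(a - b, axis=1).mean()  # (1/m) sum over M
    return total
```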

Grasp Motion Generation

Given the path-length value function represented by the path-length module and the collision value function represented by the collision module, we formulate the grasp cost C_grasp as

C_grasp(g_t, P) = (1 − V(g_t, P)) + C(g_t, P)

where V(g_t, P) is the predicted path length for the gripper pose g_t, C(g_t, P) is the collision cost of the gripper pose g_t, computed by thresholding p(g_t, P) ≥ τ with hyperparameter τ (we set τ = 0.25), and P is the object point cloud.

We query gripper poses and optimize the value function using a sampling-based MPC framework (MPPI): the grasp cost C_grasp is optimized along with the cost C_storm to ensure smooth, collision-free motions using STORM [1], a GPU-based MPC framework:

min_{ẍ_{t∈[0,H]}}  C_storm(q) + C_grasp

Additional details on C_storm are available in [1].

Experimental setup: four box objects from the dataset of [8] and one bowl object from ACRONYM; we first subsample a set of 16 grasp poses around each object.

STORM: An Integrated Framework for Fast Joint-Space Model-Predictive Control for Reactive Manipulation. In CoRL, 2022.
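A sketch of the grasp cost as it would be queried inside the MPC loop (the learned models and the collision penalty weight `w` are placeholders, not the paper's exact implementation):

```python
def grasp_cost(g, P, path_len_fn, collide_prob_fn, tau=0.25, w=1.0):
    """C_grasp(g, P) = (1 - V(g, P)) + C(g, P).
    path_len_fn     : learned path-length value function V(g, P)
    collide_prob_fn : learned collision probability p(g, P)
    C(g, P) is obtained by thresholding p at tau (0.25 in the paper);
    w is an assumed penalty weight.
    """
    V = path_len_fn(g, P)
    C = w * float(collide_prob_fn(g, P) > tau)  # penalize predicted collisions
    return (1.0 - V) + C
```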
Ablation Study on Number of Trajectories

[Plots: static object poses | dynamic object poses]

More training trajectories reduce fine-grained rotation error, especially with non-stationary objects.
Ablation Study on Number of Anchor Grasps

[Figure: anchor grasps]

More anchor grasps help predictions snap to the multi-modal grasp distribution.


Floating Object Demo
Robot Learning with Implicit Representations
Algorithmic Development (perception and control)
+ Improved Simulation for Contact-rich Manipulation

• Perception: Objects & Poses
• Action: Trajectories & Value Functions
• Simulation: Differentiable contact sim
GRASP’D
Differentiable Contact-Rich Grasp Synthesis

Dylan Turpin, Liquan Wang, Eric Heiden, Yun-Chun Chen, Miles Macklin, Stavros Tsogkas, Sven Dickinson, Animesh Garg
Motivation

Goal: Make SDF-based contact forces friendly to gradient-based optimization.

Why? Planning in high-dimensional contact-rich scenarios, e.g., robotic grasping and manipulation with multi-finger hands.

Challenges

1. Contact sparsity
Only a fraction of possible contacts are active (in collision) at a given time, and inactive contacts have no gradient. → Can't follow the gradient to create new contacts.

2. Local flatness
The ground-truth SDF is often computed from a mesh. If the closest point is on a triangle face, the surface-normal gradient is 0. → Can't follow the gradient to improve contact normals.

3. Non-smooth object geometry
Surface normals are often discontinuous (e.g., moving from one face of a cube to another). → Can't follow the gradient across non-smooth geometry.

So how can gradient-based optimization be possible?
Motivation

Challenges & Proposed Solutions

1. Contact sparsity → Leaky gradient
Can't follow the gradient to create new contacts, so allow the gradient to leak through inactive contacts.

2. Local flatness → Phong SDF
Can't follow the gradient to improve contact normals, so borrow graphics techniques for smoothing.

3. Non-smooth object geometry → SDF dilation
Can't follow the gradient across non-smooth geometry, so consider the (smoothed, padded) radius-r level set.

An example application: generating contact-rich grasps for high-DOF human and robotic hands.
Like this… But how?
Challenge #1: Contact sparsity

Proper gradient: non-zero if the object SDF at the contact location is less than 0 (i.e., in collision), and zero otherwise.

Leaky gradient: the gradient when not in collision is simply scaled down by α.
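One way to realize this is a custom autograd function whose forward pass is the usual penetration depth but whose backward pass lets a fraction α of the gradient through inactive contacts (a sketch; α is an assumed hyperparameter, not the paper's exact mechanism):

```python
import torch

class LeakyPenetration(torch.autograd.Function):
    """Penetration depth relu(-sdf) with a gradient that 'leaks' through
    inactive contacts (sdf >= 0), scaled down by alpha."""

    @staticmethod
    def forward(ctx, sdf, alpha=0.05):
        ctx.save_for_backward(sdf)
        ctx.alpha = alpha
        return torch.relu(-sdf)  # unchanged forward: zero when not in collision

    @staticmethod
    def backward(ctx, grad_out):
        (sdf,) = ctx.saved_tensors
        # True derivative is -1 where sdf < 0 and 0 elsewhere;
        # leak -alpha through the inactive region instead of 0.
        scale = torch.where(sdf < 0,
                            torch.ones_like(sdf),
                            torch.full_like(sdf, ctx.alpha))
        return -scale * grad_out, None

# usage:
depth = LeakyPenetration.apply(torch.tensor([0.02, -0.01], requires_grad=True))
```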
Challenge #2: Local flatness

SDF ground truth is often computed from a mesh. But the surface normal is constant on faces, so the contact normal (computed as the positional derivative of the SDF) has 0 gradient.

Many possible solutions! We use one simple trick by analogy to ray-tracing: Phong tessellation.

Figure from Werling, K., Omens, D., Lee, J., Exarchos, I., & Liu, C. K. Fast and Feature-Complete Differentiable Physics for Articulated Rigid Bodies with Contact.
Figure from Phong Tessellation, T. Boubekeur & M. Alexa, ACM Transactions on Graphics 27(5).
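The underlying smoothing idea is to interpolate vertex normals with the barycentric coordinates of the closest surface point, so the normal varies smoothly across a face (a sketch of the idea, not the paper's exact Phong-SDF construction):

```python
import numpy as np

def phong_normal(bary, n0, n1, n2):
    """Barycentric interpolation of vertex normals.
    bary        : (3,) barycentric coords (u, v, w) of the closest point
    n0, n1, n2  : (3,) unit normals at the triangle's vertices
    """
    n = bary[0] * n0 + bary[1] * n1 + bary[2] * n2
    return n / np.linalg.norm(n)  # renormalize the blended normal
```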
Challenge #3: Non-smooth geometry

Easy to optimize over the surface of a spherical cow (⚽), but most aren't so smooth (🐄).

Discontinuities in surface normals → discontinuities in contact normals → discontinuities in their gradients with respect to contact positions.

Luckily for us… SDFs are easy to smooth. Instead of the sdf = 0 level set, consider sdf = r for some r > 0, and adjust towards the true surface (r = 0) as optimization progresses.

For robotic grasping: the hand pre-shapes as if grasping a larger version of the same object.

This does not help concave corners. Future work: is there a better transform?

Figure from Inigo Quilez, "Interior SDFs" (2020).
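Dilation itself is a one-line change to the field, plus an annealing schedule (the toy SDF, radius, and linear schedule below are assumptions, not the paper's exact settings):

```python
import numpy as np

def sphere_sdf(x, radius=1.0):
    """Toy object SDF (a sphere), standing in for the real object geometry."""
    return np.linalg.norm(x, axis=-1) - radius

def dilated_sdf(sdf_fn, r):
    """Shift by r: the r-level set of sdf_fn becomes the new zero-level set,
    rounding convex corners with radius r."""
    return lambda x: sdf_fn(x) - r

r0, steps = 0.05, 200
for k in range(steps):
    r = r0 * (1.0 - k / (steps - 1))   # anneal r from r0 down to 0
    phi = dilated_sdf(sphere_sdf, r)   # the contact model queries phi
    # ... one gradient step on the grasp parameters using phi ...
```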


Grasps from the ObMan dataset [*]

Fingertip-only grasps:
• less contact
• less stable — less contact = less friction
• less plausible — human grasping is contact-rich

[*] Hasson, Y., Varol, G., Tzionas, D., Kalevatykh, I., Black, M. J., Laptev, I., & Schmid, C. (2019). Learning joint reconstruction of hands and manipulated objects. In CVPR (pp. 11807–11816).
[Figure: ObMan grasps vs. ours (~30×)]
Grasp’D: Takeaway

Goal: Make SDF-based contact forces friendly to gradient-based optimization.
Why? Planning in high-dimensional contact-rich scenarios, e.g., robotic grasping and manipulation with multi-finger hands.

Challenges & Proposed Solutions
1. Contact sparsity → Leaky gradient
2. Local flatness → Phong SDF
3. Non-smooth object geometry → SDF dilation

An example application: generating contact-rich human & robotic grasps.
Robot Learning with Implicit Representations
Algorithmic Development (perception and control)
+ Improved Simulation for Contact-rich Manipulation

• Perception: Objects & Poses
• Action: Trajectories & Value Functions
• Simulation: Differentiable contact sim

INR: object-oriented state representations + planning and control

Key challenge: pre-training and generalization

Work in progress
Robot Learning with Implicit Representations
Perception, Action, and Simulation

Animesh Garg
garg@cs.toronto.edu | @animesh_garg
