Professional Documents
Culture Documents
09 Det Seg Part 02
09 Det Seg Part 02
Abir Das
IIT Kharagpur
Agenda
Abir Das (IIT Kharagpur) CS60010 March 03, 09 and 10, 2022 2 / 95
RCNN Architectures YOLO Segmentation
Abir Das (IIT Kharagpur) CS60010 March 03, 09 and 10, 2022 3 / 95
RCNN Architectures YOLO Segmentation
R-CNN
Regions of Interest
__:.�p- (Roi) from a proposal
method (~2k)
Input image
Abir Das (IIT Kharagpur) CS60010 March 03, 09 and 10, 2022 4 / 95
RCNN Architectures YOLO Segmentation
R-CNN
Abir Das (IIT Kharagpur) CS60010 March 03, 09 and 10, 2022 5 / 95
RCNN Architectures YOLO Segmentation
R-CNN
Abir Das (IIT Kharagpur) CS60010 March 03, 09 and 10, 2022 6 / 95
RCNN Architectures YOLO Segmentation
R-CNN
Abir Das (IIT Kharagpur) CS60010 March 03, 09 and 10, 2022 7 / 95
RCNN Architectures YOLO Segmentation
R-CNN
Abir Das (IIT Kharagpur) CS60010 March 03, 09 and 10, 2022 8 / 95
RCNN Architectures YOLO Segmentation
The parameters learned for this pipeline are: ConvNet, SVM Classifier and
Bounding-Box regressors
Abir Das (IIT Kharagpur) CS60010 March 03, 09 and 10, 2022 9 / 95
RCNN Architectures YOLO Segmentation
Abir Das (IIT Kharagpur) CS60010 March 03, 09 and 10, 2022 10 / 95
RCNN Architectures YOLO Segmentation
Abir Das (IIT Kharagpur) CS60010 March 03, 09 and 10, 2022 11 / 95
RCNN Architectures YOLO Segmentation
Abir Das (IIT Kharagpur) CS60010 March 03, 09 and 10, 2022 12 / 95
RCNN Architectures YOLO Segmentation
Girshick et al, “Rich feature hierarchies for accurate object detection and
semantic segmentation”, CVPR 2014.
Slide copyright Ross Girshick, 2015; source. Reproduced with permission.
Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - 59 May 10, 2018
Abir Das (IIT Kharagpur) CS60010 March 03, 09 and 10, 2022 13 / 95
RCNN Architectures YOLO Segmentation
Abir Das (IIT Kharagpur) CS60010 March 03, 09 and 10, 2022 14 / 95
RCNN Architectures YOLO Segmentation
Fast R-CNN
Fast R-CNN
source. Reproduced w
Abir Das (IIT Kharagpur) CS60010 March 03, 09 and 10, 2022 15 / 95
RCNN Architectures YOLO Segmentation
Fast R-CNN
Fast R-CNN
source. Reproduced w
Abir Das (IIT Kharagpur) CS60010 March 03, 09 and 10, 2022 16 / 95
RCNN Architectures YOLO Segmentation
Fast R-CNN
Fast R-CNN
source. Reproduced w
Abir Das (IIT Kharagpur) CS60010 March 03, 09 and 10, 2022 17 / 95
RCNN Architectures YOLO Segmentation
Fast R-CNN
CNN
Hi-res input image: Hi-res conv features: RoI conv features: Fully-connected layers expect
3 x 640 x 480 512 x 20 x 15; 512 x 7 x 7 low-res conv features:
with region for region proposal 512 x 7 x 7
proposal Projected region
proposal is e.g.
512 x 18 x 8
(varies per proposal)
Abir Das (IIT Kharagpur) CS60010 March 03, 09 and 10, 2022 18 / 95
RCNN Architectures YOLO Segmentation
Fast R-CNN
Fast R-CNN
Abir Das (IIT Kharagpur) CS60010 March 03, 09 and 10, 2022 19 / 95
RCNN Architectures YOLO Segmentation
Fast R-CNN
Fast R-CNN
(Training)
Abir Das (IIT Kharagpur) CS60010 March 03, 09 and 10, 2022 20 / 95
RCNN Architectures YOLO Segmentation
Fast R-CNN
Girshick et al, “Rich feature hierarchies for accurate object detection and semantic segmentation”, CVPR 2014.
He et al, “Spatial pyramid pooling in deep convolutional networks for visual recognition”, ECCV 2014
Girshick, “Fast R-CNN”, ICCV 2015
Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - 70 May 10, 2018
CS231n course, Stanford University
Abir Das (IIT Kharagpur) CS60010 March 03, 09 and 10, 2022 21 / 95
RCNN Architectures YOLO Segmentation
Fast R-CNN
Problem:
Runtime dominated
by region proposals!
Girshick et al, “Rich feature hierarchies for accurate object detection and semantic segmentation”, CVPR 2014.
He et al, “Spatial pyramid pooling in deep convolutional networks for visual recognition”, ECCV 2014
Girshick, “Fast R-CNN”, ICCV 2015
Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - 70 May 10, 2018
CS231n course, Stanford University
Abir Das (IIT Kharagpur) CS60010 March 03, 09 and 10, 2022 22 / 95
RCNN Architectures YOLO Segmentation
Fast R-CNN
Fast RCNN
Abir Das (IIT Kharagpur) CS60010 March 03, 09 and 10, 2022 23 / 95
RCNN Architectures YOLO Segmentation
Faster R-CNN
§ The bulk of the time at test time of Fast RCNN is dominated by the
region proposal generation.
§ As Fast RCNN saved computation by sharing the feature generation
for all proposals, can some sort of sharing of computation be done for
generating region proposals?
§ The solution is to use the same CNN for region proposal generation
too.
Abir Das (IIT Kharagpur) CS60010 March 03, 09 and 10, 2022 24 / 95
RCNN Architectures YOLO Segmentation
Faster R-CNN
§ The bulk of the time at test time of Fast RCNN is dominated by the
region proposal generation.
§ As Fast RCNN saved computation by sharing the feature generation
for all proposals, can some sort of sharing of computation be done for
generating region proposals?
§ The solution is to use the same CNN for region proposal generation
too.
§ S Ren and K He and R Girshick and J Sun, ‘Faster R-CNN: Towards
Real-Time Object Detection with Region Proposal Networks’, NIPS
2015
Abir Das (IIT Kharagpur) CS60010 March 03, 09 and 10, 2022 24 / 95
RCNN Architectures YOLO Segmentation
Faster R-CNN
§ The bulk of the time at test time of Fast RCNN is dominated by the
region proposal generation.
§ As Fast RCNN saved computation by sharing the feature generation
for all proposals, can some sort of sharing of computation be done for
generating region proposals?
§ The solution is to use the same CNN for region proposal generation
too.
§ S Ren and K He and R Girshick and J Sun, ‘Faster R-CNN: Towards
Real-Time Object Detection with Region Proposal Networks’, NIPS
2015
§ The region proposal generation part is termed as the Region Proposal
Network (RPN)
Abir Das (IIT Kharagpur) CS60010 March 03, 09 and 10, 2022 24 / 95
RCNN Architectures YOLO Segmentation
Faster R-CNN
§ The bulk of the time at test time of Fast RCNN is dominated by the
region proposal generation.
§ As Fast RCNN saved computation by sharing the feature generation
for all proposals, can some sort of sharing of computation be done for
generating region proposals?
§ The solution is to use the same CNN for region proposal generation
too.
§ S Ren and K He and R Girshick and J Sun, ‘Faster R-CNN: Towards
Real-Time Object Detection with Region Proposal Networks’, NIPS
2015
§ The region proposal generation part is termed as the Region Proposal
Network (RPN)
Abir Das (IIT Kharagpur) CS60010 March 03, 09 and 10, 2022 24 / 95
RCNN Architectures YOLO Segmentation
Faster R-CNN
§ The bulk of the time at test time of Fast RCNN is dominated by the
region proposal generation.
§ As Fast RCNN saved computation by sharing the feature generation
for all proposals, can some sort of sharing of computation be done for
generating region proposals?
§ The solution is to use the same CNN for region proposal generation
too.
§ S Ren and K He and R Girshick and J Sun, ‘Faster R-CNN: Towards
Real-Time Object Detection with Region Proposal Networks’, NIPS
2015
§ The region proposal generation part is termed as the Region Proposal
Network (RPN)
Abir Das (IIT Kharagpur) CS60010 March 03, 09 and 10, 2022 24 / 95
RCNN Architectures YOLO Segmentation
Faster R-CNN
Abir Das (IIT Kharagpur) CS60010 March 03, 09 and 10, 2022 25 / 95
RCNN Architectures YOLO Segmentation
Faster R-CNN
Abir Das (IIT Kharagpur) CS60010 March 03, 09 and 10, 2022 25 / 95
RCNN Architectures YOLO Segmentation
Faster R-CNN
Abir Das (IIT Kharagpur) CS60010 March 03, 09 and 10, 2022 25 / 95
RCNN Architectures YOLO Segmentation
Faster R-CNN
Abir Das (IIT Kharagpur) CS60010 March 03, 09 and 10, 2022 26 / 95
RCNN Architectures YOLO Segmentation
Faster R-CNN
§ But how do we get the ground truth data to train the RPN.
Input
Faster R-CNN
§ But how do we get the ground truth data to train the RPN.
Max-pool
Conv
Input
Faster R-CNN
§ But how do we get the ground truth data to train the RPN.
Conv
Input
Faster R-CNN
§ But how do we get the ground truth data to train the RPN.
Input
Faster R-CNN
§ But how do we get the ground truth data to train the RPN.
Conv
This cell corresponds to a patch in the
original image
Consider the center of this patch
Input
Faster R-CNN
§ But how do we get the ground truth data to train the RPN.
Conv
This cell corresponds to a patch in the
original image
Consider the center of this patch
Input
We consider anchor boxes of different
sizes
Faster R-CNN
§ But how do we get the ground truth data to train the RPN.
classification
Input
Faster R-CNN
§ But how do we get the ground truth data to train the RPN.
classification regression
Faster R-CNN
Faster R-CNN
Abir Das (IIT Kharagpur) CS60010 March 03, 09 and 10, 2022 36 / 95
RCNN Architectures YOLO Segmentation
Faster R-CNN
Fast RCNN
Faster RCNN
Abir Das (IIT Kharagpur) CS60010 March 03, 09 and 10, 2022 37 / 95
RCNN Architectures YOLO Segmentation
YOLO
Abir Das (IIT Kharagpur) CS60010 March 03, 09 and 10, 2022 38 / 95
RCNN Architectures YOLO Segmentation
YOLO
Abir Das (IIT Kharagpur) CS60010 March 03, 09 and 10, 2022 38 / 95
RCNN Architectures YOLO Segmentation
YOLO
Abir Das (IIT Kharagpur) CS60010 March 03, 09 and 10, 2022 38 / 95
RCNN Architectures YOLO Segmentation
YOLO
S × S grid on input
Abir Das (IIT Kharagpur) CS60010 March 03, 09 and 10, 2022 39 / 95
RCNN Architectures YOLO Segmentation
YOLO
S × S grid on input
Abir Das (IIT Kharagpur) CS60010 March 03, 09 and 10, 2022 40 / 95
RCNN Architectures YOLO Segmentation
YOLO
S × S grid on input
Abir Das (IIT Kharagpur) CS60010 March 03, 09 and 10, 2022 41 / 95
RCNN Architectures YOLO Segmentation
YOLO
S × S grid on input
Abir Das (IIT Kharagpur) CS60010 March 03, 09 and 10, 2022 42 / 95
RCNN Architectures YOLO Segmentation
YOLO
S × S grid on input
Abir Das (IIT Kharagpur) CS60010 March 03, 09 and 10, 2022 43 / 95
RCNN Architectures YOLO Segmentation
YOLO
S × S grid on input
Abir Das (IIT Kharagpur) CS60010 March 03, 09 and 10, 2022 44 / 95
RCNN Architectures YOLO Segmentation
YOLO
Abir Das (IIT Kharagpur) CS60010 March 03, 09 and 10, 2022 45 / 95
RCNN Architectures YOLO Segmentation
YOLO
Abir Das (IIT Kharagpur) CS60010 March 03, 09 and 10, 2022 46 / 95
RCNN Architectures YOLO Segmentation
YOLO
Abir Das (IIT Kharagpur) CS60010 March 03, 09 and 10, 2022 47 / 95
RCNN Architectures YOLO Segmentation
YOLO
Abir Das (IIT Kharagpur) CS60010 March 03, 09 and 10, 2022 48 / 95
RCNN Architectures YOLO Segmentation
YOLO
Abir Das (IIT Kharagpur) CS60010 March 03, 09 and 10, 2022 49 / 95
RCNN Architectures YOLO Segmentation
YOLO
S × S grid on input
YOLO
S × S grid on input
YOLO
S × S grid on input
YOLO
S × S grid on input
YOLO
S × S grid on input
YOLO
S × S grid on input
YOLO
S × S grid on input
YOLO
S × S grid on input
YOLO
S × S grid on input
YOLO
S × S grid on input
YOLO
Abir Das (IIT Kharagpur) CS60010 March 03, 09 and 10, 2022 60 / 95
RCNN Architectures YOLO Segmentation
How
Training YOLO · · Cons
ĉ ŵ ĥ x̂ ŷ `ˆ1 `ˆ2 `ˆk
of th
§ How do we train this network
The
§ Consider a cell such that a true bounding box corresponds to this celland
c, w,
S × S grid on input
§ Initially the network with random weights will produce some values
for these (5 + C) values
§ YOLO uses sum-squared error between the predictions and the
ground truth to calculate loss. The following lossesClass
areprobability
computed map
I Localization Loss
I Confidence Loss
Abir Das (IIT Kharagpur) CS60010 March 03, 09 and 10, 2022 61 / 95
RCNN Architectures YOLO Segmentation
Training YOLO
Classification Loss
S2
X X 2
1obj
i pi (c) − p̂i (c)
i=0 c∈classes
Abir Das (IIT Kharagpur) CS60010 March 03, 09 and 10, 2022 62 / 95
RCNN Architectures YOLO Segmentation
Training YOLO
Localization Loss: It measures the errors in the predicted bounding box
locations and size. The loss is computed for the one box that is
responsible for detecting the object.
2
S X
B
X 2
λcoord 1obj
ij
2
xi − x̂i + xi − x̂i
i=0 j=0
S2
B
XX p 2 p q
obj √ 2
+λcoord 1
ij wi − ŵi + hi − ĥi
i=0 j=0
where, 1obj
ij = 1, if j
th bounding box is responsible for detecting the
Training YOLO
where, 1obj
ij = 1, if j
th bounding box is responsible for detecting the
Abir Das (IIT Kharagpur) CS60010 March 03, 09 and 10, 2022 64 / 95
RCNN Architectures YOLO Segmentation
Training YOLO
where, 1obj
i = 1, if j
th bounding box is responsible for predicting ‘no
Abir Das (IIT Kharagpur) CS60010 March 03, 09 and 10, 2022 65 / 95
RCNN Architectures YOLO Segmentation
Training YOLO
Segmentation
on Tasks
Semantic
cation Object Instance
Segmentation
zation Detection Segmentation
GRASS, CAT,
T DOG, DOG,
TREE, SKYCAT DOG, DOG, CAT
Source: cs231n course, Stanford University
Abir Das
No objects, just pixels
(IIT Kharagpur) CS60010 March 03, 09 and 10, 2022 67 / 95
RCNN Architectures YOLO Segmentation
Segmentation
Full image
Cow
Cow
Grass
Problem: Very inefficient! Not
reusing shared features between
overlapping patches Farabet et al, “Learning Hierarchical Features for Scene Labeling,” TPAMI 2013
Pinheiro and Collobert, “Recurrent Convolutional Neural Networks for Scene Labeling”, ICML 2014
Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - 13 May 10, 2018
Source: cs231n course, Stanford University
Abir Das (IIT Kharagpur) CS60010 March 03, 09 and 10, 2022 68 / 95
RCNN Architectures YOLO Segmentation
Segmentation
Input:
Scores: Predictions:
3xHxW
CxHxW HxW
Convolutions:
DxHxW
Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - 15 May 10, 2018
Source: cs231n course, Stanford University
Abir Das (IIT Kharagpur) CS60010 March 03, 09 and 10, 2022 69 / 95
RCNN Architectures YOLO Segmentation
Segmentation
Input:
Scores: Predictions:
3xHxW
CxHxW HxW
Convolutions:
Problem: convolutions at DxHxW
original image resolution will
be very expensive ...
Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - 15 May 10, 2018
Source: cs231n course, Stanford University
Abir Das (IIT Kharagpur) CS60010 March 03, 09 and 10, 2022 70 / 95
RCNN Architectures YOLO Segmentation
Segmentation
Low-res:
D3 x H/4 x W/4
Input: High-res: High-res: Predictions:
3xHxW D1 x H/2 x W/2 D1 x H/2 x W/2 HxW
Long, Shelhamer, and Darrell, “Fully Convolutional Networks for Semantic Segmentation”, CVPR 2015
Noh et al, “Learning Deconvolution Network for Semantic Segmentation”, ICCV 2015
Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - 17 May 10, 2018
Abir Das (IIT Kharagpur) CS60010 March 03, 09 and 10, 2022 71 / 95
RCNN Architectures YOLO Segmentation
Segmentation
1 2 1 1 2 2 1 2 0 0 0 0
3 4 3 3 4 4 3 4 3 0 4 0
3 3 4 4 0 0 0 0
Segmentation
3 5 2 1 1 2 0 1 0 0
5 6
… 3 4
1 2 2 1 7 8 0 0 0 0
Rest of the network
7 3 4 8 3 0 0 4
Corresponding pairs of
downsampling and
upsampling layers
Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - 19 May 10, 2018
Source: cs231n course, Stanford University
Abir Das (IIT Kharagpur) CS60010 March 03, 09 and 10, 2022 73 / 95
RCNN Architectures YOLO Segmentation
Segmentation
Dot product
between filter
and input
Input: 4 x 4 Output: 4 x 4
Abir Das (IIT Kharagpur) CS60010 March 03, 09 and 10, 2022 74 / 95
RCNN Architectures YOLO Segmentation
Segmentation
Dot product
between filter
and input
Input: 4 x 4 Output: 4 x 4
Abir Das (IIT Kharagpur) CS60010 March 03, 09 and 10, 2022 75 / 95
RCNN Architectures YOLO Segmentation
Segmentation
Dot product
between filter
and input
Input: 4 x 4 Output: 4 x 4
Abir Das (IIT Kharagpur) CS60010 March 03, 09 and 10, 2022 76 / 95
RCNN Architectures YOLO Segmentation
Segmentation
Dot product
between filter
and input
Input: 4 x 4 Output: 2 x 2
Segmentation
Dot product
between filter
and input
Input: 4 x 4 Output: 2 x 2
Segmentation
Dot product
between filter
and input
Input: 4 x 4 Output: 2 x 2
Segmentation
Input gives
weight for
filter
Input: 2 x 2 Output: 4 x 4
Segmentation
Input: 2 x 2 Output: 4 x 4
Segmentation
Input: 2 x 2 Output: 4 x 4
Segmentation
Low-res:
D3 x H/4 x W/4
Input: High-res: High-res: Predictions:
3xHxW D1 x H/2 x W/2 D1 x H/2 x W/2 HxW
Semantic
Fei-Fei Li &Segmentation’, CVPR Yeung
Justin Johnson & Serena 2015 Lecture 11 - 36 May 10, 2018
Source: cs231n course, Stanford University
Abir Das (IIT Kharagpur) CS60010 March 03, 09 and 10, 2022 83 / 95
RCNN Architectures YOLO Segmentation
Abir Das (IIT Kharagpur) CS60010 March 03, 09 and 10, 2022 84 / 95
RCNN Architectures YOLO Segmentation
Segmentation: SegNet
Abir Das (IIT Kharagpur) CS60010 March 03, 09 and 10, 2022 85 / 95
RCNN Architectures YOLO Segmentation
Instance Segmentation
Visual Perception Problems
Person 4 Person 5
Person 1
Person
Person 2 Person 3
✓ ✓ ?
§ Instance segmentation not only wants to detect individual object
instances but also wants to have a segmentation mask of each
instance
§ What can be a naive idea?
Source: Kaiming He, ICCV 2017
Abir Das (IIT Kharagpur) CS60010 March 03, 09 and 10, 2022 86 / 95
RCNN Architectures YOLO Segmentation
Mask R-CNN
Instance Segmentation
FCN on RoI
Abir Das (IIT Kharagpur) CS60010 March 03, 09 and 10, 2022 87 / 95
RCNN Architectures YOLO Segmentation
? ? ? person
? ?
(proposals)
Abir Das (IIT Kharagpur) CS60010 March 03, 09 and 10, 2022 88 / 95
RCNN Architectures YOLO Segmentation
Parallel Heads
• Easy, fast to implement and train
cls
step1 cls
cls bbox
Feat. Feat. Feat.
step2 reg
bbox bbox
reg reg
mask
Abir Das (IIT Kharagpur) CS60010 March 03, 09 and 10, 2022 89 / 95
RCNN Architectures YOLO Segmentation
§ Mask R-CNN adopts the same two-stage procedure with identical first
stage [i.e., RPN] as R-CNN
§ In second stage in addition to class prediction and bounding box
regression Mask-RCNN, in parallel, outputs a binary mask for each
RoI
§ The mask branch has Km2 dimensional output for each RoI [binary
mask of m × m resolution one for each K classes] boxes
§ RoIPool breaks pixel-to-pixel translation-equivariance
Abir Das (IIT Kharagpur) CS60010 March 03, 09 and 10, 2022 90 / 95
RCNN Architectures YOLO Segmentation
see also “What is wrong with convolutional neural nets?”, Geoffrey Hinton, 2017
Instance Segmentation: Mask-RCNN
RoIAlign vs. RoIPool
• RoIPool breaks pixel-to-pixel translation-equivariance 😖
😖
RoIPool coordinate
quantization
quantized RoI
original RoI
Abir Das (IIT Kharagpur) CS60010 March 03, 09 and 10, 2022 91 / 95
RCNN Architectures YOLO Segmentation
Abir Das (IIT Kharagpur) CS60010 March 03, 09 and 10, 2022 92 / 95
RCNN Architectures YOLO Segmentation
disconnected
object
Abir Das (IIT Kharagpur) CS60010 March 03, 09 and 10, 2022 93 / 95
RCNN Architectures YOLO Segmentation
Abir Das (IIT Kharagpur) CS60010 March 03, 09 and 10, 2022 94 / 95
RCNN Architectures YOLO Segmentation
not a kite
Abir Das (IIT Kharagpur) CS60010 March 03, 09 and 10, 2022 95 / 95