Mask RCNN
https://www.shuffleai.blog/blog/Understanding_Mask_R-CNN_Basic_Architecture.html 1/7
7/20/23, 4:54 PM Understanding Mask R-CNN Basic Architecture - ResNet backbone - RPN - RoIAlign - object detection branch - mask genera…
1. Backbone
The backbone is the main feature extractor of Mask R-CNN. Common choices for
this part are residual networks (ResNets) with or without a Feature Pyramid
Network (FPN). For simplicity, we take a ResNet without FPN as the backbone.
When we feed a raw image into a ResNet backbone, the data goes through multiple
residual bottleneck blocks and turns into a feature map.
As the above figure shows, multiple residual bottleneck blocks with different
channel d/d' configurations are stacked to make a deep residual network. In one
bottleneck block, the input goes through two paths: one is a stack of
convolutional layers and the other is an identity shortcut connection. The
outputs of both paths are then added element-wise. In this way, gradients can
propagate through blocks easily, and a block can easily learn an identity
function.
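The two-path structure can be sketched in a few lines of plain Python. This is only an illustration of the element-wise addition, not a real bottleneck block: `residual_block`, `zero_transform` and `toy_transform` are hypothetical names, and the "convolutional path" is replaced by simple stand-in functions on a flat list of numbers.

```python
def residual_block(x, transform):
    """Return transform(x) + x, added element-wise (the shortcut is the identity)."""
    fx = transform(x)
    if len(fx) != len(x):
        raise ValueError("both paths must produce the same shape")
    return [f + s for f, s in zip(fx, x)]


def zero_transform(x):
    # Stand-in for a convolutional path whose weights are all zero.
    return [0.0 for _ in x]


def toy_transform(x):
    # Hypothetical stand-in for the convolutional path (not a real convolution).
    return [0.5 * v for v in x]


x = [1.0, -2.0, 3.0]
identity_out = residual_block(x, zero_transform)  # the block passes x through unchanged
toy_out = residual_block(x, toy_transform)        # shortcut plus transformed input
```

When the convolutional path outputs zeros, the block reduces to the identity, which is why learning an identity function is easy for this design.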
The feature map from the final convolutional layer of the backbone contains
abstract information about the image, e.g., different object instances, their
classes and spatial properties. It is then fed to the RPN.
2. RPN
RPN stands for Region Proposal Network. Its function is to scan the feature
map and propose regions that may contain objects (Regions of Interest, or
RoIs).
we select ones with the highest objectness score, and drop the others. This is
the non-maximum suppression (NMS) process.
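As a rough illustration, here is a minimal sketch of the standard greedy form of non-maximum suppression in plain Python. The function names `iou` and `nms`, the `(x1, y1, x2, y2)` box format and the default IoU threshold of 0.7 are assumptions made for this example and do not come from this post.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.7):
    """Greedy NMS: keep the best-scoring box, drop boxes that overlap it, repeat."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        # Only boxes that do not overlap the kept box too much stay in the queue.
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_threshold]
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores, iou_threshold=0.5)  # indices of the surviving boxes
```

In this toy input the second box overlaps the first heavily and has a lower score, so it is dropped, while the third box does not overlap anything and survives.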
This gives us a set of proposed RoIs. The next step is to find where exactly
each RoI is in the feature map. This step is called RoIAlign.
3. RoIAlign
RoIAlign, or Region of Interest alignment, extracts features from the feature
map for each RoI proposed by the RPN, and turns them into a fixed-size tensor
for further processing.
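To make the idea concrete, here is a simplified single-channel sketch in plain Python. It assumes the RoI is given in image pixels and mapped to the feature map by dividing by a `stride`, that each bin is sampled at 2 x 2 regularly spaced points using bilinear interpolation, and that the samples are averaged. These choices, the index-based coordinate convention, and the names `bilinear` and `roi_align` are assumptions for illustration, not a reference implementation.

```python
import math


def bilinear(feature, y, x):
    """Bilinearly interpolate a 2D feature map (list of rows) at a real-valued point."""
    h, w = len(feature), len(feature[0])
    y = min(max(y, 0.0), h - 1.0)  # clamp to the valid index range
    x = min(max(x, 0.0), w - 1.0)
    y0, x0 = int(math.floor(y)), int(math.floor(x))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    dy, dx = y - y0, x - x0
    top = feature[y0][x0] * (1 - dx) + feature[y0][x1] * dx
    bottom = feature[y1][x0] * (1 - dx) + feature[y1][x1] * dx
    return top * (1 - dy) + bottom * dy


def roi_align(feature, roi, stride, out_size=2, samples=2):
    """Pool one RoI (x1, y1, x2, y2 in image pixels) into an out_size x out_size grid."""
    # Scale image coordinates to feature-map coordinates without rounding.
    x1, y1, x2, y2 = (c / stride for c in roi)
    bin_h, bin_w = (y2 - y1) / out_size, (x2 - x1) / out_size
    out = []
    for by in range(out_size):
        row = []
        for bx in range(out_size):
            total = 0.0
            # samples x samples regularly spaced points inside this bin
            for sy in range(samples):
                for sx in range(samples):
                    py = y1 + (by + (sy + 0.5) / samples) * bin_h
                    px = x1 + (bx + (sx + 0.5) / samples) * bin_w
                    total += bilinear(feature, py, px)
            row.append(total / (samples * samples))
        out.append(row)
    return out


# Toy 4x4 feature map whose value at (row, col) is 4 * row + col.
feature = [[4 * r + c for c in range(4)] for r in range(4)]
pooled = roi_align(feature, roi=(0, 0, 32, 32), stride=16)
other = roi_align(feature, roi=(8, 0, 40, 48), stride=16)
```

Both RoIs have different sizes and aspect ratios, yet `pooled` and `other` are both 2 x 2 grids, which is the fixed-size property the following layers rely on.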
This operation is illustrated by the above figure. We align RoIs with their
corresponding areas in the feature map by scaling. These regions come in
different locations, scales and aspect ratios. To get feature tensors of uniform
shape, we sample over the relevant aligned areas of the feature map. The white-
bordered grid represents the feature map. The black-bordered grids represent
RoIs. We divide each RoI into a fixed number of bins. In each bin, there are 4
6. Summary
That covers the basic architecture of Mask R-CNN. We conclude by reviewing
some aspects of it.
1. The whole model can be divided into two stages: the first stage proposes
Regions of Interest, and the second stage predicts classes, bounding boxes and
masks for the RoIs.
2. Instance mask generation is achieved by combining bounding-box object
detection with binary mask generation for each class, then relying on the class
prediction to select the mask.
3. FPN can bring gains in average precision. The feature map and RoIAlign
should change accordingly.
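The mask selection described in point 2 can be sketched as follows for a single RoI. The function name `select_instance_mask`, the toy scores and masks, and the 0.5 binarization threshold are assumptions made for this example.

```python
def select_instance_mask(class_scores, class_masks, threshold=0.5):
    """Pick the mask of the highest-scoring class and binarize it."""
    # The classification branch decides which class the RoI belongs to.
    predicted = max(range(len(class_scores)), key=lambda k: class_scores[k])
    # The mask branch produced one soft mask per class; keep only that class's mask.
    soft_mask = class_masks[predicted]
    binary = [[1 if p >= threshold else 0 for p in row] for row in soft_mask]
    return predicted, binary


# One RoI, three classes, 2x2 masks of per-pixel probabilities (toy numbers).
class_scores = [0.1, 0.7, 0.2]
class_masks = [
    [[0.9, 0.9], [0.9, 0.9]],
    [[0.8, 0.3], [0.6, 0.1]],
    [[0.2, 0.2], [0.2, 0.2]],
]
label, mask = select_instance_mask(class_scores, class_masks)
```

The masks of the other classes are ignored even when their values are high, so the per-class masks never compete with each other; only the class prediction selects which one is used.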
That's it for this blog. I hope it's useful to you. If you have any suggestions, or
want to quote this blog, please leave a message below. Thanks for reading.