
7/20/23, 4:54 PM Understanding Mask R-CNN Basic Architecture - ResNet backbone - RPN - RoIAlign - object detection branch - mask



Understanding Mask R-CNN Basic

Basic architecture of Mask R-CNN network and the ideas behind it

Nov 14, 2021 by Xiang Zhang

Mask R-CNN is a popular deep learning framework for instance segmentation in
computer vision. It adds a fully convolutional network (FCN) branch to
Faster R-CNN to generate a mask for each object, whereas Faster R-CNN, Fast
R-CNN and R-CNN only perform bounding-box object detection. Mask R-CNN is
composed of these parts: a backbone, a Region Proposal Network (RPN), a
Region of Interest alignment layer (RoIAlign), a bounding-box object detection
head and a mask generation head. The first four correspond to the Faster R-CNN
model, with RoIAlign taking the place of the RoIPool layer used in the original
Faster R-CNN. The overall structure is illustrated by the following figure.

1. Backbone
A backbone is the main feature extractor of Mask R-CNN. Common choices of
this part are residual networks (ResNets) with or without FPN. For simplicity,
we take ResNet without FPN as a backbone. When we feed a raw image into a
ResNet backbone, data goes through multiple residual bottleneck blocks, and
turns into a feature map.

As the above figure shows, multiple residual bottleneck blocks with different
channel d/d' configurations are stacked to make a deep residual network. In one
bottleneck block, inputs go through two paths. One is a stack of convolutional
layers and the other is an identity shortcut connection. The outputs of both
paths are then added element-wise. In this way, gradients can propagate through
blocks easily, and a block can easily learn an identity function by driving the
output of its convolutional path towards zero.
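
The shortcut idea can be sketched in a few lines of Python. This is only a
minimal illustration, not an actual ResNet bottleneck: the convolutional path is
replaced by a hypothetical stand-in `residual_path` (a scaled ReLU with a single
weight `w`), an assumption made purely to keep the example small.

```python
def residual_path(x, w):
    """Stand-in for the convolutional path F(x); a scaled ReLU (an assumption)."""
    return [w * max(v, 0.0) for v in x]


def residual_block(x, w):
    """Output = F(x) + x: the two paths are added element-wise."""
    fx = residual_path(x, w)
    return [a + b for a, b in zip(fx, x)]


x = [1.0, -2.0, 3.0]
# With zero weights the convolutional path outputs zeros,
# so the block reduces to the identity function.
identity_out = residual_block(x, w=0.0)
# With non-zero weights the block adds a correction on top of its input.
learned_out = residual_block(x, w=0.5)
```

The block only has to learn the difference between its input and the desired
output, which is why an identity mapping is easy to represent.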

The feature map from the final convolutional layer of the backbone contains
abstract information about an image, e.g., different object instances, their
classes and spatial properties. It is then fed to the RPN.

2. RPN
RPN stands for Region Proposal Network. It scans the feature map and proposes
regions that may contain objects (Regions of Interest, or RoIs).

Concretely, a convolutional layer processes the feature map and outputs a
c-channel tensor in which each spatial vector (also with c channels) is
associated with an anchor center. A set of anchor boxes with different scales
and aspect ratios is generated for each anchor center. These anchor boxes are
evenly distributed over the whole image and cover it completely. Then two
sibling 1 by 1 convolutional layers process the c-channel tensor. One is a binary
classifier that predicts whether each anchor box contains an object. It maps
each c-channel vector to a k-channel vector (representing the k anchor boxes
with different scales and aspect ratios that share one anchor center). The other
is an object bounding-box regressor that predicts the offsets between the true
object bounding-box and the anchor box. It maps each c-channel vector to a
4k-channel vector. Among overlapping bounding-boxes that may suggest the same
object, we keep the one with the highest objectness score and drop the others.
This is the non-maximum suppression (NMS) process.
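
Anchor generation and non-maximum suppression can be sketched as follows. This
is a rough pure-Python illustration under stated assumptions, not the code of
any particular library: the function names, the (x1, y1, x2, y2) box format, the
height/width ratio convention and the IoU thresholds are all choices made for
this example.

```python
import math


def make_anchors(cx, cy, scales, ratios):
    """Return k = len(scales) * len(ratios) boxes (x1, y1, x2, y2) sharing one center."""
    anchors = []
    for s in scales:
        for r in ratios:  # r = height / width (a convention assumed here)
            w = s / math.sqrt(r)
            h = s * math.sqrt(r)
            anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return anchors


def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.7):
    """Greedy NMS: keep the best-scoring box, drop boxes that overlap it too much, repeat."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_threshold]
    return keep


# k = 2 scales * 3 ratios = 6 anchors around one anchor center.
anchors = make_anchors(cx=50, cy=50, scales=[32, 64], ratios=[0.5, 1.0, 2.0])

# The first two boxes overlap heavily, so only the higher-scoring one survives.
boxes = [(10, 10, 50, 50), (12, 12, 52, 52), (100, 100, 140, 140)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores, iou_threshold=0.5)
```

Here `kept` holds the indices of the surviving boxes, which would be passed on
as proposed RoIs.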

In this way, we get a set of proposed RoIs. The next step is to find exactly
where each RoI lies in the feature map. This step is called RoIAlign.

3. RoIAlign
RoIAlign, or Region of Interest alignment, extracts feature vectors from a
feature map based on the RoIs proposed by the RPN, and turns them into a
fixed-size tensor for further processing.

This operation is illustrated by the above figure. We align RoIs with their
corresponding areas in the feature map by scaling. These regions come in
different locations, scales and aspect ratios. To get feature tensors of uniform
shape, we sample over the relevant aligned areas of the feature map. The
white-bordered grid represents the feature map. The black-bordered grids
represent RoIs. We divide each RoI into a fixed number of bins. In each bin,
there are 4 dots representing sample locations. For each dot, we take the
feature vectors at the surrounding feature map grid points and use their
bilinear interpolation as the dot vector. Then we pool the dot vectors within
each bin to get a smaller fixed-size feature map for each RoI. Next, we pass
each RoI's feature map through a set of residual bottleneck blocks to extract
features further. The results represent every RoI's finer feature map and are
processed by the two following parallel branches: the object detection branch
and the mask generation branch.
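
The sampling procedure can be sketched for a single-channel feature map as
below. This is a simplified illustration under several assumptions: the helper
names, the `spatial_scale` parameter, the use of integer grid indices as
coordinates, and average pooling of the samples in each bin are choices made for
this example, and real implementations handle multiple channels and batches.

```python
import math


def bilinear(fmap, y, x):
    """Bilinearly interpolate a 2-D feature map (list of rows) at a fractional (y, x)."""
    h, w = len(fmap), len(fmap[0])
    y = min(max(y, 0.0), h - 1.0)
    x = min(max(x, 0.0), w - 1.0)
    y0, x0 = int(math.floor(y)), int(math.floor(x))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    dy, dx = y - y0, x - x0
    top = fmap[y0][x0] * (1 - dx) + fmap[y0][x1] * dx
    bottom = fmap[y1][x0] * (1 - dx) + fmap[y1][x1] * dx
    return top * (1 - dy) + bottom * dy


def roi_align(fmap, roi, out_size=2, samples=2, spatial_scale=1.0):
    """Pool one RoI (x1, y1, x2, y2 in image coordinates) into an out_size x out_size grid."""
    # Scale to feature-map coordinates without rounding: fractional positions are kept.
    x1, y1, x2, y2 = [c * spatial_scale for c in roi]
    bin_h = (y2 - y1) / out_size
    bin_w = (x2 - x1) / out_size
    out = []
    for by in range(out_size):
        row = []
        for bx in range(out_size):
            vals = []
            for sy in range(samples):
                for sx in range(samples):
                    # samples x samples regularly spaced points inside the bin
                    py = y1 + (by + (sy + 0.5) / samples) * bin_h
                    px = x1 + (bx + (sx + 0.5) / samples) * bin_w
                    vals.append(bilinear(fmap, py, px))
            row.append(sum(vals) / len(vals))  # average pooling of the samples
        out.append(row)
    return out


# Toy 4x4 feature map with value x + 10 * y at grid point (y, x).
fmap = [[x + 10 * y for x in range(4)] for y in range(4)]
# An RoI in image coordinates, mapped to the feature map with stride 16.
pooled = roi_align(fmap, roi=(8, 8, 40, 40), out_size=2, samples=2, spatial_scale=1 / 16)
```

With `samples=2` each bin contains the 4 sample dots described above, and every
RoI yields the same `out_size` by `out_size` output regardless of its shape.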

4. Object detection branch

Once we have each RoI's feature map, we can predict its object category and a
finer instance bounding-box. This branch is a fully-connected layer that maps
feature vectors to scores for the final n classes and to 4n instance
bounding-box coordinates (4 per class).
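
The shape of this mapping can be sketched as below. It is a minimal illustration
under assumptions: each RoI feature map is taken to be already flattened into a
vector of length `d`, the weights are random rather than trained, and the
function names and sizes are hypothetical.

```python
import math
import random


def linear(x, weight, bias):
    """Fully connected layer: y = W x + b, with W stored as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def detection_head(feature, cls_w, cls_b, box_w, box_b):
    """Map one RoI feature vector to class probabilities and per-class box values."""
    probs = softmax(linear(feature, cls_w, cls_b))  # n values
    box_flat = linear(feature, box_w, box_b)        # 4n values
    boxes = [box_flat[4 * c: 4 * c + 4] for c in range(len(probs))]
    best = max(range(len(probs)), key=lambda c: probs[c])
    # The box reported for the RoI is the one belonging to the predicted class.
    return best, probs, boxes[best]


def random_matrix(rows, cols):
    """Random weights standing in for trained parameters (an assumption)."""
    return [[random.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


random.seed(0)
d, n = 8, 3  # feature length and number of classes (hypothetical sizes)
feature = [random.uniform(-1, 1) for _ in range(d)]
cls_id, probs, box = detection_head(
    feature,
    random_matrix(n, d), [0.0] * n,
    random_matrix(4 * n, d), [0.0] * (4 * n),
)
```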

5. Mask generation branch

On the mask generation branch, we feed the RoI feature map to a transposed
convolutional layer and then a convolutional layer. This branch is a fully
convolutional network. It generates one binary segmentation mask per class.
We then pick the output mask according to the class predicted by the object
detection branch. In this way, the per-pixel mask prediction avoids competition
between different classes.
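
The selection step can be sketched as follows. This is a small illustration
under assumptions: the mask logits are given as nested lists indexed
`[class][y][x]`, and the function name and the 0.5 threshold are choices made
for this example.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def select_mask(mask_logits, class_id, threshold=0.5):
    """mask_logits[c][y][x] holds one logit map per class; return the binary mask of class_id."""
    chosen = mask_logits[class_id]
    # Per-pixel sigmoid: each class is scored independently, so classes do not
    # compete the way they would under a softmax across the class axis.
    return [[1 if sigmoid(v) > threshold else 0 for v in row] for row in chosen]


mask_logits = [
    [[2.0, -1.0], [0.5, -3.0]],  # class 0
    [[-2.0, 4.0], [1.0, 1.0]],   # class 1
]
# class_id would come from the object detection branch.
mask = select_mask(mask_logits, class_id=1)
```

Only the mask of the predicted class is used; the masks of the other classes are
ignored for that RoI.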

6. Summary
That covers the basic architecture of Mask R-CNN. We conclude by reviewing
some aspects of it.

1. The whole model can be divided into two stages: the first stage proposes
Regions of Interest, and the second stage predicts classes, bounding-boxes and
masks for the RoIs.
2. Instance mask generation is achieved by combining bounding-box object
detection and binary mask generation for each class, then relying on class
prediction to select the mask.
3. FPN can bring gains in average precision. The feature map and RoIAlign
should change accordingly.

That's it for this blog. I hope it's useful to you. If you have any suggestions, or
want to quote this blog, please leave a message below. Thanks for reading.

Published by Xiang Zhang

Hi everyone! My name is Xiang Zhang. I am passionate about the huge progress
that deep learning has brought to various fields. I like studying them and
sharing my learning experience.
