
Object Detection and

Localization

Academic session 2022/2023


What Is Object Detection and
Localization

For object detection, we need to classify the objects in an image and also find their bounding boxes (i.e. what is in the image and where the objects are).

Localization task: find a pre-defined object in an image. Note that localization itself is not concerned with object recognition, whose task is to identify an object as belonging to one of N classes.

1. Training phase: produce bounding boxes on the training images to generate the pose coordinates of each object in the scene.
2. Test phase: simultaneously detect and localize each object present in the image.
Evolution of Object Detection
with Deep Learning Networks

[Timeline figure: detection performance over time, from R-CNN (CVPR 2014) and SPP-Net, through Fast R-CNN (ICCV 2015), Faster R-CNN (NIPS 2015), YOLO, SSD (ECCV 2016) and R-FCN (NIPS 2016), to RetinaNet (2017), with possible future directions such as Capsule Networks, Transformers, and GPT-3-era models (towards 2020).]
Approaches to Object Detection

In general, there are two different approaches to object detection:

1. Make a fixed number of predictions on a grid (one stage), or
2. Leverage a proposal network to find objects and then use a second network to fine-tune these proposals and output a final prediction (two stage).
The Task of Object Detection
The goal is to detect instances of a predefined set of
object classes (e.g. {people, cars, bikes, animals}) and
describe the locations of each detected object in the
image using a bounding box

Images from PASCAL VOC data set


Why Object Detection is Tricky
An object classifier does this:

While an object detector does this:


Main Feature and Limitation

In the one-stage approach, there is no intermediate task which must be performed in order to produce an output. This leads to a simpler and faster model architecture.

However, one-stage detectors can sometimes struggle to be flexible enough to adapt to arbitrary tasks; extensive retraining may be necessary.
Predictions on a Grid:
Start with a Backbone Network

To describe what's in an image, feed the input through a standard convolutional network to build a rich feature representation of the original image.

This is the "backbone" network, which is usually pre-trained as an image classifier to learn how to extract features from an image.
Recall our CNN:

[Figure: a standard CNN, with feature-extraction (convolutional) layers followed by classification layers. The feature-extraction part is our backbone network.]
Predictions on a Grid:
Use a Backbone Network
A very large labelled dataset (such as ImageNet) can be used to train
the backbone network in order to learn good feature representations.
Predictions on a Grid:
The Backbone Network Output
After pre-training, remove the last few layers of the network. The backbone network now outputs a collection of stacked feature maps which describe the original image at a low spatial resolution but high feature (channel) resolution (7x7x512 in this network).
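A minimal sketch of this in PyTorch, assuming a VGG-16 backbone from a recent torchvision version (the choice of backbone is illustrative, not mandated by any particular detector):

import torch
import torchvision

# Load a classifier pre-trained on ImageNet and keep only its
# convolutional feature extractor (drop the classification layers).
vgg = torchvision.models.vgg16(weights="IMAGENET1K_V1")
backbone = vgg.features          # the "backbone": convolutional layers only

x = torch.randn(1, 3, 224, 224)  # a dummy 224x224 RGB input image
feature_maps = backbone(x)
print(feature_maps.shape)        # torch.Size([1, 512, 7, 7]): 7x7x512 features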
Predictions on a Grid:
Relating Back to the Original Image
Predictions on a Grid:
Object is at the Centre Cell
Objects are roughly located in the coarse (7x7) feature maps at the cell
containing the centre of the bounding box annotation. This grid cell is
"responsible" for detecting that specific object.
Predictions on a Grid:
How to Detect Centre Cell
In order to detect the object, add another convolutional layer and learn the kernel parameters which combine the context of all 512 feature maps to produce an activation corresponding to the grid cell which contains the object.
Predictions on a Grid:
Multiple Activations

If the input image contains multiple objects, we should have multiple activations on our grid, denoting that an object is in each of the activated regions.
Predictions on a Grid:
How to Describe a Detected Object

An object cannot be sufficiently described with a single


activation. We need:

1. The likelihood that a grid cell contains an object (Pobj)


2. Bounding box descriptors x,y,width,height (tx, ty, tw, th)
3. Which class the object belongs to (C1, C2, … Cn)

Thus, a convolution filter is needed for each of the above


attributes such that we produce 5+C output channels to
describe a single bounding box at each grid cell location.
Predictions on a Grid:
How to Describe Detected Object

5+C convolutional filters
produce one bounding box
descriptor for each grid cell
Predictions on a Grid:
Multiple Objects on the Same Cell

Images might have multiple objects which "belong" to the same grid cell.
We can alter the layer to produce B(5+C) filters such that we can predict B
bounding boxes for each grid cell location
Predictions on a Grid:
Multiple Objects on the Same Cell
The model will always produce a fixed number of N×N×B predictions for a
given image. We then filter the predictions to only consider bounding boxes
with a probability above some defined threshold.
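A sketch of such a detection head in PyTorch (the 7x7 grid, B = 2 boxes per cell, C = 20 classes, and the 0.5 threshold are illustrative values, not a specific published architecture):

import torch
import torch.nn as nn

C = 20   # number of object classes (e.g. PASCAL VOC)
B = 2    # bounding boxes predicted per grid cell

# A 1x1 convolution over the 512 backbone feature maps gives every grid
# cell B * (5 + C) outputs: B x (P_obj, tx, ty, tw, th, C class scores).
detection_head = nn.Conv2d(512, B * (5 + C), kernel_size=1)

features = torch.randn(1, 512, 7, 7)      # backbone output
preds = detection_head(features)          # shape [1, B*(5+C), 7, 7]
preds = preds.view(1, B, 5 + C, 7, 7)     # one descriptor per box per cell

# Keep only boxes whose objectness probability exceeds a threshold.
obj_prob = torch.sigmoid(preds[:, :, 0])  # P_obj for every box and cell
keep = obj_prob > 0.5
print(keep.sum().item(), "candidate boxes kept")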
Predictions on a Grid:
Multiple Objects in Parallel
Multiple objects can be detected in parallel.
However, a large number of grid cells contain no object and this introduces a
large imbalance between the predicted bounding boxes which contain an object
and those which do not contain an object.
Predictions on a Grid:
Non-maximum Suppression
The approach thus far produces a fixed number of bounding box predictions for each image. BUT we would like to output bounding boxes only for objects that are actually likely to be in the image.

Non-max suppression is applied to each class separately. The goal is to remove redundant predictions.
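A minimal greedy NMS sketch in NumPy for a single class (box format [x1, y1, x2, y2]; the 0.5 IoU threshold is an illustrative default):

import numpy as np

def iou(box, boxes):
    # IoU between one box and an array of boxes, all as [x1, y1, x2, y2].
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter)

def nms(boxes, scores, iou_threshold=0.5):
    # Keep the highest-scoring box, drop boxes that overlap it too much, repeat.
    order = np.argsort(scores)[::-1]
    keep = []
    while len(order) > 0:
        best = order[0]
        keep.append(int(best))
        rest = order[1:]
        if len(rest) == 0:
            break
        overlaps = iou(boxes[best], boxes[rest])
        order = rest[overlaps < iou_threshold]
    return keep

# Two overlapping detections of the same object plus one separate detection.
boxes = np.array([[10, 10, 60, 60], [12, 12, 58, 62], [100, 100, 150, 150]], dtype=float)
scores = np.array([0.9, 0.8, 0.7])
print(nms(boxes, scores))   # -> [0, 2]: the weaker duplicate is suppressed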
Region Based Methods v.
One-Stage Methods

Region-proposal methods produce state-of-the-art results but are often too computationally expensive for real-time object detection, especially on embedded systems.

YOLO (You Only Look Once, Redmon et al. 2015) and SSD (Single Shot MultiBox Detector, Liu et al. 2015) solve this problem by predicting bounding box coordinates and probabilities for different categories in a single forward pass through the network. They are optimized for speed at the cost of accuracy.

Lin et al. (2017) explained why methods like SSD are less accurate than two-stage methods and proposed to address the problem by rescaling the loss function. The improvement, implemented as RetinaNet, means that single-shot methods are faster and now as accurate as two-stage methods.
YOLO: You-Only-Look-Once &
SSD: Single Shot MultiBox Detector

The idea is to divide the image into a grid of cells, then change the labels of the data such that we perform both localization and classification for each grid cell.
How to Get the Bounding Box
One of the things that may be difficult to understand at first is how the detection system converts the grid cells into an actual bounding box that fits the object.
Strategies to Define the
Bounding Box

Object detector strategies:

1. YOLO: uses a single activation map for the prediction of classes and bounding boxes
2. SSD: uses different activation maps (multiple scales) for the prediction of classes and bounding boxes

YOLO works similarly to SSD, with the difference that it uses fully connected layers at the end of the network instead of only convolutional layers. SSD seems superior.
The SSD Detector
An input image is passed through the truncated backbone network.
In this example, three more convolutions create three feature maps at
the top of the network with the shapes [256, 4, 4] (blue), [256, 2, 2]
(yellow), and finally [256, 1, 1] (green):
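A possible sketch of these extra layers in PyTorch (the [256, 8, 8] input shape and channel counts are assumptions chosen so that the shapes above work out; this is not the exact SSD configuration):

import torch
import torch.nn as nn

# Three stride-2 convolutions, each halving the spatial resolution.
extra_layers = nn.ModuleList([
    nn.Conv2d(256, 256, kernel_size=3, stride=2, padding=1),   # -> [256, 4, 4]
    nn.Conv2d(256, 256, kernel_size=3, stride=2, padding=1),   # -> [256, 2, 2]
    nn.Conv2d(256, 256, kernel_size=3, stride=2, padding=1),   # -> [256, 1, 1]
])

x = torch.randn(1, 256, 8, 8)        # assumed output of the truncated backbone
feature_maps = []
for layer in extra_layers:
    x = torch.relu(layer(x))
    feature_maps.append(x)           # multi-scale maps used for prediction

print([tuple(f.shape[1:]) for f in feature_maps])
# [(256, 4, 4), (256, 2, 2), (256, 1, 1)]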
SSD Receptive Field in the Last
Layer
The activations in the final layer have dependencies
on all activations in the previous layers and the
receptive field is thus the entire input image.
SSD Receptive Field in Other
Layers
Note that the receptive field of an activation in the
yellow layer is only one quarter of the input image
SSD Anchor Boxes
This leads to anchor or default boxes. Every default box needs n values that represent the probabilities that a certain class was detected in that box, plus 4 values that are no longer absolute coordinates of the predicted bounding box but rather offsets to the respective default box.

The important idea is that we do for every default box in the differently sized grids what we
did when predicting one single object in an image.
SSD Summary
In Summary:

1. We defined several grids of differently sized default boxes that will allow us to detect
objects at different scales in one single forward pass.

2. For each default box in every grid, the network outputs n class probabilities and 4
offsets to the respective default box coordinates which give the predicted bounding
box coordinates.
Matching Bounding Boxes:
Compare Predicted with Default

We want to match the ground truth bounding box to a default box that is "as similar" to it as possible. Two boxes are similar when they overlap as much as possible while having as little non-overlapping area as possible. This is measured by the Jaccard index, or IoU (Intersection over Union):

IoU(A, B) = area(A ∩ B) / area(A ∪ B)

The idea is that we want to compare the ground truth bounding boxes in the training
example to predictions made by default boxes that are already very similar to the ground
truth bounding boxes.
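A minimal sketch of this matching step in plain Python (boxes as [x1, y1, x2, y2]; the coordinates below are made-up illustrative values):

def iou(a, b):
    # Intersection over Union of two boxes given as [x1, y1, x2, y2].
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

# Match a ground-truth box to the default box it overlaps most.
ground_truth = [50, 60, 200, 220]
default_boxes = [[0, 0, 150, 150], [40, 40, 210, 230], [100, 100, 300, 300]]

best = max(range(len(default_boxes)),
           key=lambda i: iou(ground_truth, default_boxes[i]))
print(best, round(iou(ground_truth, default_boxes[best]), 2))   # 1  0.74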
One More Thing

For every anchor, SSD defines k default boxes of different aspect ratios and sizes instead of just one (here we just presented one, k = 1).

This means that we also need 4 × k predicted offsets to the respective default box coordinates instead of 4, and n × k class probabilities instead of n.

More boxes of different sizes and aspect ratios = better object detection. This detail is not too important for a general understanding of SSD, but it is important for implementation.

(See Fig. 1 in Liu et al. 2015 for an illustration of this.)
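A sketch of how k default boxes per grid cell might be generated (the scale and aspect ratios here are illustrative choices, not the exact values used in the SSD paper):

import itertools

def default_boxes(grid_size, scale, aspect_ratios=(1.0, 2.0, 0.5)):
    # Returns (cx, cy, w, h) boxes, normalised to [0, 1], with
    # k = len(aspect_ratios) boxes centred on every grid cell.
    boxes = []
    for i, j in itertools.product(range(grid_size), repeat=2):
        cx, cy = (j + 0.5) / grid_size, (i + 0.5) / grid_size
        for ar in aspect_ratios:
            w, h = scale * ar ** 0.5, scale / ar ** 0.5
            boxes.append((cx, cy, w, h))
    return boxes

# k = 3 boxes per cell on the 4x4 grid -> 48 default boxes at this scale.
print(len(default_boxes(grid_size=4, scale=0.3)))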


A Remaining Problem
SSD and YOLO need only one forward pass through the network to predict object bounding boxes and class probabilities.

SSD uses multi-scale convolutional feature maps at the top of the network, while YOLO uses fully connected layers. In general, SSD is faster and more accurate than YOLO.

Only remaining problem: region proposal methods such as R-CNN are more accurate.

Recall: methods like SSD or YOLO suffer from an extreme class imbalance: the detectors evaluate roughly between ten thousand and a hundred thousand candidate locations (far more than the 4×4 + 2×2 + 1 default boxes in the previous example shown here).
Cross Entropy Loss Function
Entropy is a measure of the uncertainty associated with a given distribution.

Cross-entropy is used as a loss function to measure the difference between the predicted probability distribution of a model and the true probability distribution of the data. It is used in classification problems.

The standard cross entropy loss function for multi-class classification problems is calculated by summing the individual losses for each class:

CE = −Σ_i y_i log(p_i)

where y_i is the true probability and p_i is the predicted probability for class i.
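A tiny worked example in Python (a three-class problem with made-up probabilities):

import math

def cross_entropy(y_true, p_pred):
    # CE = -sum_i y_i * log(p_i), summed over the classes.
    return -sum(y * math.log(p) for y, p in zip(y_true, p_pred))

# True class is the second one (one-hot target), predicted with 80% confidence.
y_true = [0.0, 1.0, 0.0]
p_pred = [0.1, 0.8, 0.1]
print(round(cross_entropy(y_true, p_pred), 3))   # -log(0.8) ~ 0.223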


The Imbalance Problem
Standard cross entropy loss function: for the true class predicted with probability p, the loss is −log(p). A prediction the network is already fairly sure about (e.g. p = 0.8) still contributes −log(0.8) ≈ 0.22, and an easy background cell with p = 0.99 still contributes ≈ 0.01. With tens of thousands of easy background boxes per image, these many small losses add up and can dominate the loss from the few cells that actually contain objects.
RetinaNet: Solving the Imbalance
Problem with Focal Loss

Lin et al. (2017) scale the cross entropy loss so that the easy examples the network is already very sure about contribute less to the loss, so that learning can focus on the few interesting cases:

FL(p_t) = −(1 − p_t)^γ log(p_t)

where p_t is the predicted probability of the true class. γ = 2 seems to work best.
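A minimal sketch of focal loss for a single prediction, where p_t is the predicted probability of the true class (following the form in Lin et al. 2017, without the optional alpha weighting):

import math

def cross_entropy(p_t):
    return -math.log(p_t)

def focal_loss(p_t, gamma=2.0):
    # Down-weight examples the network is already confident about.
    return -((1.0 - p_t) ** gamma) * math.log(p_t)

# Easy examples contribute almost nothing; hard examples still matter.
for p_t in (0.99, 0.8, 0.5):
    print(p_t, round(cross_entropy(p_t), 3), round(focal_loss(p_t), 4))
# 0.99: CE ~0.010, FL ~0.000
# 0.80: CE ~0.223, FL ~0.009
# 0.50: CE ~0.693, FL ~0.173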
Final Notes on RetinaNet
With focal loss, when the network is pretty sure about a prediction, the loss is now significantly lower.

In our previous example of 80% certainty, the cross entropy loss had a value of ~0.22; with γ = 2 the focal loss scales this by (1 − 0.8)² = 0.04, giving a value of only ~0.01.

For predictions the network is not so sure about, the loss is reduced by a much smaller factor.

With this powerful improvement, single-forward-pass methods are able to compete with two-stage methods in accuracy while easily beating them with respect to speed.

This opens many new possibilities for accurate real-time object detection
even on embedded systems.
