
Object Detection and

Localization

Academic session 2022/2023


What Is Object Detection and
Localization

For object detection, we need to classify the objects in an image and also find their bounding boxes (i.e. what is in the image and where the objects are).

Localization task: find a pre-defined object in an image. Note that localization itself is not concerned with object recognition, whose task is to identify an object as belonging to one of N classes.

1. Training phase: produce bounding boxes on the training images to generate the pose coordinates of each object in the scene.
2. Test phase: simultaneously detect and localize each object present in the image.
Evolution of Object Detection
with Deep Learning Networks

[Timeline figure: detection performance over time, from R-CNN (CVPR 2014) and SPP-Net, through Fast R-CNN (ICCV 2015), Faster R-CNN (NIPS 2015), YOLO, SSD (ECCV 2016) and R-FCN (NIPS 2016), to RetinaNet (2017), with possible future directions such as Capsule Networks, Transformers, and GPT-3-era models (towards 2020).]
Approaches to Object Detection

In general, there are two different approaches to object detection:

1. Make a fixed number of predictions on a grid (one stage), or
2. Leverage a proposal network to find objects and then use a second network to fine-tune these proposals and output a final prediction (two stage).
The Task of Object Detection
The goal is to detect instances of a predefined set of
object classes (e.g. {people, cars, bikes, animals}) and
describe the locations of each detected object in the
image using a bounding box

Images from PASCAL VOC data set


Why Object Detection is Tricky
An object classifier does this:

While an object detector does this:


Main Feature and Limitation

In the one-stage approach, there is no intermediate task which must be performed in order to produce an output. This leads to a simpler and faster model architecture.

However, one-stage detectors can sometimes struggle to be flexible enough to adapt to arbitrary tasks; extensive retraining may be necessary.
Predictions on a Grid:
Start with a Backbone Network

To describe what's in an image, feed the input through a standard convolutional network to build a rich feature representation of the original image.

This is the "backbone" network, which is usually pre-trained as an image classifier to learn how to extract features from an image.
Recall our CNN:

[Figure: a standard CNN, with feature-extraction (convolutional) layers followed by classification layers. The feature-extraction part is our backbone network.]
Predictions on a Grid:
Use a Backbone Network
A very large labelled dataset (such as ImageNet) can be used to train
the backbone network in order to learn good feature representations.
Predictions on a Grid:
The Backbone Network Output
After pre-training, remove the last few layers of the network. The backbone network now outputs a collection of stacked feature maps which describe the original image at a low spatial resolution but high feature (channel) resolution (7x7x512 in this network).
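A minimal sketch of this in PyTorch, assuming a VGG-16 backbone from a recent torchvision version (the choice of backbone is illustrative, not mandated by any particular detector):

import torch
import torchvision

# Load a classifier pre-trained on ImageNet and keep only its
# convolutional feature extractor (drop the classification layers).
vgg = torchvision.models.vgg16(weights="IMAGENET1K_V1")
backbone = vgg.features          # the "backbone": convolutional layers only

x = torch.randn(1, 3, 224, 224)  # a dummy 224x224 RGB input image
feature_maps = backbone(x)
print(feature_maps.shape)        # torch.Size([1, 512, 7, 7]): 7x7x512 features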
Predictions on a Grid:
Relating Back to the Original Image
Predictions on a Grid:
Object is at the Centre Cell
Objects are roughly located in the coarse (7x7) feature maps at the cell
containing the centre of the bounding box annotation. This grid cell is
"responsible" for detecting that specific object.
Predictions on a Grid:
How to Detect Centre Cell
In order to detect the object, add another convolutional layer and learn the kernel parameters which combine the context of all 512 feature maps to produce an activation corresponding to the grid cell which contains the object.
Predictions on a Grid:
Multiple Activations

If the input image contains multiple objects, we should have multiple activations on our grid, denoting that an object is in each of the activated regions.
Predictions on a Grid:
How to Describe a Detected Object

An object cannot be sufficiently described with a single


activation. We need:

1. The likelihood that a grid cell contains an object (Pobj)


2. Bounding box descriptors x,y,width,height (tx, ty, tw, th)
3. Which class the object belongs to (C1, C2, … Cn)

Thus, a convolution filter is needed for each of the above


attributes such that we produce 5+C output channels to
describe a single bounding box at each grid cell location.
Predictions on a Grid:
How to Describe Detected Object

5+C convolutional filters
produce one bounding box
descriptor for each grid cell
Predictions on a Grid:
Multiple Objects on the Same Cell

Images might have multiple objects which "belong" to the same grid cell.
We can alter the layer to produce B(5+C) filters such that we can predict B
bounding boxes for each grid cell location
Predictions on a Grid:
Multiple Objects on the Same Cell
The model will always produce a fixed number of N×N×B predictions for a
given image. We then filter the predictions to only consider bounding boxes
with a probability above some defined threshold.
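A sketch of such a detection head in PyTorch (the 7x7 grid, B = 2 boxes per cell, C = 20 classes, and the 0.5 threshold are illustrative values, not a specific published architecture):

import torch
import torch.nn as nn

C = 20   # number of object classes (e.g. PASCAL VOC)
B = 2    # bounding boxes predicted per grid cell

# A 1x1 convolution over the 512 backbone feature maps gives every grid
# cell B * (5 + C) outputs: B x (P_obj, tx, ty, tw, th, C class scores).
detection_head = nn.Conv2d(512, B * (5 + C), kernel_size=1)

features = torch.randn(1, 512, 7, 7)      # backbone output
preds = detection_head(features)          # shape [1, B*(5+C), 7, 7]
preds = preds.view(1, B, 5 + C, 7, 7)     # one descriptor per box per cell

# Keep only boxes whose objectness probability exceeds a threshold.
obj_prob = torch.sigmoid(preds[:, :, 0])  # P_obj for every box and cell
keep = obj_prob > 0.5
print(keep.sum().item(), "candidate boxes kept")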
Predictions on a Grid:
Multiple Objects in Parallel
Multiple objects can be detected in parallel.
However, a large number of grid cells contain no object and this introduces a
large imbalance between the predicted bounding boxes which contain an object
and those which do not contain an object.
Predictions on a Grid:
Non-maximum Suppression
The approach thus far produces a fixed number of bounding box predictions for each image. BUT we would like to output bounding boxes only for objects that are actually likely to be in the image.

Non-max suppression is applied to each class separately. The goal is to remove redundant predictions.
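A minimal greedy NMS sketch in NumPy for a single class (box format [x1, y1, x2, y2]; the 0.5 IoU threshold is an illustrative default):

import numpy as np

def iou(box, boxes):
    # IoU between one box and an array of boxes, all as [x1, y1, x2, y2].
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter)

def nms(boxes, scores, iou_threshold=0.5):
    # Keep the highest-scoring box, drop boxes that overlap it too much, repeat.
    order = np.argsort(scores)[::-1]
    keep = []
    while len(order) > 0:
        best = order[0]
        keep.append(int(best))
        rest = order[1:]
        if len(rest) == 0:
            break
        overlaps = iou(boxes[best], boxes[rest])
        order = rest[overlaps < iou_threshold]
    return keep

# Two overlapping detections of the same object plus one separate detection.
boxes = np.array([[10, 10, 60, 60], [12, 12, 58, 62], [100, 100, 150, 150]], dtype=float)
scores = np.array([0.9, 0.8, 0.7])
print(nms(boxes, scores))   # -> [0, 2]: the weaker duplicate is suppressed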
Region Based Methods v.
One-Stage Methods

Region-proposal methods produce state-of-the-art results but are often too computationally expensive for real-time object detection, especially on embedded systems.

YOLO (You Only Look Once, Redmon et al. 2015) and SSD (Single Shot MultiBox Detector, Liu et al. 2015) solve this problem by predicting bounding box coordinates and probabilities for different categories in a single forward pass through the network. They are optimized for speed at the cost of accuracy.

Lin et al. (2017) explained why methods like SSD are less accurate than two-stage methods and proposed to address the problem by rescaling the loss function. The improvement, implemented as RetinaNet, means that single-shot methods are faster and now as accurate as two-stage methods.
YOLO: You-Only-Look-Once &
SSD: Single Shot MultiBox Detector

The idea is to divide the image into a grid of cells, then change the labels of the data such that we perform both localization and classification for each grid cell.
How to Get the Bounding Box
One of the things that may be difficult to understand at first is how the detection system converts the grid cells into an actual bounding box that fits the object.
Strategies to Define the
Bounding Box

Object detector strategies:

1. YOLO: uses a single activation map for the prediction of classes and bounding boxes
2. SSD: uses different activation maps (multiple scales) for the prediction of classes and bounding boxes

YOLO works similarly to SSD, with the difference that it uses fully connected layers at the end of the network instead of only convolutional layers. SSD seems superior.
The SSD Detector
An input image is passed through the truncated backbone network.
In this example, three more convolutions create three feature maps at
the top of the network with the shapes [256, 4, 4] (blue), [256, 2, 2]
(yellow), and finally [256, 1, 1] (green):
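A possible sketch of these extra layers in PyTorch (the [256, 8, 8] input shape and channel counts are assumptions chosen so that the shapes above work out; this is not the exact SSD configuration):

import torch
import torch.nn as nn

# Three stride-2 convolutions, each halving the spatial resolution.
extra_layers = nn.ModuleList([
    nn.Conv2d(256, 256, kernel_size=3, stride=2, padding=1),   # -> [256, 4, 4]
    nn.Conv2d(256, 256, kernel_size=3, stride=2, padding=1),   # -> [256, 2, 2]
    nn.Conv2d(256, 256, kernel_size=3, stride=2, padding=1),   # -> [256, 1, 1]
])

x = torch.randn(1, 256, 8, 8)        # assumed output of the truncated backbone
feature_maps = []
for layer in extra_layers:
    x = torch.relu(layer(x))
    feature_maps.append(x)           # multi-scale maps used for prediction

print([tuple(f.shape[1:]) for f in feature_maps])
# [(256, 4, 4), (256, 2, 2), (256, 1, 1)]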
SSD Receptive Field in the Last
Layer
The activations in the final layer have dependencies
on all activations in the previous layers and the
receptive field is thus the entire input image.
SSD Receptive Field in Other
Layers
Note that the receptive field of an activation in the
yellow layer is only one quarter of the input image
SSD Anchor Boxes
This leads to anchor or default boxes. Every default box needs n values that represent the probabilities that a certain class was detected in that box, plus 4 values that are no longer absolute coordinates of the predicted bounding box but rather offsets to the respective default box.

The important idea is that we do for every default box in the differently sized grids what we
did when predicting one single object in an image.
SSD Summary
In Summary:

1. We defined several grids of differently sized default boxes that will allow us to detect
objects at different scales in one single forward pass.

2. For each default box in every grid, the network outputs n class probabilities and 4
offsets to the respective default box coordinates which give the predicted bounding
box coordinates.
Matching Bounding Boxes:
Compare Predicted with Default

We want to match the ground truth bounding box to a default box that is "as similar" to it as possible. Two boxes are similar when they overlap as much as possible while having as little non-overlapping area as possible. This is measured by the Jaccard index, or IoU (Intersection over Union):

IoU(A, B) = area(A ∩ B) / area(A ∪ B)

The idea is that we want to compare the ground truth bounding boxes in the training
example to predictions made by default boxes that are already very similar to the ground
truth bounding boxes.
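A minimal sketch of this matching step in plain Python (boxes as [x1, y1, x2, y2]; the coordinates below are made-up illustrative values):

def iou(a, b):
    # Intersection over Union of two boxes given as [x1, y1, x2, y2].
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

# Match a ground-truth box to the default box it overlaps most.
ground_truth = [50, 60, 200, 220]
default_boxes = [[0, 0, 150, 150], [40, 40, 210, 230], [100, 100, 300, 300]]

best = max(range(len(default_boxes)),
           key=lambda i: iou(ground_truth, default_boxes[i]))
print(best, round(iou(ground_truth, default_boxes[best]), 2))   # 1  0.74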
One More Thing

For every anchor, SSD defines k default boxes of different aspect ratios and sizes instead of just one (here we just presented one, k = 1).

This means that we also need 4 × k predicted offsets to the respective default box coordinates instead of 4, and n × k class probabilities instead of n.

More boxes of different sizes and aspect ratios = better object detection. This detail is not too important for a general understanding of SSD, but it is important for implementation.

(See Fig. 1 in Liu et al. 2015 for an illustration of this.)
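A sketch of how k default boxes per grid cell might be generated (the scale and aspect ratios here are illustrative choices, not the exact values used in the SSD paper):

import itertools

def default_boxes(grid_size, scale, aspect_ratios=(1.0, 2.0, 0.5)):
    # Returns (cx, cy, w, h) boxes, normalised to [0, 1], with
    # k = len(aspect_ratios) boxes centred on every grid cell.
    boxes = []
    for i, j in itertools.product(range(grid_size), repeat=2):
        cx, cy = (j + 0.5) / grid_size, (i + 0.5) / grid_size
        for ar in aspect_ratios:
            w, h = scale * ar ** 0.5, scale / ar ** 0.5
            boxes.append((cx, cy, w, h))
    return boxes

# k = 3 boxes per cell on the 4x4 grid -> 48 default boxes at this scale.
print(len(default_boxes(grid_size=4, scale=0.3)))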


A Remaining Problem
SSD and YOLO need only one forward pass through the network to predict object bounding boxes and class probabilities.

SSD uses multi-scale convolutional feature maps at the top of the network, while YOLO uses fully connected layers. In general, SSD is faster and more accurate than YOLO.

Only remaining problem: region proposal methods such as R-CNN are more accurate.

Recall: methods like SSD or YOLO suffer from an extreme class imbalance: the detectors evaluate roughly between ten thousand and a hundred thousand candidate locations (far more than the 4×4 + 2×2 + 1 default boxes in the previous example shown here).
Cross Entropy Loss Function
Entropy is a measure of the uncertainty associated with a given distribution.

Cross-entropy is used as a loss function to measure the difference between the predicted probability distribution of a model and the true probability distribution of the data. It is used in classification problems.

The standard cross entropy loss function for multi-class classification problems is calculated by summing the individual losses for each class:

CE = −Σ_i y_i log(p_i)

where y_i is the true probability and p_i is the predicted probability for class i.
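A tiny worked example in Python (a three-class problem with made-up probabilities):

import math

def cross_entropy(y_true, p_pred):
    # CE = -sum_i y_i * log(p_i), summed over the classes.
    return -sum(y * math.log(p) for y, p in zip(y_true, p_pred))

# True class is the second one (one-hot target), predicted with 80% confidence.
y_true = [0.0, 1.0, 0.0]
p_pred = [0.1, 0.8, 0.1]
print(round(cross_entropy(y_true, p_pred), 3))   # -log(0.8) ~ 0.223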


The Imbalance Problem
Standard cross entropy loss function: for the true class predicted with probability p, the loss is −log(p). A prediction the network is already fairly sure about (e.g. p = 0.8) still contributes −log(0.8) ≈ 0.22, and an easy background cell with p = 0.99 still contributes ≈ 0.01. With tens of thousands of easy background boxes per image, these many small losses add up and can dominate the loss from the few cells that actually contain objects.
RetinaNet: Solving the Imbalance
Problem with Focal Loss

Lin et al. (2017) scale the cross entropy loss so that the easy examples the network is already very sure about contribute less to the loss, so that learning can focus on the few interesting cases:

FL(p_t) = −(1 − p_t)^γ log(p_t)

where p_t is the predicted probability of the true class. γ = 2 seems to work best.
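A minimal sketch of focal loss for a single prediction, where p_t is the predicted probability of the true class (following the form in Lin et al. 2017, without the optional alpha weighting):

import math

def cross_entropy(p_t):
    return -math.log(p_t)

def focal_loss(p_t, gamma=2.0):
    # Down-weight examples the network is already confident about.
    return -((1.0 - p_t) ** gamma) * math.log(p_t)

# Easy examples contribute almost nothing; hard examples still matter.
for p_t in (0.99, 0.8, 0.5):
    print(p_t, round(cross_entropy(p_t), 3), round(focal_loss(p_t), 4))
# 0.99: CE ~0.010, FL ~0.000
# 0.80: CE ~0.223, FL ~0.009
# 0.50: CE ~0.693, FL ~0.173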
Final Notes on RetinaNet
With focal loss, when the network is pretty sure about a prediction, the loss is now significantly lower.

In our previous example of 80% certainty, the cross entropy loss had a value of ~0.22; with γ = 2 the focal loss scales this by (1 − 0.8)² = 0.04, giving a value of only ~0.01.

For predictions the network is not so sure about, the loss is reduced by a much smaller factor.

With this powerful improvement, single-forward-pass methods are able to compete with two-stage methods in accuracy while easily beating them with respect to speed.

This opens many new possibilities for accurate real-time object detection
even on embedded systems.
