Unit 5a - Machine Vision
Chapter 10 from
Deep Learning Illustrated book
Biological Vision
Hubel and Wiesel discovered that the neurons
that receive visual input from the eye are in
general most responsive to simple, straight
edges at particular, specific orientations.
Fittingly, they named these cells simple neurons.
A large group of simple neurons together is able
to represent all 360 degrees of orientation.
These edge-orientation-detecting simple cells
then pass along information to a large number of
so-called complex neurons, which are capable of
detecting more complex shapes like a corner or
a curve.
Machine Vision
Convolutional Neural Networks
The studies of visual cortex inspired the
neocognitron, introduced in 1980, which
gradually evolved into what we now call
convolutional neural networks.
In 1998, Yann LeCun et al. introduced the
famous LeNet-5 architecture, widely used
by banks to recognize handwritten check
numbers.
It introduced two new building blocks:
convolutional layers and pooling layers.
Machine Vision
One of the advantages LeNet-5 had over the
neocognitron was a larger, high-quality set of training
data, MNIST.
The next breakthrough in neural networks was also
facilitated by a high-quality public dataset, this time
much larger.
The 14 million images in the ImageNet dataset are
spread across 22,000 categories as diverse as
container ships, leopards, starfish, and elderberries.
Starting in 2010, Fei-Fei Li ran an open challenge,
the ImageNet Large Scale Visual Recognition Challenge
(ILSVRC), on a subset of 1.4 million images
across 1,000 categories.
Machine Vision
In the first two years of the ILSVRC, all algorithms
came from traditional, feature-engineering-driven ML.
In the third year, all entrants except one were
traditional ML algorithms.
If that one deep learning model in 2012 had not been
developed or if its creators had not competed in
ILSVRC, the year-over-year improvement in image
classification accuracy would have been negligible.
Geoffrey Hinton’s group blew the competition
out of the water: AlexNet won by a wide margin, with a
top-5 error of 15.3% versus 26.2% for the runner-up.
Almost overnight, deep learning architectures moved
from the fringes of machine learning to its fore.
Machine Vision
Three principal factors enabled AlexNet to be the
state-of-the-art machine vision algorithm in 2012.
First is the training data: they artificially
expanded the already massive ImageNet dataset.
Second is processing power: they programmed
two GPUs to train on this large dataset with previously
unseen efficiency.
Third is architectural advances: AlexNet is deeper
than LeNet-5, and it took advantage of both a new
type of artificial neuron (ReLU) and a nifty trick
(dropout) that helps generalize deep learning models
beyond the data they’re trained on.
Convolution operation
Input (3 x 4):
a b c d
e f g h
i j k l
Kernel (2 x 2):
w x
y z
Output (2 x 3):
aw + bx + ey + fz    bw + cx + fy + gz    cw + dx + gy + hz
ew + fx + iy + jz    fw + gx + jy + kz    gw + hx + ky + lz
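The sliding-window arithmetic of this slide can be sketched in plain Python. This is a minimal illustration, assuming a stride of 1 and no padding; the function name `convolve2d_valid` and the numbers standing in for a..l and w..z are made up for the example. As in most deep learning libraries, the kernel is not flipped, so strictly speaking this computes a cross-correlation.

```python
def convolve2d_valid(image, kernel):
    """Slide `kernel` over `image` (stride 1, no padding) and return the output map."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    output = []
    for r in range(out_h):
        row = []
        for c in range(out_w):
            # Weighted sum of the kh x kw patch whose top-left corner is (r, c).
            total = 0
            for i in range(kh):
                for j in range(kw):
                    total += image[r + i][c + j] * kernel[i][j]
            row.append(total)
        output.append(row)
    return output


# Hypothetical values for the 3 x 4 input (a..l) and the 2 x 2 kernel (w, x, y, z).
image = [[1, 2, 3, 4],
         [5, 6, 7, 8],
         [9, 10, 11, 12]]
kernel = [[1, 2],
          [3, 4]]

result = convolve2d_valid(image, kernel)
print(result)  # [[44, 54, 64], [84, 94, 104]]
```

The top-left output, 44, is aw + bx + ey + fz = 1*1 + 2*2 + 5*3 + 6*4, matching the first entry of the output grid.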
Convolutional Neural Networks
Convolution leverages three important
ideas that help improve machine learning
systems:
1. Sparse interactions
2. Parameter sharing
3. Equivariant representations
CNNs take advantage of spatial
information: they learn local patterns that
are translation-invariant, and spatial
hierarchies of these patterns.
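Parameter sharing and equivariance can be illustrated with a small one-dimensional sketch. It assumes a hypothetical two-weight kernel and a toy signal; the same two weights are reused at every position, and moving the pattern in the input moves the response in the output by the same amount.

```python
def convolve1d_valid(signal, kernel):
    """Apply the same kernel weights at every position (stride 1, no padding)."""
    k = len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k))
            for i in range(len(signal) - k + 1)]


kernel = [1, -1]                 # hypothetical difference (edge) detector
signal = [0, 0, 5, 5, 0, 0, 0]   # toy input containing one "bump"
shifted = [0] + signal[:-1]      # the same bump moved one step to the right

out = convolve1d_valid(signal, kernel)
out_shifted = convolve1d_valid(shifted, kernel)

print(out)          # [0, -5, 0, 5, 0, 0]
print(out_shifted)  # [0, 0, -5, 0, 5, 0]
```

The detector fires on the same edges in both cases; only the location of the response changes.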
Pooling Layers
The pooling function replaces the output of
the net at a certain location with a
summary statistic of the nearby outputs
Max pooling reports the maximum output
within a rectangular neighborhood
Average pooling reports the average output
Pooling helps make the representation
approximately invariant to small input
translations.
Max pooling is the most commonly used type
and tends to perform better in practice.
Pooling Layers
Typically, a pooling layer has a filter size of 2 x 2
and a stride length of 2.
In this case, at each position the pooling layer
evaluates four activations and retains only the
maximum value, thereby downsampling the
activations by a factor of 4.
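A minimal sketch of 2 x 2 max pooling with a stride of 2, in plain Python. The function name `max_pool2d` and the activation values are hypothetical; the 4 x 4 input (16 activations) is reduced to a 2 x 2 output (4 activations).

```python
def max_pool2d(activations, size=2, stride=2):
    """Keep only the maximum of each size x size window."""
    pooled = []
    for r in range(0, len(activations) - size + 1, stride):
        row = []
        for c in range(0, len(activations[0]) - size + 1, stride):
            window = [activations[r + i][c + j]
                      for i in range(size) for j in range(size)]
            row.append(max(window))
        pooled.append(row)
    return pooled


# Hypothetical 4 x 4 activation map.
activations = [[1, 3, 2, 0],
               [4, 2, 1, 5],
               [0, 1, 7, 2],
               [6, 2, 3, 4]]

pooled = max_pool2d(activations)
print(pooled)  # [[4, 5], [6, 7]]
```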
An alternative to pooling for reducing
computational complexity is to use a
convolutional layer with a larger stride; some
networks tend to perform better this way,
without pooling layers.
Convolutional Filter Hyperparameters
Kernel size
Padding
Stride length
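How these three hyperparameters determine the size of the output map can be sketched with the standard formula floor((n + 2p - k) / s) + 1, applied per spatial dimension. The helper name `conv_output_size` is hypothetical, and padding is assumed to be the number of pixels added to each side.

```python
def conv_output_size(input_size, kernel_size, stride=1, padding=0):
    """Output length along one spatial dimension of a convolution or pooling layer."""
    return (input_size + 2 * padding - kernel_size) // stride + 1


# 28-pixel input, 5 x 5 kernel, no padding: the map shrinks.
print(conv_output_size(28, 5))                  # 24
# 3 x 3 kernel with one pixel of padding keeps the size unchanged.
print(conv_output_size(28, 3, padding=1))       # 28
# 227-pixel input, 11 x 11 kernel, stride 4 (AlexNet's first layer, no padding assumed).
print(conv_output_size(227, 11, stride=4))      # 55
# 2 x 2 pooling with stride 2 halves each dimension.
print(conv_output_size(24, 2, stride=2))        # 12
```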
Why CNNs Are Popular in Machine Vision
1. They allow deep learning models to learn to
recognize features in a position-invariant
manner.
2. They remain faithful to the two-dimensional
structure of images, allowing features to be
identified within their spatial context.
3. They significantly reduce the number of
parameters required for modeling image data,
yielding higher computational efficiency.
4. Ultimately, they perform machine vision tasks
(e.g., image classification) more accurately.
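The parameter savings in point 3 can be made concrete with a little arithmetic. The sketch below assumes a 28 x 28 single-channel image and a layer with 64 units or 64 filters; these sizes and the helper names are chosen only for illustration.

```python
def dense_params(num_inputs, num_units):
    """One weight per input per unit, plus one bias per unit."""
    return num_inputs * num_units + num_units


def conv_params(kernel_h, kernel_w, in_channels, num_filters):
    """One small kernel per filter, shared across all positions, plus one bias per filter."""
    return kernel_h * kernel_w * in_channels * num_filters + num_filters


# Dense layer on a flattened 28 x 28 x 1 image versus a 3 x 3 convolutional layer.
print(dense_params(28 * 28 * 1, 64))  # 50240
print(conv_params(3, 3, 1, 64))       # 640
```

The convolutional layer's count does not depend on the image size, because the same kernel weights are reused at every position.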
LeNet-5 architecture
The LeNet-5 architecture was created by Yann LeCun in
1998 and has been widely used for handwritten digit
recognition (MNIST).
It introduced most of the elements in modern CNNs.
AlexNet (2012) architecture
We moved beyond monochromatic MNIST into full
colour images of larger size 227x227.
AlexNet used larger filter sizes in the earliest
convolutional layers relative to what is popular
today—for example, kernel_size =(11, 11).
It used a larger stride of 4 in the first convolutional layer.
It used ReLU activations rather than tanh.
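A small sketch comparing ReLU with tanh. The slide only states that AlexNet used ReLU; one commonly cited motivation, assumed here, is that tanh saturates for large inputs (its slope approaches zero) while ReLU keeps a slope of 1 for any positive input. The helper names are hypothetical.

```python
import math


def relu(x):
    """Rectified linear unit: pass positive values through, clip negatives to zero."""
    return max(0.0, x)


def relu_grad(x):
    """Slope of ReLU: 1 for positive inputs, 0 otherwise."""
    return 1.0 if x > 0 else 0.0


def tanh_grad(x):
    """Slope of tanh: 1 - tanh(x)^2, which shrinks quickly as |x| grows."""
    return 1.0 - math.tanh(x) ** 2


for x in (0.5, 2.0, 5.0):
    print(x, relu(x), relu_grad(x), round(tanh_grad(x), 6))
```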
VGG-16 architecture
VGGNet follows the same repeated
conv-pool-block structure as AlexNet;
it simply has more of these blocks, with
smaller (all 3x3) kernel sizes.
Vanishing Gradients: The Bête Noire of
Deep CNNs
With more layers, models are able to learn
a larger variety of relatively low-level
features in the early layers, and
increasingly complex abstractions are
made possible in the later layers via
nonlinear recombination.
This approach, however, has limits: If we
continue to simply make our networks
deeper, they will eventually be debilitated
by the vanishing gradient problem.
Vanishing Gradients: The Bête Noire of
Deep CNNs
The basic issue of vanishing gradients is that
parameters in early layers of the network are far
away from the cost function: the source of the
gradient that is propagated backward through the
network.
As the error is backpropagated, each layer closer
to the input gets a smaller and smaller update.
The net effect is that early layers in increasingly
deep networks become more difficult to train.
ResNet introduced an innovative solution to this
problem – skip (or residual) connections.
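The shrinking updates can be illustrated with a deliberately simplified calculation. It assumes that backpropagating through each layer multiplies the gradient by a constant factor of 0.25 (the maximum slope of a sigmoid activation) and ignores the weights entirely; real networks are more complicated, so treat this only as a sketch of the trend.

```python
def gradient_scale(num_layers, per_layer_factor=0.25):
    """Factor by which the gradient is scaled after passing back through num_layers layers."""
    return per_layer_factor ** num_layers


# The further a layer is from the cost function, the smaller its update.
for depth in (1, 2, 5, 10, 20):
    print(depth, gradient_scale(depth))
```

After ten such layers the gradient is already below one millionth of its original size, which is why the earliest layers of a very deep network learn so slowly.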
Residual Networks
ResNet, the Residual Network from Microsoft
Research, won the ILSVRC 2015 challenge,
delivering a top-5 error rate under 3.6%.
Residual Networks
The residual modules either learn something
useful and contribute to reducing the error of
the network, or they perform identity mapping
and do nothing at all.
This neutral-or-better characteristic of residual
networks mitigates the issue of vanishing gradients.
When several residual modules are stacked, the
later residual modules receive inputs that are
increasingly complex combinations of the
residual modules and skip connections from
earlier in the network.
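The "learn something useful or do nothing" behaviour follows from the form of a residual module, output = F(x) + x. A minimal sketch in plain Python, where the two transform functions are hypothetical stand-ins for the learned layers inside the module:

```python
def residual_module(x, transform):
    """Add the skip connection (x itself) to whatever the inner layers compute."""
    fx = transform(x)
    return [xi + fi for xi, fi in zip(x, fx)]


def learned_nothing(x):
    # Inner layers that output zeros: the module reduces to an identity mapping.
    return [0.0 for _ in x]


def learned_something(x):
    # Hypothetical useful transformation learned by the inner layers.
    return [0.5 * xi for xi in x]


x = [1.0, -2.0, 3.0]
identity_out = residual_module(x, learned_nothing)
useful_out = residual_module(x, learned_something)

print(identity_out)  # [1.0, -2.0, 3.0]
print(useful_out)    # [1.5, -3.0, 4.5]
```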
Residual Networks
Residual connections get more out of a network
than earlier designs could: they enable much deeper
architectures without the drop in performance that
extra layers would otherwise cause when they fail
to learn useful information about the problem.
In 2015, ResNet took first place not only in the
ILSVRC image-classification competition but in
the object detection and image segmentation
categories, too.
Applications of Machine Vision
Object Detection – R-CNN
R-CNN was proposed in 2013 by Ross
Girshick and his colleagues at UC
Berkeley.
The algorithm was modeled on the
attention mechanism of the human brain,
wherein an entire scene is scanned and
focus is placed on specific regions of
interest.
Object Detection – R-CNN
To emulate this attention, R-CNNs:
1. Perform a selective search for regions of
interest (ROIs) within the image.
2. Extract features from these ROIs by using
a CNN.
3. Combine two “traditional” ML approaches
(linear regression and SVMs) to refine the
locations of bounding boxes and classify
objects within each of those boxes.
[Elgendy 20] Deep Learning for Vision Systems
Object Detection – R-CNN
R-CNNs achieved a massive gain in performance
over the previous best model in the Pattern
Analysis, Statistical Modeling and Computational
Learning (PASCAL) Visual Object Classes (VOC)
competition.
However, they have some limitations:
They are inflexible: the input size is fixed to a single
specific image shape.
They are slow and computationally expensive: both
training and inference are multistage processes
involving CNNs, linear regression models, and SVMs.
Object Detection – Fast R-CNN
To address the primary drawback of R-CNN, its
speed, Girshick went on to develop Fast R-CNN.
The chief innovation here was the realization that
during step 2 of the R-CNN algorithm, the CNN
was unnecessarily being run multiple times, once
for each region of interest.
With Fast R-CNN, the ROI search (step 1) is run
as before, but during step 2, the CNN is given a
single global look at the image, and the extracted
features are used for all ROIs simultaneously.
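The saving can be sketched with a toy stand-in for the CNN that simply counts how often it is run. Everything here is hypothetical: the "image" is a one-dimensional list, ROIs are (start, end) index pairs, and the "features" are a trivial per-pixel computation. Only the number of forward passes is the point.

```python
class CountingCNN:
    """Stand-in for a CNN backbone that records how many times it is run."""

    def __init__(self):
        self.forward_passes = 0

    def extract_features(self, pixels):
        self.forward_passes += 1
        return [2 * p for p in pixels]  # placeholder "features"


def rcnn_features(image, rois, cnn):
    # R-CNN style: crop each ROI and run the CNN once per crop.
    return [cnn.extract_features(image[start:end]) for start, end in rois]


def fast_rcnn_features(image, rois, cnn):
    # Fast R-CNN style: one pass over the whole image, then slice per ROI.
    feature_map = cnn.extract_features(image)
    return [feature_map[start:end] for start, end in rois]


image = list(range(10))
rois = [(0, 4), (2, 6), (5, 10)]

slow_cnn, fast_cnn = CountingCNN(), CountingCNN()
slow = rcnn_features(image, rois, slow_cnn)
fast = fast_rcnn_features(image, rois, fast_cnn)

print(slow_cnn.forward_passes)  # 3, one per ROI
print(fast_cnn.forward_passes)  # 1, regardless of the number of ROIs
```

In this toy setting both approaches yield the same per-ROI features; with thousands of proposed regions per image, the difference in CNN passes is what makes Fast R-CNN faster.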
Object Detection – Fast R-CNN
A vector of features is extracted from the final
layer of the CNN, which (for step 3) is then fed
into a dense network along with the ROI.
This dense net learns to focus on only the
features that apply to each individual ROI,
culminating in two outputs per ROI:
1. A softmax probability output over the
classification categories (for a prediction of
what class the detected object belongs to)
2. A bounding box regressor (for refinement of
the ROI’s location)
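The first of these outputs can be sketched as a softmax over per-class scores. The scores and class names below are hypothetical; the point is only that softmax turns arbitrary scores into probabilities that sum to 1, from which the predicted class is read off.

```python
import math


def softmax(scores):
    """Convert raw class scores into probabilities that sum to 1."""
    largest = max(scores)
    exps = [math.exp(s - largest) for s in scores]  # subtract max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


classes = ["dog", "cat", "background"]  # hypothetical categories
scores = [2.0, 1.0, 0.1]                # hypothetical scores for one ROI

probs = softmax(scores)
predicted = classes[probs.index(max(probs))]
print(predicted, [round(p, 3) for p in probs])
```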
Object Detection – Faster R-CNN
Object Detection - YOLO
The R-CNN approach focused on the individual
proposed ROIs as opposed to the whole input
image.
Joseph Redmon and coworkers published
You Only Look Once (YOLO) in 2015, which
bucked this trend.
YOLO begins with a pretrained CNN for feature
extraction.
Next, the image is divided into a series of cells,
and, for each cell, a number of bounding boxes
and object-classification probabilities are
predicted.
Object Detection - YOLO
Bounding boxes with class probabilities above a
threshold value are selected, and these combine
to locate an object within an image.
We can think of the YOLO method as
aggregating many smaller bounding boxes, but
only if they have a reasonably good probability of
containing any given object class.
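The selection step can be sketched as a simple filter over per-cell predictions. The data structure, field names, coordinates, and the 0.5 threshold below are all hypothetical, and this covers only the thresholding described above, not the full YOLO post-processing.

```python
# Hypothetical predictions: one candidate box per entry, with its best class and probability.
detections = [
    {"cell": (0, 0), "box": (0.10, 0.10, 0.30, 0.35), "label": "dog", "prob": 0.92},
    {"cell": (0, 1), "box": (0.40, 0.05, 0.55, 0.20), "label": "cat", "prob": 0.15},
    {"cell": (1, 0), "box": (0.05, 0.50, 0.45, 0.90), "label": "bicycle", "prob": 0.61},
    {"cell": (1, 1), "box": (0.60, 0.60, 0.80, 0.85), "label": "dog", "prob": 0.40},
]


def select_boxes(detections, threshold=0.5):
    """Keep only boxes whose class probability is above the threshold."""
    return [d for d in detections if d["prob"] > threshold]


kept = select_boxes(detections)
print([(d["label"], d["prob"]) for d in kept])  # [('dog', 0.92), ('bicycle', 0.61)]
```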
The algorithm improved on the speed of Faster
R-CNN, but it struggled to accurately detect small
objects in an image.
Later versions of YOLO are considerably more accurate
than R-CNNs.
Object Detection
The R-CNN family of networks has three main variations:
R-CNN, Fast R-CNN, and Faster R-CNN.
R-CNN and Fast R-CNN use a selective search algorithm to
propose RoIs, whereas Faster R-CNN is an end-to-end DL
system that uses a region proposal network to propose RoIs.
The R-CNN approach is a multi-stage detector: it separates
predicting the objectness score of the bounding box and
predicting the object class into two different stages.
YOLO models are single-stage detectors: the image is passed
once through the network to predict both the objectness
score and the object class.
In general, single-stage detectors (such as early versions
of YOLO) tend to be less accurate than two-stage
detectors but are significantly faster.
Image Segmentation
Image Segmentation - Mask R-CNN
Mask R-CNN was developed by Facebook AI Research
(FAIR) in 2017. This approach involves:
1. Using the existing Faster R-CNN architecture to propose
ROIs within the image that are likely to contain objects.
2. An ROI classifier predicting what kind of object exists in
the bounding box while also refining the location and
size of the bounding box.
3. Using the bounding box to grab the parts of the feature
maps from the underlying CNN that correspond to that
part of the image.
4. Feeding the feature maps for each ROI into a fully
convolutional network that outputs a mask indicating
which pixels correspond to the object in the image.
Image Segmentation - Mask R-CNN
Image segmentation problems require binary
masks as labels for training.
These consist of arrays of the same dimensions
as the original image.
However, instead of RGB pixel values, they
contain 1s and 0s indicating which pixels
belong to an object.
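A minimal sketch of such a label, assuming a hypothetical 4 x 6 image containing a single rectangular object. The helper `rectangle_mask` and the coordinates are made up for illustration; real masks follow the object's outline rather than a rectangle.

```python
def rectangle_mask(height, width, top, left, bottom, right):
    """Binary mask with the image's height and width: 1 inside the object, 0 elsewhere."""
    return [[1 if r in range(top, bottom) and c in range(left, right) else 0
             for c in range(width)]
            for r in range(height)]


mask = rectangle_mask(height=4, width=6, top=1, left=2, bottom=3, right=5)
for row in mask:
    print(row)
# [0, 0, 0, 0, 0, 0]
# [0, 0, 1, 1, 1, 0]
# [0, 0, 1, 1, 1, 0]
# [0, 0, 0, 0, 0, 0]
```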
Image Segmentation – U-Net
The U-Net model consists of a fully convolutional
architecture, which begins with a contracting path
that produces successively smaller and deeper
activation maps through multiple convolution and
max-pooling steps.
Subsequently, an expanding path restores these
deep activation maps back to full resolution
through multiple upsampling and convolution
steps.
These two paths—the contracting and expanding
paths—are symmetrical (forming a “U” shape).
Image Segmentation – U-Net
The contracting path serves to allow the model to
learn high-resolution features from the image.
These high-res features are handed directly to
the expanding path (through skip connections).
After concatenating the feature maps from the
contracting path onto the expanding path, a
subsequent convolutional layer allows the
network to learn to assemble and localize these
features precisely.
The final result is a network that is highly adept
both at identifying and locating those features.
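The symmetry of the two paths can be sketched by tracing spatial resolutions only. This assumes each max-pooling step halves the resolution and each upsampling step doubles it; the input size of 128 and depth of 3 are hypothetical, and channel counts are ignored.

```python
def unet_resolutions(input_size, depth):
    """Spatial size at each level of the contracting and expanding paths."""
    contracting = [input_size // (2 ** level) for level in range(depth + 1)]
    expanding = contracting[-2::-1]  # mirror image of the way down, excluding the bottom
    return contracting, expanding


contracting, expanding = unet_resolutions(input_size=128, depth=3)
print(contracting)  # [128, 64, 32, 16]
print(expanding)    # [32, 64, 128]

# Each skip connection joins a contracting-path map to the expanding-path map
# of the same resolution, which is what makes concatenation possible.
skip_pairs = list(zip(reversed(contracting[:-1]), expanding))
print(skip_pairs)   # [(32, 32), (64, 64), (128, 128)]
```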
Transfer Learning
Transfer learning takes advantage of the library of
existing visual elements contained within the
feature maps of a pretrained CNN and
repurposes them to become specialized in
identifying new classes of objects.
4 tools in our deep learning toolbox
Training from scratch on a small dataset
Transfer Learning
Now, we apply transfer learning with
VGG19 for hot dog detection.
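from tensorflow.keras.applications.vgg19 import VGG19

# Load VGG19 pretrained on ImageNet, without its dense classifier layers:
vgg19 = VGG19(include_top=False,
              weights='imagenet',
              input_shape=(224, 224, 3),
              pooling=None)

# Freeze all the layers in the base VGG19 model:
for layer in vgg19.layers:
    layer.trainable = False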
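from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, Flatten

# Start from the frozen VGG19 base:
model = Sequential()
model.add(vgg19)

# Add the custom layers atop the VGG19 model:
model.add(Flatten(name='flattened'))
model.add(Dropout(0.5, name='dropout'))
model.add(Dense(2, activation='softmax', name='predictions'))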
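from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Augment the training images with random rotations and flips:
train_datagen = ImageDataGenerator(
    rescale=1.0/255,
    data_format='channels_last',
    rotation_range=30,
    horizontal_flip=True,
    fill_mode='reflect')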
# Validation images are only rescaled, not augmented:
valid_datagen = ImageDataGenerator(
    rescale=1.0/255,
    data_format='channels_last')

train_generator = train_datagen.flow_from_directory(
    directory='./hot-dog-not-hot-dog/train',
    target_size=(224, 224),
    classes=['hot_dog', 'not_hot_dog'],
    class_mode='categorical',
    batch_size=32,
    shuffle=True,
    seed=42)

valid_generator = valid_datagen.flow_from_directory(
    directory='./hot-dog-not-hot-dog/test',
    ... ...)
# Compile the model for training:
model.compile(optimizer='adam', loss='categorical_crossentropy',
              metrics=['accuracy'])
VGGNet
Residual networks
Object detection
R-CNN approaches (R-CNN, Fast R-CNN, Faster R-CNN)
YOLO
Image segmentation
Transfer Learning