Unit 5a - Machine Vision


Machine Vision

Chapter 10 of Deep Learning Illustrated
Biological Vision
 Hubel and Wiesel discovered that the neurons that receive visual input from the eye are, in general, most responsive to simple, straight edges at particular orientations.
 Fittingly, they named these cells simple cells.
 A large group of simple cells together is able to represent all 360 degrees of edge orientation.
 These edge-orientation-detecting simple cells then pass their information along to a large number of so-called complex cells, which are capable of detecting more complex shapes, such as corners or curves.
Convolutional Neural Networks
 Studies of the visual cortex inspired the neocognitron, introduced in 1980, which gradually evolved into what we now call convolutional neural networks.
 In 1998, Yann LeCun et al. introduced the
famous LeNet-5 architecture, widely used
by banks to recognize handwritten check
numbers.
 It introduced two new building blocks: convolutional layers and pooling layers.
Machine Vision
 One of the advantages LeNet-5 had over the
neocognitron was a larger, high-quality set of training
data, MNIST.
 The next breakthrough in neural networks was also
facilitated by a high-quality public dataset, this time
much larger.
 The 14 million images in the ImageNet dataset are
spread across 22,000 categories as diverse as
container ships, leopards, starfish, and elderberries.
 Since 2010, Fei-Fei Li has run an open challenge, the ImageNet Large Scale Visual Recognition Challenge (ILSVRC), on a subset of 1.4 million images across 1,000 categories.
Machine Vision
 In the first two years of the ILSVRC, all entrants came from traditional, feature-engineering-driven machine learning.
 In the third year, all entrants except one were traditional ML algorithms.
 If that one deep learning model in 2012 had not been developed, or if its creators had not competed in the ILSVRC, the year-over-year improvement in image-classification accuracy would have been negligible.
 Geoffrey Hinton's group blew the competition out of the water: AlexNet won with a top-5 error roughly 10 percentage points lower than the runner-up's.
 In an instant, deep learning architectures emerged from the fringes of machine learning to its fore.
Machine Vision
 Three principal factors enabled AlexNet to become the state-of-the-art machine vision algorithm in 2012.
 First is training data: the authors also artificially expanded the already massive ImageNet dataset through data augmentation.
 Second is processing power: they programmed two GPUs to train on their large dataset with previously unseen efficiency.
 Third is architectural advances: AlexNet is deeper than LeNet-5, and it took advantage of both a new type of artificial neuron (the ReLU) and a nifty trick (dropout) that helps generalize deep learning models beyond the data they're trained on.
Convolution operation

Input (3 x 4):          Kernel (2 x 2):
a b c d                 w x
e f g h                 y z
i j k l

Output (2 x 3):
aw + bx + ey + fz   bw + cx + fy + gz   cw + dx + gy + hz
ew + fx + iy + jz   fw + gx + jy + kz   gw + hx + ky + lz
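 A minimal NumPy sketch of the operation above (not from the book; like most deep learning libraries, it computes cross-correlation, which the field calls convolution):

import numpy as np

def conv2d_valid(image, kernel):
    # Slide the kernel over every position where it fully overlaps the
    # input; each output entry is the elementwise product-sum there.
    ih, iw = image.shape
    kh, kw = kernel.shape
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    for r in range(out.shape[0]):
        for c in range(out.shape[1]):
            out[r, c] = np.sum(image[r:r + kh, c:c + kw] * kernel)
    return out

# A 3 x 4 input and a 2 x 2 kernel yield a 2 x 3 output, as in the figure:
x = np.arange(12, dtype=float).reshape(3, 4)
k = np.array([[1., 0.], [0., -1.]])
print(conv2d_valid(x, k).shape)   # (2, 3)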
Convolutional Neural Networks
 Convolution leverages three important ideas that help improve machine learning systems:
1. Sparse interactions
2. Parameter sharing
3. Equivariant representations
 CNNs take advantage of spatial information:
   local patterns that are translation invariant,
   and spatial hierarchies of these patterns.
Pooling Layers
 The pooling function replaces the output of the net at a certain location with a summary statistic of the nearby outputs.
 Max pooling reports the maximum output within a rectangular neighborhood.
 Average pooling reports the average output.
 Pooling helps make the representation approximately invariant to small translations of the input.
 Max pooling is the most commonly used and generally performs better in practice.
Pooling Layers
 Typically, a pooling layer has a filter size of 2 x 2 and a stride length of 2.
 In this case, at each position the pooling layer evaluates four activations and retains only the maximum value, thereby downsampling the activations by a factor of 4 (see the sketch below).
 An alternative approach for reducing computational complexity is to use convolutional layers with a larger stride in place of pooling layers; such architectures can perform just as well without pooling.
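 A minimal NumPy sketch (illustrative, not from the book) of 2 x 2, stride-2 max pooling:

import numpy as np

def max_pool_2x2(x):
    # Split the activations into non-overlapping 2 x 2 blocks and keep
    # the largest value in each block, downsampling by a factor of 4.
    h, w = x.shape
    x = x[:h - h % 2, :w - w % 2]   # trim odd edges so blocks tile evenly
    return x.reshape(x.shape[0] // 2, 2, x.shape[1] // 2, 2).max(axis=(1, 3))

acts = np.array([[1., 3., 2., 0.],
                 [4., 2., 1., 5.],
                 [0., 1., 3., 2.],
                 [2., 0., 1., 4.]])
print(max_pool_2x2(acts))   # [[4. 5.] [2. 4.]]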

Convolutional Filter Hyperparameters

 Kernel size
 Padding
 Stride length
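 A hypothetical Keras layer showing where each of these hyperparameters appears (the parameter values are illustrative, not a recommendation):

from tensorflow.keras.layers import Conv2D

conv = Conv2D(filters=32,
              kernel_size=(3, 3),   # spatial extent of each filter
              padding='same',       # 'same' pads the borders; 'valid' does not
              strides=1,            # step size of the sliding window
              activation='relu')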

CNNs popularity in Machine Vision
1. They allow deep learning models to learn to recognize features in a position-invariant manner.
2. They remain faithful to the two-dimensional
structure of images, allowing features to be
identified within their spatial context.
3. They significantly reduce the number of
parameters required for modeling image data,
yielding higher computational efficiency.
4. Ultimately, they perform machine vision tasks
(e.g., image classification) more accurately.
LeNet-5 architecture
 The LeNet-5 architecture was created by Yann LeCun in
1998 and has been widely used for handwritten digit
recognition (MNIST).
 It introduced most of the elements in modern CNNs.

Layer  Type          Maps  Size     Kernel  Stride  Activation
Out    Fully conn.   -     10       -       -       RBF
F6     Fully conn.   -     84       -       -       tanh
C5     Convolution   120   1 x 1    5 x 5   1       tanh
S4     Avg pooling   16    5 x 5    2 x 2   2       tanh
C3     Convolution   16    10 x 10  5 x 5   1       tanh
S2     Avg pooling   6     14 x 14  2 x 2   2       tanh
C1     Convolution   6     28 x 28  5 x 5   1       tanh
In     Input         1     32 x 32  -       -       -
AlexNet (2012) architecture
 Similar to LeNet-5, only much larger and deeper.
 Over 20,000,000 parameters!

Layer  Type          Maps  Size       Kernel   Stride  Padding  Activation
Out    Fully conn.   -     1000       -        -       -        Softmax
F10    Fully conn.   -     4096       -        -       -        ReLU
F9     Fully conn.   -     4096       -        -       -        ReLU
S8     Max pooling   256   6 x 6      3 x 3    2       Valid    -
C7     Convolution   256   13 x 13    3 x 3    1       Same     ReLU
C6     Convolution   384   13 x 13    3 x 3    1       Same     ReLU
C5     Convolution   384   13 x 13    3 x 3    1       Same     ReLU
S4     Max pooling   256   13 x 13    3 x 3    2       Valid    -
C3     Convolution   256   27 x 27    5 x 5    1       Same     ReLU
S2     Max pooling   96    27 x 27    3 x 3    2       Valid    -
C1     Convolution   96    55 x 55    11 x 11  4       Valid    ReLU
In     Input         3     227 x 227  -        -       -        -
AlexNet (2012) architecture
 First model to stack convolutional layers directly on top of each other, rather than always placing a pooling layer after each convolutional layer.
 To reduce overfitting, the authors used two regularization techniques.
 First, they applied dropout with a 50% dropout rate during training to the outputs of layers F9 and F10.
 Second, they performed data augmentation by randomly shifting the training images by various offsets, flipping them horizontally, and changing the lighting conditions.

AlexNet (2012) architecture
 We moved beyond monochromatic MNIST into full-colour images of a larger size, 227 x 227.
 AlexNet used larger filter sizes in the earliest convolutional layers relative to what is popular today, for example kernel_size=(11, 11).
 It used a larger stride of 4 in the first convolutional layer.
 It used ReLU activations rather than tanh.

VGG-16 architecture
 VGGNet follows the same repeated conv-pool block structure as AlexNet;
 VGGNet simply has more of these blocks, with smaller (all 3 x 3) kernel sizes.
Vanishing Gradients: The Bête Noire of
Deep CNNs
 With more layers, models are able to learn
a larger variety of relatively low-level
features in the early layers, and
increasingly complex abstractions are
made possible in the later layers via
nonlinear recombination.
 This approach, however, has limits: If we
continue to simply make our networks
deeper, they will eventually be debilitated
by the vanishing gradient problem.
Vanishing Gradients: The Bête Noire of
Deep CNNs
 The basic issue of vanishing gradients is that parameters in the early layers of the network are far away from the cost function, the source of the gradient that is propagated backward through the network.
 As the error is backpropagated, each layer closer to the input receives a smaller and smaller update.
 The net effect is that the early layers of increasingly deep networks become more difficult to train.
 ResNet introduced an innovative solution to this
problem – skip (or residual) connections.
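 One way to see the problem, and why skip connections help (a standard backpropagation argument, not spelled out on the slide): by the chain rule, the gradient reaching an early layer is a product of one Jacobian per later layer, so when those factors are typically smaller than 1 the gradient shrinks exponentially with depth:

$$\frac{\partial C}{\partial \mathbf{a}^{(1)}} = \frac{\partial \mathbf{a}^{(2)}}{\partial \mathbf{a}^{(1)}}\,\frac{\partial \mathbf{a}^{(3)}}{\partial \mathbf{a}^{(2)}}\cdots\frac{\partial \mathbf{a}^{(L)}}{\partial \mathbf{a}^{(L-1)}}\,\frac{\partial C}{\partial \mathbf{a}^{(L)}}$$

With a skip connection, $\mathbf{a}^{(l)} = \mathbf{a}^{(l-1)} + F\big(\mathbf{a}^{(l-1)}\big)$, so each factor becomes $I + \partial F/\partial \mathbf{a}^{(l-1)}$, and the identity term lets the gradient flow backward undiminished even when $F$'s own gradient is small.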
Residual Networks
 Residual Network (ResNet), from Microsoft Research, was the winner of the ILSVRC 2015 challenge.
 It delivered an astounding top-5 error rate under 3.6%.
 The winning variant used an extremely deep CNN composed of 152 layers.
 It confirmed the general trend: models are getting deeper and deeper, with fewer and fewer parameters.
 The key to being able to train such a deep network is the use of skip connections (also called shortcut connections), sketched below.
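 A minimal Keras sketch of a skip connection (this assumes the input already has `filters` channels; it is not ResNet's exact block design):

from tensorflow.keras import layers

def residual_block(x, filters):
    shortcut = x   # the skip connection: carry the input forward unchanged
    y = layers.Conv2D(filters, 3, padding='same', activation='relu')(x)
    y = layers.Conv2D(filters, 3, padding='same')(y)
    y = layers.Add()([shortcut, y])   # add the input back onto the output
    return layers.Activation('relu')(y)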
Residual Networks
 The residual modules either learn something useful and contribute to reducing the error of the network, or they perform an identity mapping and do nothing at all.
 This neutral-or-better characteristic of residual networks helps overcome the issue of vanishing gradients.
 When several residual modules are stacked, the later residual modules receive inputs that are increasingly complex combinations of the outputs of earlier residual modules and their skip connections.
Residual Networks
 Residual connections squeezed out more juice relative to existing networks by enabling much deeper architectures without the decrease in performance that extra layers would otherwise cause when they fail to learn useful information about the problem.
 In 2015, ResNet took first place not only in the ILSVRC image-classification competition but in the object-detection and image-segmentation categories, too.
Applications of Machine Vision

 Object detection is tasked with drawing bounding boxes around objects in an image.
 Semantic segmentation identifies all objects of a particular class down to the pixel level.
 Instance segmentation discriminates between different instances of a particular class, also at the pixel level.
Object Detection
 Object detection has broad applications, such as detecting pedestrians in the field of view for autonomous driving, or identifying anomalies in medical images.
 Object detection consists of two tasks:
   detection (identifying where the objects are)
   classification (identifying what the objects are)
Object Detection – R-CNN
 R-CNN was proposed in 2013 by Ross
Girshick and his colleagues at UC
Berkeley.
 The algorithm was modeled on the
attention mechanism of the human brain,
wherein an entire scene is scanned and
focus is placed on specific regions of
interest.

Object Detection – R-CNN
 To emulate this attention, R-CNNs:
1. Perform a selective search for regions of
interest (ROIs) within the image.
2. Extract features from these ROIs by using
a CNN.
3. Combine two “traditional” ML approaches
(linear regression and SVMs) to refine the
locations of bounding boxes and classify
objects within each of those boxes.
Object Detection – R-CNN

[Figure from [Elgendy 20] Deep Learning for Vision Systems]

Object Detection – R-CNN
 R-CNNs achieved a massive gain in performance over the previous best model in the Pattern Analysis, Statistical Modeling and Computational Learning (PASCAL) Visual Object Classes (VOC) competition.
 However, they have some limitations:
   Inflexible: the input size is fixed to a single specific image shape.
   Slow and computationally expensive: both training and inference are multistage processes involving CNNs, linear regression models, and SVMs.
Object Detection – Fast R-CNN
 To address the primary drawback of R-CNN, its speed, Girshick went on to develop Fast R-CNN.
 The chief innovation here was the realization that during step 2 of the R-CNN algorithm, the CNN was unnecessarily being run multiple times, once for each region of interest.
 With Fast R-CNN, the ROI search (step 1) is run as before, but during step 2, the CNN is given a single global look at the image, and the extracted features are used for all ROIs simultaneously.
Object Detection – Fast R-CNN
 A vector of features is extracted from the final
layer of the CNN, which (for step 3) is then fed
into a dense network along with the ROI.
 This dense net learns to focus on only the
features that apply to each individual ROI,
culminating in two outputs per ROI:
1. A softmax probability output over the
classification categories (for a prediction of
what class the detected object belongs to)
2. A bounding box regressor (for refinement of
the ROI’s location)
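 A hypothetical Keras sketch of such a two-headed output (layer sizes and names are illustrative, not from the Fast R-CNN code):

from tensorflow.keras import layers, Model

NUM_CLASSES = 21   # e.g., 20 object classes plus background (illustrative)

roi_features = layers.Input(shape=(1024,))   # pooled feature vector for one ROI
x = layers.Dense(512, activation='relu')(roi_features)
class_probs = layers.Dense(NUM_CLASSES, activation='softmax', name='class')(x)
box_deltas = layers.Dense(4, name='box')(x)  # refinement of the ROI's location
head = Model(roi_features, [class_probs, box_deltas])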
Object Detection – Fast R-CNN

[Figure from [Elgendy 20] Deep Learning for Vision Systems]
Object Detection – Fast R-CNN
 Following this approach, the Fast R-CNN
model has to perform feature extraction
using a CNN only once for a given image
(thereby reducing computational
complexity), and then the ROI search and
dense layers work together to finish the
object-detection task.
 As the name suggests, the reduced
computational complexity of Fast R-CNN
corresponds to speedier compute times.
Object Detection – Faster R-CNN
 Faster R-CNN was proposed in 2015 by
Shaoqing Ren and his coworkers at
Microsoft Research.
 To overcome the ROI-search bottleneck of
R-CNN and Fast R-CNN, Ren and his
colleagues had the cunning insight to
leverage the feature activation maps
from the model’s CNN for this step, too.
 Those activation maps contain a great deal
of contextual information about an image.
Object Detection – Faster R-CNN
 These feature maps contain rich detail
about what is in an image and where it is.
 Faster R-CNN takes advantage of this rich
detail to propose ROI locations, enabling a
CNN to seamlessly perform all three
steps of the object-detection process,
thereby providing a unified model
architecture that builds on R-CNN and Fast
R-CNN but is markedly quicker.

Object Detection – Faster R-CNN

[Figure from [Elgendy 20] Deep Learning for Vision Systems]

Object Detection – Faster R-CNN

[Figure from [Elgendy 20] Deep Learning for Vision Systems; reports mean average precision (mAP)]
Object Detection – YOLO
 The R-CNN approach focused on the individual
proposed ROIs as opposed to the whole input
image.
 Joseph Redmon and coworkers published You Only Look Once (YOLO) in 2015, which bucked this trend.
 YOLO begins with a pretrained CNN for feature
extraction.
 Next, the image is divided into a series of cells,
and, for each cell, a number of bounding boxes
and object-classification probabilities are
predicted.
Object Detection – YOLO
 Bounding boxes with class probabilities above a threshold value are selected, and these combine to locate an object within an image (see the sketch below).
 We can think of the YOLO method as aggregating many smaller bounding boxes, but only if they have a reasonably good probability of containing a given object class.
 The algorithm improved on the speed of Faster R-CNN, but it struggled to accurately detect small objects in an image.
   Later versions of YOLO are a lot more accurate than R-CNNs!
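 A toy sketch of the thresholding step described above (plain Python; not the actual YOLO code, which also merges overlapping boxes):

def filter_boxes(boxes, class_probs, threshold=0.6):
    # Keep only boxes whose best class probability clears the threshold;
    # each kept entry records the box, its best class index, and its score.
    kept = []
    for box, probs in zip(boxes, class_probs):
        best = max(probs)
        if best >= threshold:
            kept.append((box, probs.index(best), best))
    return kept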
Object Detection
 The R-CNN family of networks has three main variations: R-CNN, Fast R-CNN, and Faster R-CNN.
 R-CNN and Fast R-CNN use a selective search algorithm to propose ROIs, whereas Faster R-CNN is an end-to-end deep learning system that uses a region proposal network to propose ROIs.
 The R-CNN approach is a multi-stage detector: it separates predicting the objectness score of a bounding box and predicting the object class into two different stages.
 YOLO models are single-stage detectors: the image is passed once through the network to predict both the objectness score and the object class.
 In general, single-stage detectors tend to be less accurate (as in early versions of YOLO) than two-stage detectors, but they are significantly faster.
Image Segmentation – Mask R-CNN
 Mask R-CNN was developed by Facebook AI Research
(FAIR) in 2017. This approach involves:
1. Using the existing Faster R-CNN architecture to propose
ROIs within the image that are likely to contain objects.
2. An ROI classifier predicting what kind of object exists in
the bounding box while also refining the location and
size of the bounding box.
3. Using the bounding box to grab the parts of the feature
maps from the underlying CNN that correspond to that
part of the image.
4. Feeding the feature maps for each ROI into a fully
convolutional network that outputs a mask indicating
which pixels correspond to the object in the image.
Image Segmentation – Mask R-CNN
 Image segmentation problems require binary masks as labels for training.
 These consist of arrays of the same dimensions as the original image.
 However, instead of RGB pixel values, they contain 1s and 0s indicating where in the image the object is, with the 1s representing a given object's pixel-by-pixel location (and the 0s representing everywhere else).
 If an image contains a dozen different objects, then it must have a dozen binary masks.
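 For example, a hypothetical 4 x 4 image containing a single object in its top-left corner would have this mask:

import numpy as np

# 1s mark the object's pixels; 0s mark everything else.
mask = np.array([[1, 1, 0, 0],
                 [1, 1, 0, 0],
                 [0, 0, 0, 0],
                 [0, 0, 0, 0]])
print(mask.shape)   # matches the image's height and width: (4, 4)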
Image Segmentation – U-Net
 Another popular image segmentation model is U-Net, which was developed at the University of Freiburg (2015) for the purpose of segmenting biomedical images.

Image Segmentation – U-Net
 The U-Net model consists of a fully convolutional
architecture, which begins with a contracting path
that produces successively smaller and deeper
activation maps through multiple convolution and
max-pooling steps.
 Subsequently, an expanding path restores these
deep activation maps back to full resolution
through multiple upsampling and convolution
steps.
 These two paths—the contracting and expanding
paths—are symmetrical (forming a “U” shape).
Image Segmentation – U-Net
 The contracting path serves to allow the model to learn high-resolution features from the image.
 These high-resolution features are handed directly to the expanding path (through skip connections).
 After concatenating the feature maps from the contracting path onto the expanding path, a subsequent convolutional layer allows the network to learn to assemble and localize these features precisely (see the sketch below).
 The final result is a network that is highly adept at both identifying features and locating them.
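 A minimal Keras sketch of one such skip connection across the "U" (the shapes are illustrative):

from tensorflow.keras import Input, layers

contracting = Input(shape=(128, 128, 64))   # high-res features, contracting path
deep = Input(shape=(64, 64, 128))           # deeper, lower-resolution features

up = layers.UpSampling2D(2)(deep)                  # restore spatial resolution
merged = layers.Concatenate()([contracting, up])   # hand features across the "U"
out = layers.Conv2D(64, 3, padding='same', activation='relu')(merged)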
Transfer Learning
 Transfer learning takes advantage of the library of existing visual elements contained within the feature maps of a pretrained CNN and repurposes them to become specialized in identifying new classes of objects.
 Four tools in our deep learning toolbox:
   Training from scratch on a small dataset
   Data augmentation to increase dataset size
   Feature extraction using a pretrained model
   Fine-tuning a pretrained model
Transfer Learning
 Now, we apply transfer learning with VGG19 for hot dog detection.

from tensorflow.keras.applications.vgg19 import VGG19

# Load VGG19 pretrained on ImageNet, without its classifier head:
vgg19 = VGG19(include_top=False,
              weights='imagenet',
              input_shape=(224, 224, 3),
              pooling=None)

# Freeze all the layers in the base VGG19 model:
for layer in vgg19.layers:
    layer.trainable = False
Transfer Learning
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Flatten, Dropout, Dense
from tensorflow.keras.preprocessing.image import ImageDataGenerator

model = Sequential()
model.add(vgg19)

# Add the custom classification layers atop the VGG19 base:
model.add(Flatten(name='flattened'))
model.add(Dropout(0.5, name='dropout'))
model.add(Dense(2, activation='softmax', name='predictions'))

# Augment the training images to enlarge the effective dataset:
train_datagen = ImageDataGenerator(
    rescale=1.0/255,
    data_format='channels_last',
    rotation_range=30,
    horizontal_flip=True,
    fill_mode='reflect')
Transfer Learning
# Validation images are only rescaled, never augmented:
valid_datagen = ImageDataGenerator(
    rescale=1.0/255,
    data_format='channels_last')

train_generator = train_datagen.flow_from_directory(
    directory='./hot-dog-not-hot-dog/train',
    target_size=(224, 224),
    classes=['hot_dog', 'not_hot_dog'],
    class_mode='categorical',
    batch_size=32,
    shuffle=True,
    seed=42)

valid_generator = valid_datagen.flow_from_directory(
    directory='./hot-dog-not-hot-dog/test',
    ...)
Transfer Learning
# Compile the model for training:
model.compile(optimizer='adam', loss='categorical_crossentropy',
              metrics=['accuracy'])

# Train the transfer-learning model:
model.fit_generator(train_generator, steps_per_epoch=15,
                    epochs=16, validation_data=valid_generator,
                    validation_steps=15)

 With a small amount of training and almost no time spent on architectural considerations or hyperparameter tuning, we have a model that performs reasonably well (81% accuracy) on a rather complicated image-classification task: hot dog identification.
Chapter Summary
 Milestone CNN architectures
   LeNet-5
   AlexNet
   VGGNet
   Residual networks
 Object detection
   R-CNN approaches (R-CNN, Fast R-CNN, Faster R-CNN)
   YOLO
 Image segmentation
 Transfer learning
