
Introduction to the Course and to Digital Images


Giacomo Boracchi,
giacomo.boracchi@polimi.it
Artificial Neural Networks and Deep Learning
Politecnico di Milano
https://boracchi.faculty.polimi.it/
AN2DL, Boracchi
Who I am
Giacomo Boracchi (giacomo.boracchi@polimi.it)
Mathematician (Università Statale degli Studi di Milano 2004),
PhD in Information Technology (DEIB, Politecnico di Milano 2008)
Associate Professor since 2019 at DEIB, Polimi (Computer Science)

My research interests are mathematical and statistical methods for:


• Image analysis and processing
• Machine Learning and in particular unsupervised learning, change and
anomaly detection
… and the two combined
Teaching
Advanced courses taught:
• Artificial Neural Networks and Deep Learning (MSc, Polimi)
• Mathematical Models and Methods for Image Processing (MSc, Polimi)
• Advanced Deep Learning Models And Methods (PhD, Polimi)
• Online Learning and Monitoring (PhD, Polimi)
• Learning Sparse Representations for image and signal modeling (PhD, Polimi and TUNI)
• Computer Vision and Pattern Recognition (MSc, USI, Spring 2020)
• Image Analysis and Computer Vision (MSc, Polimi, Prof. Caglioti)
What is Computer Vision?
About Visual Recognition Problems

Computer Vision

An interdisciplinary scientific field that deals with how


computers can be made to gain high-level
understanding from digital images or videos

… which has grown incredibly fast in recent years

Object Detection

Redmon, J., & Farhadi, A. (2018). Yolov3: An incremental improvement. arXiv preprint arXiv:1804.02767.
Pose Estimation

Cao, Z., Simon, T., Wei, S. E., & Sheikh, Y. (2017). Realtime multi-person 2D pose estimation using part affinity fields. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 7291-7299).
Automatic Shelf Analysis

http://focal.systems


Our Solution

Objects are detected directly on a smartphone, using a single image of each template.
Quality Inspection: Wafer Defect Map (WDM)
The inspection tool outputs the (X, Y) coordinates of the defects detected on the wafer:

X     Y
1863  709
1346  3067
2858  17095
3392  3508
...   ...
282   6532
892   18888
4427  9873
WDM classes (defect patterns on the silicon wafer): BasketBall, ClusterBig, ClusterSmall, Donut, Fingerprints, GeometricScratch, Grid, HalfMoon, Incomplete, Ring, Slice, ZigZag

Di Bella, Carrera, Rossi, Fragneto, Boracchi, "Wafer Defect Map Classification Using Sparse Convolutional Neural Networks", ICIAP 2019
Our SSCN architecture
Performs both Classification and Novel-class detection

L. Frittoli, D. Carrera, B. Rossi, P. Fragneto, "Deep Open-Set Recognition for Silicon Wafer Production Monitoring", under review
Image Captioning

"little girl is eating piece of cake." "black cat is sitting on top of suitcase."

Andrej Karpathy, Li Fei-Fei, "Deep Visual-Semantic Alignments for Generating Image Descriptions", CVPR 2015
Even though sometimes it fails

CVPR Attendance Trend

[Figure: CVPR attendance per year; source: CVPR 2019 Welcome slides]
Lately, connection with ML
There has been a dramatic change in CV:
• Once, most techniques and algorithms were built upon a mathematical/statistical description of images
• Nowadays, machine-learning methods are much more popular
CVPR 2022

Today…
We will start fresh from the simplest visual recognition problem:
Image Classification

Then we will move to the most advanced ones…

But first, let’s have a look at images

Let’s start:
Digital Images

Photo Credits: Andrea Sanfilippo
RGB Images

R ∈ ℝ^{R×C}
G ∈ ℝ^{R×C}
B ∈ ℝ^{R×C}

I ∈ ℝ^{R×C×3}


RGB Images
Images are saved by encoding each color channel in 8 bits, so values are rescaled and cast to integers in [0, 255].

Example of a portion of the channel R ∈ ℝ^{R×C}:
123 122 134 121 132
122 121 125 132 124
119 127 137 119 139
RGB Images
Example pixel values: [0, 205, 155], [15, 17, 19], [230, 234, 233], [106, 124, 138]

[253, 5, 6] is an RGB triplet: R = 253; G = 5; B = 6
The Input of our Neural Network!
Three-dimensional arrays I ∈ ℝ^{R×C×3}

Python:
from skimage.io import imread

# Read the image
I = imread('bazar.jpg')

# Extract the color channels
R = I[:, :, 0]
G = I[:, :, 1]
B = I[:, :, 2]

Matlab:
% Extract the color channels
R = I(:, :, 1);
G = I(:, :, 2);
B = I(:, :, 3);

When loaded in memory, image sizes are much larger than on disk, where images are typically compressed (e.g., in JPEG format).
Videos

Higher dimensional images
Videos are sequences of images (frames).
If a frame is I ∈ ℝ^{R×C×3}, a video of T frames is V ∈ ℝ^{R×C×3×T}

print(V.shape)   # (144, 180, 3, 30)

In this example R = 144, C = 180, thus 5 color frames contain 144 × 180 × 3 × 5 = 388,800 values in [0, 255]: in principle, 388 KB.
Dimension increases very quickly
Without compression: 1 byte per color per pixel
1 frame in full HD (R = 1080, C = 1920) ≈ 6 MB
1 sec in full HD (24 fps) ≈ 150 MB

Fortunately, visual data are very redundant, thus compressible.

This has to be taken into account when you design a machine learning algorithm for images or videos:
• e.g., while training a neural network this information is not compressed!
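The arithmetic above can be double-checked with a quick sketch (pure Python; the only assumption is 1 byte per color channel per pixel):

```python
# Uncompressed size: 1 byte per color channel per pixel
R, C, channels = 1080, 1920, 3           # one full-HD frame
frame_bytes = R * C * channels           # 6,220,800 bytes, roughly 6 MB

fps = 24
second_bytes = frame_bytes * fps         # 149,299,200 bytes, roughly 150 MB

print(frame_bytes, second_bytes)
```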
Local (Spatial) Transformations:
Correlation
The most important image processing operation for classification problems
Local (Spatial) Transformation
These transformations mix the pixels locally "around" the neighborhood U of a given pixel:
G(r, c) = T_U[I](r, c)
where
• I is the input image
• G is the output image
• U is a neighbourhood, i.e., the region defining the output
• T_U : ℝ^3 → ℝ^3 or T_U : ℝ^3 → ℝ is the spatial transformation function

The output at pixel (r, c), i.e., T_U[I](r, c), is defined by all the intensity values {I(u, v) : (u − r, v − c) ∈ U}.
Local (Spatial) Filters
The dashed square represents {I(u, v) : (u − r, v − c) ∈ U}

[Figure: the neighborhood U centered at (r, c) in the input image I, and the corresponding value T_U[I](r, c) in the output image]

Here (u, v) has to be interpreted as a "displacement vector" w.r.t. the neighborhood center (r, c), e.g., (u, v) ∈ {(−1, −1), (−1, 0), (−1, 1), …}
Local (Spatial) Filters

[Figure: the same transformation T_U applied at two locations (r, c) and (r′, c′)]

• Space-invariant transformations are repeated for each pixel (they don't depend on (r, c))
• T can be either linear or nonlinear
Local Linear Filters
Linear Transformation: linearity implies that the output T[I](r, c) is a linear combination of the pixels in U:

T[I](r, c) = Σ_{(u,v) ∈ U} w(u, v) · I(r + u, c + v)

considering some weights {w(u, v)}. We can consider the weights as an image, or a filter h. The filter h entirely defines this operation.
Local Linear Filters
Linear Transformation: the filter weights can be associated to a matrix w:

T[I](r, c) = Σ_{(u,v) ∈ U} w(u, v) · I(r + u, c + v)

w(−1, −1)  w(−1, 0)  w(−1, 1)
w(0, −1)   w(0, 0)   w(0, 1)
w(1, −1)   w(1, 0)   w(1, 1)

This operation is repeated for each pixel in the input image.
Correlation
The correlation between a filter w = {w(u, v)} and an image I is defined as

(I ⊗ w)(r, c) = Σ_{u=−L..L} Σ_{v=−L..L} w(u, v) · I(r + u, c + v)

where the filter w is of size (2L + 1) × (2L + 1) and contains the weights defined before. The filter w is also sometimes called a "kernel".

[Figure: the point-wise products w(u, v) · I(r + u, c + v) over the neighborhood of (r, c) are summed to yield (I ⊗ w)(r, c)]
Correlation

# One output pixel of the correlation (note: write into a separate
# output image, not into the input)
acc = 0
for i in np.arange(template_height):
    for j in np.arange(template_width):
        acc = acc + image[y + i, x + j] * template[i, j]
output[y + template_height // 2, x + template_width // 2] = acc
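A minimal, self-contained sketch of the definition above (the function name `correlate2d` and the zero border handling are choices made here, not part of the slides):

```python
import numpy as np

# 2-D correlation with a (2L+1)x(2L+1) filter w, following the definition
# above. Borders where the filter does not fit are left at zero.
def correlate2d(I, w):
    L = w.shape[0] // 2
    G = np.zeros_like(I, dtype=float)
    for r in range(L, I.shape[0] - L):
        for c in range(L, I.shape[1] - L):
            patch = I[r - L:r + L + 1, c - L:c + L + 1]
            G[r, c] = np.sum(w * patch)   # point-wise products, then sum
    return G

I = np.arange(25, dtype=float).reshape(5, 5)
w = np.ones((3, 3)) / 9.0                 # averaging filter
G = correlate2d(I, w)
print(G[2, 2])                            # mean of the central 3x3 patch: 12.0
```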
Correlation for BINARY target matching

[Figure: I ⊗ w, the binary image correlated with the target used as a filter]

Easy to understand with binary images: the target is used as a filter.
[Figure: the correlation map I ⊗ w attains its maximum at the target location]
However…

[Figure: correlating the image with the target also gives large responses over white regions]

Each point in a white area as big as the template achieves the maximum value (together with the perfect match).

Normalization is needed when using correlation for template matching!
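One common normalization (a sketch, not necessarily the one used in the slides) is normalized cross-correlation: subtracting the mean and dividing by the norms removes the bias toward uniformly bright regions.

```python
import numpy as np

# Normalized cross-correlation (NCC) between a patch and a template.
# Zero-mean, unit-norm comparison: a flat bright patch no longer wins.
def ncc(patch, template):
    p = patch - patch.mean()
    t = template - template.mean()
    denom = np.linalg.norm(p) * np.linalg.norm(t)
    return float(np.sum(p * t) / denom) if denom > 0 else 0.0

template = np.array([[0., 1., 0.],
                     [1., 1., 1.],
                     [0., 1., 0.]])       # a small cross-shaped target
flat_white = np.ones((3, 3))              # uniform bright patch

print(ncc(template, template))            # 1.0: perfect match
print(ncc(flat_white, template))          # 0.0: flat area is suppressed
```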
The Image Classification Problem

Image Classification

Λ = {"wheel", "cars", …, "castle", "baboon"}

x → "wheel"
x → "castle"

Photo credits: Giacomo Boracchi, Reply, November 2019
Image Classification

Λ = {"wheel", "cars", …, "castle", "baboon"}

x → "wheel" 65%, "tyre" 30%, …
x → "castle" 55%, "tower" 43%, …
Image Classification, the problem

Assign to an input image x ∈ ℝ^{R×C×3}:
• a label y from a fixed set of categories Λ = {"wheel", "cars", …, "castle", "baboon"}

x → f_θ(x) ∈ Λ
Linear Classifier
The basic building block for deep architectures

How to feed images to a NN?

[Figure: the image I ∈ ℝ^{R×C×3} is unrolled into a vector x ∈ ℝ^d and fed to the input layer, followed by hidden layer(s)]
Column-wise unfolding
Colors recall the color plane the values come from.

[Figure: the three channels R, G, B ∈ ℝ^{R×C} are unrolled column-wise and stacked into a single vector x ∈ ℝ^d, with d = R × C × 3]
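The column-wise unfolding can be sketched in numpy; `order='F'` flattens column-major, and the three planes are concatenated one after the other (one possible convention; the exact channel ordering is an assumption here):

```python
import numpy as np

# Column-wise unfolding of an RGB image into a single vector x in R^d
R, C = 32, 32
I = np.random.randint(0, 256, size=(R, C, 3))

# Flatten each color plane column-wise, then stack the planes
x = np.concatenate([I[:, :, k].flatten(order='F') for k in range(3)])

print(x.shape)   # (3072,): d = R * C * 3
```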
Classification over the CIFAR-10 dataset
The CIFAR-10 dataset contains 60000 images:
• Each image is 32×32 RGB
• Images are in 10 classes
• 6000 images per class

x ∈ ℝ^d, d = 3072
Learning Multiple Layers of Features from Tiny Images, Alex Krizhevsky, 2009.
A 1-layer NN to Classify Images
w_{i,j} is the weight applied to the j-th input component when computing the value at the i-th output neuron.

[Figure: input layer x ∈ ℝ^d fully connected, through weights w_{1,1}, …, w_{L,d}, to L output scores: «Dog score», «Cat score», …, «Horse score»]


A 1-layer NN to Classify Images

s_1 = Σ_{j=1..d} w_{1,j} x_j + b_1

[Figure: each of the L class scores is a weighted sum of all the input components plus a bias]


model.summary()
Layer (type) Output Shape Param #
=================================================================
Input (InputLayer) [(None, 32, 32, 3)] 0
_________________________________________________________________
Flatten (Flatten) (None, 3072) 0
_________________________________________________________________
Output (Dense) (None, 10) 30730
=================================================================
Total params: 30,730
Trainable params: 30,730
Non-trainable params: 0

Why don't we take a larger network?
Dimensionality prevents us from using deep NNs like those seen so far in a straightforward manner.

Let's take a network with a hidden layer having half the neurons of the input layer.

On CIFAR-10 images, the number of neurons would be:
• 3072 in the input layer
• 1536 in the hidden layer: 1,536 × 3,072 + 1,536 = 4,720,128 parameters (!)
• 10 in the output layer: 10 × 1,536 + 10 = 15,370 parameters
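The parameter counts above follow the usual dense-layer rule (weights plus biases) and can be checked in a couple of lines:

```python
# Parameter count of a dense layer: n_in * n_out weights plus n_out biases
def dense_params(n_in, n_out):
    return n_in * n_out + n_out

hidden = dense_params(3072, 1536)
output = dense_params(1536, 10)
print(hidden, output)   # 4720128 15370
```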
A 1-layer NN to Classify Images
Rmk: we can arrange the weights in a matrix W ∈ ℝ^{L×d}; the score of the i-th class is then given by the inner product between the matrix row W[i, :] and x. The scores become:
s_i = W[i, :] · x + b_i

Rmk: nonlinearity is not needed here, since there are no layers following.

Rmk: we can also ignore the softmax in the output, since it would not change the order of the scores (it would just normalize them).

[Figure: input layer x ∈ ℝ^d fully connected to the L class scores]
A 1-layer NN to Classify Images

W ∈ ℝ^{L×d}:
-8.1 … 2.7 | 9.5 … -9.0 | -5.4 … 4.8
 9.0 … 5.4 | 4.8 …  1.2 |  9.5 … -8.0
 1.2 … 9.5 | -8.0 … 8.1 | -2.7 …  9.5

x = [23, …, 21, 34, 12, 34, …, 23]ᵀ (unroll the image column-wise)
b = [-2, 32, -1]ᵀ

𝒦(x; W, b) = W x + b = [-4, 22, 33]ᵀ = [s_1 (dog score), s_2 (cat score), s_3 (horse score)]

Rmk: colors indicate which color plane in the image the weights refer to.
This simple layer is a linear classifier
In linear classification 𝒦 is a linear function:
𝒦(x) = W x + b
where W ∈ ℝ^{L×d} and b ∈ ℝ^L are the parameters of the classifier 𝒦.
W are referred to as the weights, b as the bias vector.

[Numerical example: W x + b = [-4, 22, 33]ᵀ, the scores of the L classes]

CS231n: Convolutional Neural Networks for Visual Recognition http://cs231n.github.io/
This simple layer is a linear classifier
The classifier assigns to an input image the class corresponding to the largest score:
ŷ = argmax_{i=1,…,L} s_i
being s_i the i-th component of the vector
𝒦(x) = s = W x + b

Rmk: softmax is not needed as long as we take the largest score as output: this would be the one yielding the largest posterior.
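The whole classifier fits in a few lines of numpy. This is a sketch with made-up numbers (L = 3 classes, d = 4 "pixels"), not the slide's actual weights:

```python
import numpy as np

# A minimal linear classifier K(x) = W x + b
W = np.array([[0.2, -0.5, 0.1,  2.0],
              [1.5,  1.3, 2.1,  0.0],
              [0.0,  0.3, 0.2, -0.3]])
b = np.array([-2.0, 32.0, -1.0])
x = np.array([23.0, 34.0, 12.0, 34.0])

s = W @ x + b                        # one score per class
labels = ["dog", "cat", "horse"]
print(labels[int(np.argmax(s))])     # the class with the largest score
```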
The Parameters of a Linear Classifier
The score of a class is the weighted sum of all the image pixels.
The weights are actually the classifier parameters:

W:
-8.1 … 2.7 | 9.5 … -9.0 | -5.4 … 4.8
 9.0 … 5.4 | 4.8 …  1.2 |  9.5 … -8.0
 1.2 … 9.5 | -8.0 … 8.1 | -2.7 …  9.5
b: [-2, 32, -1]ᵀ

and indicate which are the most important pixels / colours.
Why nonlinear layers?
Each layer in a NN can be seen as a matrix multiplication (+ bias):
s = W x + b
If we stack 3 layers without activations:
s = W3 (W2 (W1 x + b1) + b2) + b3
this becomes equivalent to a single layer
s = W x + b
with W = W3 W2 W1 and b = W3 W2 b1 + W3 b2 + b3.

This is a further confirmation that it is pointless to stack many layers without including nonlinear activations.
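The collapse of stacked linear layers can be verified numerically (a sketch with random 4×4 matrices):

```python
import numpy as np

# Stacking linear layers without activations collapses to one linear layer
# with W = W3 W2 W1 and b = W3 W2 b1 + W3 b2 + b3.
rng = np.random.default_rng(0)
W1, W2, W3 = (rng.normal(size=(4, 4)) for _ in range(3))
b1, b2, b3 = (rng.normal(size=4) for _ in range(3))
x = rng.normal(size=4)

s_stacked = W3 @ (W2 @ (W1 @ x + b1) + b2) + b3

W = W3 @ W2 @ W1
b = W3 @ W2 @ b1 + W3 @ b2 + b3
s_single = W @ x + b

print(np.allclose(s_stacked, s_single))   # True
```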
Training the Linear Classifier

Training a Classifier
Given a training set TR and a loss function, find the parameters that minimize the loss function over the whole TR.
In the case of a linear classifier:

(W, b) = argmin_{W ∈ ℝ^{L×d}, b ∈ ℝ^L}  Σ_{(x_i, y_i) ∈ TR}  ℒ_{W,b}(x_i, y_i)

Solving this minimization problem provides the weights of our classifier.
Loss Function
Loss function: a function ℒ that measures our unhappiness with the scores assigned to training images.
The loss ℒ will be high on a training image that is not correctly classified, low otherwise.
Loss Function Minimization
The loss function can be minimized by gradient descent and all its variants (see Prof. Matteucci's classes).

The loss function typically has to be regularized to achieve a unique solution satisfying some desired property:

(W, b) = argmin_{W ∈ ℝ^{L×d}, b ∈ ℝ^L}  Σ_{(x_i, y_i) ∈ TR}  ℒ_{W,b}(x_i, y_i) + λ ℛ(W, b)

being λ > 0 a parameter balancing the two terms.
… Once Trained
The training data is used to learn the parameters W, b.
The classifier is expected to assign to the correct class a score that is larger than those assigned to the incorrect classes.
Once training is completed, it is possible to discard the whole training set and keep only the learned parameters W and b.
Geometric Interpretation of a Linear Classifier
W[i, :] is a d-dimensional vector containing the weights of the score function for the i-th class.
Computing the score function for the i-th class corresponds to computing the inner product (and summing the bias):
W[i, :] · x + b_i
Thus, the NN computes the inner products against L different weight vectors, and selects the one yielding the largest score (up to bias correction).

Rmk: these "inner product classifiers" operate independently; the output of the i-th row is not influenced by the weights of a different row.
Rmk: this would not be the case if the network had a hidden layer that mixed the outputs of intermediate layers.
Geometric Interpretation of a Linear Classifier
In Python, * denotes the element-wise product; here the inner product of vectors is meant:
np.inner(W[i, :], x) + b[i]
Geometric Interpretation
Interpret each image as a point in ℝ^d.
Each classifier is a weighted sum of pixels, which corresponds to a linear function in ℝ^d.
In ℝ^2 these would be
f(x1, x2) = w1 x1 + w2 x2 + b
Then, the points (x1, x2) yielding
f(x1, x2) = 0
form a line.
Thus, in ℝ^2 the boundary that separates positive from negative scores for each class is a line. This boundary becomes a hyperplane in ℝ^d.
Template Matching Interpretation
In Python notation:
• W[i, :] is a d-dimensional vector containing the weights of the score function for the i-th class
• Computing the score function for the i-th class corresponds to computing the inner product
W[i, :] · x + b_i
Then, W[i, :] can be seen as a template used in matching (the output of the correlation at the central pixel).

The template W[i, :] is learned to best match the images belonging to the i-th class.

Let's have a look at these templates.
Correlation between two RGB images
The image and the filter have the same size

Bring the classifier weights back to images
d = R × C × 3
The row W[i, :] ∈ ℝ^d (e.g., the car classifier) can be reshaped back into three planes R, G, B ∈ ℝ^{R×C}: a car template in ℝ^{R×C×3}.
Templates Learned on the CIFAR-10 dataset

The Class Score
The classification score is then computed as the correlation between each input image and the «template» of the corresponding class:

(I ⊗ T_1)(0, 0) = Σ_{(x,y) ∈ U} T_1(x, y) · I(x, y)
Linear Classifier as a Template Matching
What has the classifier learned?
• That the background of bird and frog images is green (of plane and boat: blue)
• Cars are typically red
• Horses have two heads! 

The model was definitely too simple / the data were not enough to achieve higher performance and better templates.
However:
• Linear classifiers are among the most important layers of NNs
• Such a simple model can be interpreted (with more sophisticated models you typically can't)
Do it yourself!
https://colab.research.google.com/drive/1kflPH3CDgnvk1JptU0Cbp2LK-0w0h6R3?usp=sharing

Credits: Eugenio Lomurno (visualization with clipped colors)


Is Image Classification a
Challenging Problem?
Yes, it is…

First challenge: dimensionality

Images are very high-dimensional data

CIFAR-10 dataset
The CIFAR-10 dataset contains 60000 images:
• Each image is 32×32 RGB
• Images are in 10 classes
• 6000 images per class

Extremely small images, but high-dimensional: d = 32 × 32 × 3 = 3072
Learning Multiple Layers of Features from Tiny Images, Alex Krizhevsky, 2009.
This resolution is by far smaller than what we are used to

d = 3072

Former standard repository for ML research

d = 3072

• 88% of datasets have < 500 attributes
• 92% have < 3.2K attributes
Second challenge: label ambiguity

A label might not uniquely identify the image

Second challenge: label ambiguity

Man?
Beer?
Dinner?
Restaurant?
Sausages?
….

Third challenge: transformations

There are many transformations that change the image dramatically, while not its label

Changes in the Illumination Conditions

Giacomo Boracchi
Deformations

© Copyright Patrick Roper


Copyright Christine Matthews

View Point Change

… and many others

Fourth challenge: inter-class variability

Images in the same class might be dramatically different

Inter-class variability

Fifth challenge: perceptual similarity

Perceptual similarity in images is not related to pixel-similarity

Nearest Neighborhood Classifiers for Images
Assign to each test image the label of the closest image in the training set:
ŷ_j = y_{j*}, being j* = argmin_{i=1…N} d(x_j, x_i)

Distances are typically measured as
d(x_j, x_i) = ||x_j − x_i||_2^2 = Σ_k ((x_j)_k − (x_i)_k)^2
or
d(x_j, x_i) = ||x_j − x_i||_1 = Σ_k |(x_j)_k − (x_i)_k|
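The 1-NN rule above fits in a few lines of numpy. A minimal sketch (the tiny 2-D "images" and labels are made up for illustration):

```python
import numpy as np

# 1-NN classifier: label each test point with the label of its closest
# training point under the squared L2 distance.
def nn_classify(X_train, y_train, x):
    d = np.sum((X_train - x) ** 2, axis=1)   # squared L2 to every training image
    return y_train[int(np.argmin(d))]

X_train = np.array([[0., 0.], [10., 10.], [10., 11.]])
y_train = np.array(["cat", "dog", "dog"])
print(nn_classify(X_train, y_train, np.array([9., 9.])))   # "dog"
```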
Pixel-wise distance among images

K-Nearest Neighborhood Classifiers for Images
Assign to each test image the most frequent label among the K closest images in the training set:
ŷ_j = mode of the labels in 𝒰_K(x_j)
where 𝒰_K(x_j) contains the K closest training images to x_j.

Setting the parameter K and the distance measure is an issue.
Nearest Neighborhood Classifier (k-NN) for Images

[Figure: decision regions of the 1-NN classifier, and of k-NN for increasing values of k]
𝑘𝑘-NN for Images
Pros:
• Easy to understand and implement
• It takes no training time
Cons:
• Computationally demanding at test time, when 𝑇𝑇𝑇𝑇 is large
and 𝑑𝑑 is also large.
• Large training sets must be stored in memory.
• Rarely practical on images: distances on high-dimensional
objects are difficult to interpret.
Perceptual Similarity vs Pixel Similarity

The three images have the same pixel-wise distance from the original
one...
…but perceptually they are very different
Perceptual Similarity vs Pixel Similarity

[Figure: three modified images x1, x2, x3 of an original x0, all with ||x_j − x_0||_2 ≈ const, j = 1, 2, 3]
Let’s see what happens on the whole CIFAR10 using t-SNE

On CIFAR10 we see exactly this problem

Using any pixel-wise distance measure, and in particular ||x1 − x0||_2, to compare images is not appropriate.
On CIFAR10 we see exactly this problem

A special model is needed to handle images…
we'll see in the next class!
