
Introduction to the Course and to Digital Images


Giacomo Boracchi,
giacomo.boracchi@polimi.it
Artificial Neural Networks and Deep Learning
Politecnico di Milano
https://boracchi.faculty.polimi.it/
AN2DL, Boracchi
Who I am
Giacomo Boracchi (giacomo.boracchi@polimi.it)
Mathematician (Università Statale degli Studi di Milano 2004),
PhD in Information Technology (DEIB, Politecnico di Milano 2008)
Associate Professor since 2019 at DEIB, Polimi (Computer Science)

My research interests are mathematical and statistical methods for:


• Image analysis and processing
• Machine Learning and in particular unsupervised learning, change and
anomaly detection
… and the two combined
Teaching
Advanced courses taught:
• Artificial Neural Networks and Deep Learning (MSc, Polimi)
• Mathematical Models and Methods for Image Processing (MSc, Polimi)
• Advanced Deep Learning Models And Methods (PhD, Polimi)
• Online Learning and Monitoring (PhD, Polimi)
• Learning Sparse Representations for image and signal modeling (PhD, Polimi and TUNI)
• Computer Vision and Pattern Recognition (MSc, USI, Spring 2020)
• Image Analysis and Computer Vision (MSc, Polimi, Prof. Caglioti)
What is Computer Vision?
About Visual Recognition Problems

Computer Vision

An interdisciplinary scientific field that deals with how


computers can be made to gain high-level
understanding from digital images or videos

… which has grown incredibly fast in recent years

Object Detection

Redmon, J., & Farhadi, A. (2018). Yolov3: An incremental improvement. arXiv preprint arXiv:1804.02767.
Pose Estimation

Cao, Z., Simon, T., Wei, S. E., & Sheikh, Y. (2017). Realtime multi-person 2D pose estimation using part affinity fields. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 7291-7299).
Automatic Shelf Analysis

http://focal.systems


Our Solution

Objects are detected directly on a smartphone, using a single image of each template.
Quality Inspection: Wafer Defect Map (WDM)
The inspection tool outputs the (X, Y) coordinates of the defects detected on the wafer:

X     Y
1863  709
1346  3067
2858  17095
3392  3508
...   ...
282   6532
892   18888
4427  9873
WDM classes (defect patterns on the silicon wafer): BasketBall, ClusterBig, ClusterSmall, Donut, Fingerprints, GeometricScratch, Grid, HalfMoon, Incomplete, Ring, Slice, ZigZag

Di Bella, Carrera, Rossi, Fragneto, Boracchi, "Wafer Defect Map Classification Using Sparse Convolutional Neural Networks", ICIAP 2019
Our SSCN architecture
Performs both Classification and Novel-class detection

L. Frittoli, D. Carrera, B. Rossi, P. Fragneto, "Deep Open-Set Recognition for Silicon Wafer Production Monitoring", under review
Image Captioning

"little girl is eating piece of cake." "black cat is sitting on top of suitcase."

Andrej Karpathy, Li Fei-Fei, "Deep Visual-Semantic Alignments for Generating Image Descriptions", CVPR 2015
Even though sometimes it fails

CVPR Attendance Trend

[Figure: CVPR attendance per year; source: CVPR 2019 Welcome slides]
Lately, connection with ML
There has been a dramatic change in CV:
• Once, most techniques and algorithms were built upon a mathematical/statistical description of images
• Nowadays, machine-learning methods are much more popular
CVPR 2022

Today…
We will start fresh from the simplest visual recognition problem:
Image Classification

Then we will move to the most advanced ones…

But first, let’s have a look at images

Let’s start:
Digital Images

Photo Credits: Andrea Sanfilippo
RGB Images

R ∈ ℝ^{R×C}
G ∈ ℝ^{R×C}
B ∈ ℝ^{R×C}

I ∈ ℝ^{R×C×3}


RGB Images
Images are saved by encoding each color channel in 8 bits, so values are rescaled and cast to integers in [0, 255].

Example of a portion of the channel R ∈ ℝ^{R×C}:
123 122 134 121 132
122 121 125 132 124
119 127 137 119 139
RGB Images
Example pixel values: [0, 205, 155], [15, 17, 19], [230, 234, 233], [106, 124, 138]

[253, 5, 6] is an RGB triplet: R = 253; G = 5; B = 6
The Input of our Neural Network!
Three-dimensional arrays I ∈ ℝ^{R×C×3}

Python:
from skimage.io import imread

# Read the image
I = imread('bazar.jpg')

# Extract the color channels
R = I[:, :, 0]
G = I[:, :, 1]
B = I[:, :, 2]

Matlab:
% Extract the color channels
R = I(:, :, 1);
G = I(:, :, 2);
B = I(:, :, 3);

When loaded in memory, image sizes are much larger than on disk, where images are typically compressed (e.g., in JPEG format).
Videos

Higher dimensional images
Videos are sequences of images (frames).
If a frame is I ∈ ℝ^{R×C×3}, a video of T frames is V ∈ ℝ^{R×C×3×T}

print(V.shape)   # (144, 180, 3, 30)

In this example R = 144, C = 180, thus 5 color frames contain 144 × 180 × 3 × 5 = 388,800 values in [0, 255]: in principle, 388 KB.
Dimension increases very quickly
Without compression: 1 byte per color per pixel
1 frame in full HD (R = 1080, C = 1920) ≈ 6 MB
1 sec in full HD (24 fps) ≈ 150 MB

Fortunately, visual data are very redundant, thus compressible.

This has to be taken into account when you design a machine learning algorithm for images or videos:
• e.g., while training a neural network this information is not compressed!
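The arithmetic above can be double-checked with a quick sketch (pure Python; the only assumption is 1 byte per color channel per pixel):

```python
# Uncompressed size: 1 byte per color channel per pixel
R, C, channels = 1080, 1920, 3           # one full-HD frame
frame_bytes = R * C * channels           # 6,220,800 bytes, roughly 6 MB

fps = 24
second_bytes = frame_bytes * fps         # 149,299,200 bytes, roughly 150 MB

print(frame_bytes, second_bytes)
```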
Local (Spatial) Transformations:
Correlation
The most important image processing operation for classification problems
Local (Spatial) Transformation
These transformations mix the pixels locally "around" the neighborhood U of a given pixel:
G(r, c) = T_U[I](r, c)
where
• I is the input image
• G is the output image
• U is a neighbourhood, i.e., the region defining the output
• T_U : ℝ^3 → ℝ^3 or T_U : ℝ^3 → ℝ is the spatial transformation function

The output at pixel (r, c), i.e., T_U[I](r, c), is defined by all the intensity values {I(u, v) : (u − r, v − c) ∈ U}.
Local (Spatial) Filters
The dashed square represents {I(u, v) : (u − r, v − c) ∈ U}

[Figure: the neighborhood U centered at (r, c) in the input image I, and the corresponding value T_U[I](r, c) in the output image]

Here (u, v) has to be interpreted as a "displacement vector" w.r.t. the neighborhood center (r, c), e.g., (u, v) ∈ {(−1, −1), (−1, 0), (−1, 1), …}
Local (Spatial) Filters

[Figure: the same transformation T_U applied at two locations (r, c) and (r′, c′)]

• Space-invariant transformations are repeated for each pixel (they don't depend on (r, c))
• T can be either linear or nonlinear
Local Linear Filters
Linear Transformation: linearity implies that the output T[I](r, c) is a linear combination of the pixels in U:

T[I](r, c) = Σ_{(u,v) ∈ U} w(u, v) · I(r + u, c + v)

considering some weights {w(u, v)}. We can consider the weights as an image, or a filter h. The filter h entirely defines this operation.
Local Linear Filters
Linear Transformation: the filter weights can be associated to a matrix w:

T[I](r, c) = Σ_{(u,v) ∈ U} w(u, v) · I(r + u, c + v)

w(−1, −1)  w(−1, 0)  w(−1, 1)
w(0, −1)   w(0, 0)   w(0, 1)
w(1, −1)   w(1, 0)   w(1, 1)

This operation is repeated for each pixel in the input image.
Correlation
The correlation between a filter w = {w(u, v)} and an image I is defined as

(I ⊗ w)(r, c) = Σ_{u=−L..L} Σ_{v=−L..L} w(u, v) · I(r + u, c + v)

where the filter w is of size (2L + 1) × (2L + 1) and contains the weights defined before. The filter w is also sometimes called a "kernel".

[Figure: the point-wise products w(u, v) · I(r + u, c + v) over the neighborhood of (r, c) are summed to yield (I ⊗ w)(r, c)]
Correlation

# One output pixel of the correlation (note: write into a separate
# output image, not into the input)
acc = 0
for i in np.arange(template_height):
    for j in np.arange(template_width):
        acc = acc + image[y + i, x + j] * template[i, j]
output[y + template_height // 2, x + template_width // 2] = acc
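A minimal, self-contained sketch of the definition above (the function name `correlate2d` and the zero border handling are choices made here, not part of the slides):

```python
import numpy as np

# 2-D correlation with a (2L+1)x(2L+1) filter w, following the definition
# above. Borders where the filter does not fit are left at zero.
def correlate2d(I, w):
    L = w.shape[0] // 2
    G = np.zeros_like(I, dtype=float)
    for r in range(L, I.shape[0] - L):
        for c in range(L, I.shape[1] - L):
            patch = I[r - L:r + L + 1, c - L:c + L + 1]
            G[r, c] = np.sum(w * patch)   # point-wise products, then sum
    return G

I = np.arange(25, dtype=float).reshape(5, 5)
w = np.ones((3, 3)) / 9.0                 # averaging filter
G = correlate2d(I, w)
print(G[2, 2])                            # mean of the central 3x3 patch: 12.0
```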
Correlation for BINARY target matching

[Figure: I ⊗ w, the binary image correlated with the target used as a filter]

Easy to understand with binary images: the target is used as a filter.
[Figure: the correlation map I ⊗ w attains its maximum at the target location]
However…

[Figure: correlating the image with the target also gives large responses over white regions]

Each point in a white area as big as the template achieves the maximum value (together with the perfect match).

Normalization is needed when using correlation for template matching!
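One common normalization (a sketch, not necessarily the one used in the slides) is normalized cross-correlation: subtracting the mean and dividing by the norms removes the bias toward uniformly bright regions.

```python
import numpy as np

# Normalized cross-correlation (NCC) between a patch and a template.
# Zero-mean, unit-norm comparison: a flat bright patch no longer wins.
def ncc(patch, template):
    p = patch - patch.mean()
    t = template - template.mean()
    denom = np.linalg.norm(p) * np.linalg.norm(t)
    return float(np.sum(p * t) / denom) if denom > 0 else 0.0

template = np.array([[0., 1., 0.],
                     [1., 1., 1.],
                     [0., 1., 0.]])       # a small cross-shaped target
flat_white = np.ones((3, 3))              # uniform bright patch

print(ncc(template, template))            # 1.0: perfect match
print(ncc(flat_white, template))          # 0.0: flat area is suppressed
```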
The Image Classification Problem

Image Classification

Λ = {"wheel", "cars", …, "castle", "baboon"}

x → "wheel"
x → "castle"

Photo credits: Giacomo Boracchi, Reply, November 2019
Image Classification

Λ = {"wheel", "cars", …, "castle", "baboon"}

x → "wheel" 65%, "tyre" 30%, …
x → "castle" 55%, "tower" 43%, …
Image Classification, the problem

Assign to an input image x ∈ ℝ^{R×C×3}:
• a label y from a fixed set of categories Λ = {"wheel", "cars", …, "castle", "baboon"}

x → f_θ(x) ∈ Λ
Linear Classifier
The basic building block for deep architectures

How to feed images to a NN?

[Figure: the image I ∈ ℝ^{R×C×3} is unrolled into a vector x ∈ ℝ^d and fed to the input layer, followed by hidden layer(s)]
Column-wise unfolding
Colors recall the color plane the values come from.

[Figure: the three channels R, G, B ∈ ℝ^{R×C} are unrolled column-wise and stacked into a single vector x ∈ ℝ^d, with d = R × C × 3]
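The column-wise unfolding can be sketched in numpy; `order='F'` flattens column-major, and the three planes are concatenated one after the other (one possible convention; the exact channel ordering is an assumption here):

```python
import numpy as np

# Column-wise unfolding of an RGB image into a single vector x in R^d
R, C = 32, 32
I = np.random.randint(0, 256, size=(R, C, 3))

# Flatten each color plane column-wise, then stack the planes
x = np.concatenate([I[:, :, k].flatten(order='F') for k in range(3)])

print(x.shape)   # (3072,): d = R * C * 3
```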
Classification over the CIFAR-10 dataset
The CIFAR-10 dataset contains 60000 images:
• Each image is 32×32 RGB
• Images are in 10 classes
• 6000 images per class

x ∈ ℝ^d, d = 3072
Learning Multiple Layers of Features from Tiny Images, Alex Krizhevsky, 2009.
A 1-layer NN to Classify Images
w_{i,j} is the weight applied to the j-th input component when computing the value at the i-th output neuron.

[Figure: input layer x ∈ ℝ^d fully connected, through weights w_{1,1}, …, w_{L,d}, to L output scores: «Dog score», «Cat score», …, «Horse score»]


A 1-layer NN to Classify Images

s_1 = Σ_{j=1..d} w_{1,j} x_j + b_1

[Figure: each of the L class scores is a weighted sum of all the input components plus a bias]


model.summary()
Layer (type) Output Shape Param #
=================================================================
Input (InputLayer) [(None, 32, 32, 3)] 0
_________________________________________________________________
Flatten (Flatten) (None, 3072) 0
_________________________________________________________________
Output (Dense) (None, 10) 30730
=================================================================
Total params: 30,730
Trainable params: 30,730
Non-trainable params: 0

Why don't we take a larger network?
Dimensionality prevents us from using deep NNs like those seen so far in a straightforward manner.

Let's take a network with a hidden layer having half the neurons of the input layer.

On CIFAR-10 images, the number of neurons would be:
• 3072 in the input layer
• 1536 in the hidden layer: 1,536 × 3,072 + 1,536 = 4,720,128 parameters (!)
• 10 in the output layer: 10 × 1,536 + 10 = 15,370 parameters
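The parameter counts above follow the usual dense-layer rule (weights plus biases) and can be checked in a couple of lines:

```python
# Parameter count of a dense layer: n_in * n_out weights plus n_out biases
def dense_params(n_in, n_out):
    return n_in * n_out + n_out

hidden = dense_params(3072, 1536)
output = dense_params(1536, 10)
print(hidden, output)   # 4720128 15370
```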
A 1-layer NN to Classify Images
Rmk: we can arrange the weights in a matrix W ∈ ℝ^{L×d}; the score of the i-th class is then given by the inner product between the matrix row W[i, :] and x. The scores become:
s_i = W[i, :] · x + b_i

Rmk: nonlinearity is not needed here, since there are no layers following.

Rmk: we can also ignore the softmax in the output, since it would not change the order of the scores (it would just normalize them).

[Figure: input layer x ∈ ℝ^d fully connected to the L class scores]
A 1-layer NN to Classify Images

W ∈ ℝ^{L×d}:
-8.1 … 2.7 | 9.5 … -9.0 | -5.4 … 4.8
 9.0 … 5.4 | 4.8 …  1.2 |  9.5 … -8.0
 1.2 … 9.5 | -8.0 … 8.1 | -2.7 …  9.5

x = [23, …, 21, 34, 12, 34, …, 23]ᵀ (unroll the image column-wise)
b = [-2, 32, -1]ᵀ

𝒦(x; W, b) = W x + b = [-4, 22, 33]ᵀ = [s_1 (dog score), s_2 (cat score), s_3 (horse score)]

Rmk: colors indicate which color plane in the image the weights refer to.
This simple layer is a linear classifier
In linear classification 𝒦 is a linear function:
𝒦(x) = W x + b
where W ∈ ℝ^{L×d} and b ∈ ℝ^L are the parameters of the classifier 𝒦.
W are referred to as the weights, b as the bias vector.

[Numerical example: W x + b = [-4, 22, 33]ᵀ, the scores of the L classes]

CS231n: Convolutional Neural Networks for Visual Recognition http://cs231n.github.io/
This simple layer is a linear classifier
The classifier assigns to an input image the class corresponding to the largest score:
ŷ = argmax_{i=1,…,L} s_i
being s_i the i-th component of the vector
𝒦(x) = s = W x + b

Rmk: softmax is not needed as long as we take the largest score as output: this would be the one yielding the largest posterior.
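The whole classifier fits in a few lines of numpy. This is a sketch with made-up numbers (L = 3 classes, d = 4 "pixels"), not the slide's actual weights:

```python
import numpy as np

# A minimal linear classifier K(x) = W x + b
W = np.array([[0.2, -0.5, 0.1,  2.0],
              [1.5,  1.3, 2.1,  0.0],
              [0.0,  0.3, 0.2, -0.3]])
b = np.array([-2.0, 32.0, -1.0])
x = np.array([23.0, 34.0, 12.0, 34.0])

s = W @ x + b                        # one score per class
labels = ["dog", "cat", "horse"]
print(labels[int(np.argmax(s))])     # the class with the largest score
```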
The Parameters of a Linear Classifier
The score of a class is the weighted sum of all the image pixels.
The weights are actually the classifier parameters:

W:
-8.1 … 2.7 | 9.5 … -9.0 | -5.4 … 4.8
 9.0 … 5.4 | 4.8 …  1.2 |  9.5 … -8.0
 1.2 … 9.5 | -8.0 … 8.1 | -2.7 …  9.5
b: [-2, 32, -1]ᵀ

and indicate which are the most important pixels / colours.
Why nonlinear layers?
Each layer in a NN can be seen as a matrix multiplication (+ bias):
s = W x + b
If we stack 3 layers without activations:
s = W3 (W2 (W1 x + b1) + b2) + b3
this becomes equivalent to a single layer
s = W x + b
with W = W3 W2 W1 and b = W3 W2 b1 + W3 b2 + b3.

This is a further confirmation that it is pointless to stack many layers without including nonlinear activations.
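The collapse of stacked linear layers can be verified numerically (a sketch with random 4×4 matrices):

```python
import numpy as np

# Stacking linear layers without activations collapses to one linear layer
# with W = W3 W2 W1 and b = W3 W2 b1 + W3 b2 + b3.
rng = np.random.default_rng(0)
W1, W2, W3 = (rng.normal(size=(4, 4)) for _ in range(3))
b1, b2, b3 = (rng.normal(size=4) for _ in range(3))
x = rng.normal(size=4)

s_stacked = W3 @ (W2 @ (W1 @ x + b1) + b2) + b3

W = W3 @ W2 @ W1
b = W3 @ W2 @ b1 + W3 @ b2 + b3
s_single = W @ x + b

print(np.allclose(s_stacked, s_single))   # True
```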
Training the Linear Classifier

Training a Classifier
Given a training set TR and a loss function, find the parameters that minimize the loss function over the whole TR.
In the case of a linear classifier:

(W, b) = argmin_{W ∈ ℝ^{L×d}, b ∈ ℝ^L}  Σ_{(x_i, y_i) ∈ TR}  ℒ_{W,b}(x_i, y_i)

Solving this minimization problem provides the weights of our classifier.
Loss Function
Loss function: a function ℒ that measures our unhappiness with the scores assigned to training images.
The loss ℒ will be high on a training image that is not correctly classified, low otherwise.
Loss Function Minimization
The loss function can be minimized by gradient descent and all its variants (see Prof. Matteucci's classes).

The loss function typically has to be regularized to achieve a unique solution satisfying some desired property:

(W, b) = argmin_{W ∈ ℝ^{L×d}, b ∈ ℝ^L}  Σ_{(x_i, y_i) ∈ TR}  ℒ_{W,b}(x_i, y_i) + λ ℛ(W, b)

being λ > 0 a parameter balancing the two terms.
… Once Trained
The training data is used to learn the parameters W, b.
The classifier is expected to assign to the correct class a score that is larger than those assigned to the incorrect classes.
Once training is completed, it is possible to discard the whole training set and keep only the learned parameters W and b.
Geometric Interpretation of a Linear Classifier
W[i, :] is a d-dimensional vector containing the weights of the score function for the i-th class.
Computing the score function for the i-th class corresponds to computing the inner product (and summing the bias):
W[i, :] · x + b_i
Thus, the NN computes the inner products against L different weight vectors, and selects the one yielding the largest score (up to bias correction).

Rmk: these "inner product classifiers" operate independently; the output of the i-th row is not influenced by the weights of a different row.
Rmk: this would not be the case if the network had a hidden layer that mixed the outputs of intermediate layers.
Geometric Interpretation of a Linear Classifier
In Python, * denotes the element-wise product; here the inner product of vectors is meant:
np.inner(W[i, :], x) + b[i]
Geometric Interpretation
Interpret each image as a point in ℝ^d.
Each classifier is a weighted sum of pixels, which corresponds to a linear function in ℝ^d.
In ℝ^2 these would be
f(x1, x2) = w1 x1 + w2 x2 + b
Then, the points (x1, x2) yielding
f(x1, x2) = 0
form a line.
Thus, in ℝ^2 the boundary that separates positive from negative scores for each class is a line. This boundary becomes a hyperplane in ℝ^d.
Template Matching Interpretation
In Python notation:
• W[i, :] is a d-dimensional vector containing the weights of the score function for the i-th class
• Computing the score function for the i-th class corresponds to computing the inner product
W[i, :] · x + b_i
Then, W[i, :] can be seen as a template used in matching (the output of the correlation at the central pixel).

The template W[i, :] is learned to best match the images belonging to the i-th class.

Let's have a look at these templates.
Correlation between two RGB images
The image and the filter have the same size

Bring the classifier weights back to images
d = R × C × 3
The row W[i, :] ∈ ℝ^d (e.g., the car classifier) can be reshaped back into three planes R, G, B ∈ ℝ^{R×C}: a car template in ℝ^{R×C×3}.
Templates Learned on the CIFAR-10 dataset

The Class Score
The classification score is then computed as the correlation between each input image and the «template» of the corresponding class:

(I ⊗ T_1)(0, 0) = Σ_{(x,y) ∈ U} T_1(x, y) · I(x, y)
Linear Classifier as a Template Matching
What has the classifier learned?
• That the background of bird and frog images is green (of plane and boat: blue)
• Cars are typically red
• Horses have two heads! 

The model was definitely too simple / the data were not enough to achieve higher performance and better templates.
However:
• Linear classifiers are among the most important layers of NNs
• Such a simple model can be interpreted (with more sophisticated models you typically can't)
Do it yourself!
https://colab.research.google.com/drive/1kflPH3CDgnvk1JptU0Cbp2LK-0w0h6R3?usp=sharing

Credits: Eugenio Lomurno (visualization with clipped colors)


Is Image Classification a
Challenging Problem?
Yes, it is…

First challenge: dimensionality

Images are very high-dimensional data

CIFAR-10 dataset
The CIFAR-10 dataset contains 60000 images:
• Each image is 32×32 RGB
• Images are in 10 classes
• 6000 images per class

Extremely small images, but high-dimensional: d = 32 × 32 × 3 = 3072
Learning Multiple Layers of Features from Tiny Images, Alex Krizhevsky, 2009.
This resolution is by far smaller than what we are used to

d = 3072

Former standard repository for ML research

d = 3072

• 88% of datasets have < 500 attributes
• 92% have < 3.2K attributes
Second challenge: label ambiguity

A label might not uniquely identify the image

Second challenge: label ambiguity

Man?
Beer?
Dinner?
Restaurant?
Sausages?
….

Third challenge: transformations

There are many transformations that change the image dramatically, while not its label

Changes in the Illumination Conditions

Giacomo Boracchi
Deformations

© Copyright Patrick Roper


Copyright Christine Matthews

View Point Change

… and many others

Fourth challenge: inter-class variability

Images in the same class might be dramatically different

Inter-class variability

Fifth challenge: perceptual similarity

Perceptual similarity in images is not related to pixel-similarity

Nearest Neighborhood Classifiers for Images
Assign to each test image the label of the closest image in the training set:
ŷ_j = y_{j*}, being j* = argmin_{i=1…N} d(x_j, x_i)

Distances are typically measured as
d(x_j, x_i) = ||x_j − x_i||_2^2 = Σ_k ((x_j)_k − (x_i)_k)^2
or
d(x_j, x_i) = ||x_j − x_i||_1 = Σ_k |(x_j)_k − (x_i)_k|
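The 1-NN rule above fits in a few lines of numpy. A minimal sketch (the tiny 2-D "images" and labels are made up for illustration):

```python
import numpy as np

# 1-NN classifier: label each test point with the label of its closest
# training point under the squared L2 distance.
def nn_classify(X_train, y_train, x):
    d = np.sum((X_train - x) ** 2, axis=1)   # squared L2 to every training image
    return y_train[int(np.argmin(d))]

X_train = np.array([[0., 0.], [10., 10.], [10., 11.]])
y_train = np.array(["cat", "dog", "dog"])
print(nn_classify(X_train, y_train, np.array([9., 9.])))   # "dog"
```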
Pixel-wise distance among images

K-Nearest Neighborhood Classifiers for Images
Assign to each test image the most frequent label among the K closest images in the training set:
ŷ_j = mode of the labels in 𝒰_K(x_j)
where 𝒰_K(x_j) contains the K closest training images to x_j.

Setting the parameter K and the distance measure is an issue.
Nearest Neighborhood Classifier (k-NN) for Images

[Figure: decision regions of the 1-NN classifier, and of k-NN for increasing values of k]
𝑘𝑘-NN for Images
Pros:
• Easy to understand and implement
• It takes no training time
Cons:
• Computationally demanding at test time, when 𝑇𝑇𝑇𝑇 is large
and 𝑑𝑑 is also large.
• Large training sets must be stored in memory.
• Rarely practical on images: distances on high-dimensional
objects are difficult to interpret.
Perceptual Similarity vs Pixel Similarity

The three images have the same pixel-wise distance from the original
one...
…but perceptually they are very different
Perceptual Similarity vs Pixel Similarity

[Figure: three modified images x1, x2, x3 of an original x0, all with ||x_j − x_0||_2 ≈ const, j = 1, 2, 3]
Let’s see what happens on the whole CIFAR10 using t-SNE

On CIFAR10 we see exactly this problem

Using any pixel-wise distance measure, and in particular ||x1 − x0||_2, to compare images is not appropriate.
On CIFAR10 we see exactly this problem

A special model is needed to handle images…
we'll see in the next class!
