2023 AN2DL Lez 1 Image Classification
AN2DL, Boracchi
Computer Vision
Object Detection
Redmon, J., & Farhadi, A. (2018). YOLOv3: An incremental improvement. arXiv preprint arXiv:1804.02767.
Pose Estimation
Cao, Z., Simon, T., Wei, S. E., & Sheikh, Y. (2017). Realtime multi-person 2D pose estimation using part affinity fields. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 7291-7299).
Automatic Shelf Analysis
Di Bella, Carrera, Rossi, Fragneto, Boracchi. Wafer Defect Map Classification Using Sparse Convolutional Neural Networks. ICIAP 2019.
Our SSCN architecture performs both Classification and Novel-class detection.
L. Frittoli, D. Carrera, B. Rossi, P. Fragneto, G. Boracchi. "Deep Open-Set Recognition for Silicon Wafer Production Monitoring", under review.
Image Captioning
"little girl is eating piece of cake." "black cat is sitting on top of suitcase."
Andrej Karpathy, Li Fei-Fei. Deep Visual-Semantic Alignments for Generating Image Descriptions. CVPR 2015.
Even though sometimes it fails
CVPR Attendance Trend
CVPR 2022
Today…
We will start fresh from the simplest visual recognition problem:
Image Classification
Let’s start:
Digital Images
Photo Credits: Andrea Sanfilippo
RGB Images
The three color channels are matrices R, G, B ∈ ℝ^(R×C); the image is the three-dimensional array I ∈ ℝ^(R×C×3).
[253, 5, 6] is an RGB triplet: R = 253, G = 5, B = 6.
Another example: [106, 124, 138].
The Input of our Neural Network!
Three-dimensional arrays I ∈ ℝ^(R×C×3).
When loaded in memory, images are much larger than on disk, where they are typically compressed (e.g., in JPEG format).
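As a minimal sketch of the point above, the snippet below builds a synthetic 32×32 RGB image as a NumPy array (random data, purely hypothetical; a real image would be loaded with a library such as PIL) and checks its uncompressed in-memory footprint:

```python
import numpy as np

# A synthetic 32x32 RGB image: a three-dimensional uint8 array (R x C x 3).
# Hypothetical data; a real image would be decoded from a compressed file.
I = np.random.default_rng(0).integers(0, 256, size=(32, 32, 3), dtype=np.uint8)

print(I.shape)   # (32, 32, 3)
print(I.nbytes)  # 3072 bytes uncompressed in memory; a JPEG of the same
                 # image on disk would typically be much smaller
```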
Videos
Higher dimensional images
Videos are sequences of images (frames).
If a frame is I ∈ ℝ^(R×C×3), a video of T frames is V ∈ ℝ^(R×C×3×T); for instance, print(V.shape) returns (144, 180, 3, 30).
This has to be taken into account when you design a machine learning algorithm for images or videos:
• e.g., during the training of a neural network, this information is not compressed!
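A quick sketch of the four-dimensional video tensor above, stacking T (hypothetical, all-zero) frames along the last axis:

```python
import numpy as np

# A video as a 4-dimensional array R x C x 3 x T:
# here 30 frames of 144x180 RGB, matching the shape in the slide.
T = 30
frames = [np.zeros((144, 180, 3), dtype=np.uint8) for _ in range(T)]
V = np.stack(frames, axis=-1)

print(V.shape)   # (144, 180, 3, 30)
print(V.nbytes)  # 2332800 bytes: in memory the frames are not compressed,
                 # unlike a video file on disk
```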
Local (Spatial) Transformations:
Correlation
Most important image processing operations for
classification problems
Local (Spatial) Transformation
These transformations mix all the pixels locally, "around" the neighborhood U of a given pixel:
G(r, c) = T_U[I](r, c)
Where
• I is the input image
• G is the output image
• U is a neighbourhood, which identifies the region defining the output
• T_U : ℝ^3 → ℝ^3 or T_U : ℝ^3 → ℝ is the spatial transformation function
The output at pixel (r, c), i.e., T_U[I](r, c), is defined by all the intensity values I(u, v) with (u − r, v − c) ∈ U.
Local (Spatial) Filters
The dashed square represents I(u, v), (u − r, v − c) ∈ U.
[Figure: the neighborhood U around pixel (r, c) in I, and the corresponding output value T_U[I](r, c).]
Here (u, v) has to be interpreted as a "displacement vector" w.r.t. the neighborhood center (r, c), e.g., (u, v) ∈ {(1, −1), (1, 0), (1, 1), …}.
• Space-invariant transformations are repeated for each pixel (they do not depend on (r, c)): the same T_U maps (r, c) to T_U[I](r, c) and (r′, c′) to T_U[I](r′, c′)
• T can be either linear or nonlinear
Local Linear Filters
Linear Transformation: linearity implies that the output T[I](r, c) is a linear combination of the pixels in U:
T[I](r, c) = Σ_{(u,v) ∈ U} w(u, v) · I(r + u, c + v)
considering some weights {w(u, v)}.
We can consider the weights as an image, or a filter h. The filter h entirely defines this operation.
Local Linear Filters
Linear Transformation: the filter weights can be associated to a matrix w:
T[I](r, c) = Σ_{(u,v) ∈ U} w(u, v) · I(r + u, c + v)
Correlation
The correlation between a filter w = {w(u, v)} of size (2L+1) × (2L+1) and an image I is defined as
(I ⊗ w)(r, c) = Σ_{u=−L}^{L} Σ_{v=−L}^{L} w(u, v) · I(r + u, c + v)
Correlation
acc = 0
for i in np.arange(template_height):
    for j in np.arange(template_width):
        acc = acc + image[y + i, x + j] * template[i, j]
output[y + template_height // 2, x + template_width // 2] = acc  # write to a separate output array, not back into image
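The per-pixel accumulation above can be wrapped into a complete (valid-range, no padding) correlation over the whole image. This is a didactic sketch, not an optimized routine (in practice one would use something like scipy.ndimage.correlate):

```python
import numpy as np

def correlate(image, template):
    """Valid-range correlation: slide the template over the image and
    accumulate the weighted sum at every admissible position."""
    H, W = image.shape
    th, tw = template.shape
    out = np.zeros((H - th + 1, W - tw + 1))
    for y in range(H - th + 1):
        for x in range(W - tw + 1):
            # the inner double loop of the slide, written as one vectorized sum
            out[y, x] = np.sum(image[y:y + th, x:x + tw] * template)
    return out

I = np.array([[0., 0, 0, 0],
              [0, 1, 1, 0],
              [0, 1, 1, 0],
              [0, 0, 0, 0]])
w = np.ones((2, 2))
G = correlate(I, w)
print(G)  # the 2x2 block of ones gives the maximum response, 4, at its location
```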
Correlation for BINARY target matching
[Figure: the image I is correlated with a binary target w; the correlation map I ⊗ w attains its maximum where the target is located.]
However…
[Figure: correlating the image with an all-white template.]
Normalization is needed when using correlation for template matching: each point in a white area as big as the template achieves the maximum value (together with the perfect match).
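One standard normalization is normalized cross-correlation (NCC): subtract the means and divide by the norms, so the score lies in [-1, 1] and a uniformly bright patch no longer wins. A minimal sketch (function name and toy patches are illustrative):

```python
import numpy as np

def ncc(patch, template):
    """Normalized cross-correlation of one patch with the template:
    zero-mean, unit-norm comparison, so a flat bright area no longer
    outscores the true match (result is in [-1, 1])."""
    p = patch - patch.mean()
    t = template - template.mean()
    denom = np.linalg.norm(p) * np.linalg.norm(t)
    return float((p * t).sum() / denom) if denom > 0 else 0.0

template = np.array([[0., 1], [1, 0]])
match = np.array([[0., 9], [9, 0]])   # same pattern as the template, brighter
white = np.array([[9., 9], [9, 8]])   # bright, nearly flat area

print(ncc(match, template))  # 1.0: perfect match despite the intensity scaling
print(ncc(white, template))  # much lower score
```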
The Image Classification Problem
Image Classification
x → "wheel"
x → "castle"
(Giacomo Boracchi, Reply, November 2019)
Image Classification
A classifier is a parametric function f_θ mapping each image x to a label in a fixed set Λ:
x → f_θ(x) ∈ Λ
Linear Classifier
The basic building block for deep architectures
How to feed images to NN?
The image I ∈ ℝ^(R×C×3), with color planes R, G, B ∈ ℝ^(R×C), is unrolled into a vector
x ∈ ℝ^d, d = R × C × 3
(e.g., x = [23, 21, 34, …, 12, 34, …, 23]^T).
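The unrolling step can be sketched in one line with NumPy (toy image; 'F' order corresponds to the column-wise unrolling mentioned later in the slides, though any fixed order works as long as it is consistent):

```python
import numpy as np

# Unroll an R x C x 3 image into a d-dimensional vector, d = R*C*3.
R, C = 32, 32
I = np.arange(R * C * 3, dtype=np.float32).reshape(R, C, 3)  # toy image
x = I.flatten(order='F')  # 'F' = column-wise unrolling

print(x.shape)  # (3072,) -- this vector is the input of the network
```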
Classification over the CIFAR-10 dataset
The CIFAR-10 dataset
contains 60000 images:
• Each image is 32x32 RGB
• Images are in 10 classes
• 6000 images per class
x ∈ ℝ^d, d = 3072
Learning Multiple Layers of Features from Tiny Images, Alex Krizhevsky, 2009.
A 1-layer NN to Classify Images
w_{i,j} is the weight applied to the j-th input component when computing the value of the i-th output neuron.
The input x ∈ ℝ^d is fully connected, through weights w_{1,1}, …, w_{L,d}, to L output neurons, one score per class («dog score», «cat score», …, «horse score»).
The score of the first class is
s_1 = Σ_{j=1…d} w_{1,j} x_j + b_1
Why don’t we take a larger network?
Dimensionality prevents us from using deep NNs as those seen so far in a straightforward manner.
Let's take a network with a hidden layer having half of the neurons of the input layer.
[Figure: the input x ∈ ℝ^d fully connected, through weights w_{1,1}, …, w_{L,d}, to the L class scores («dog score», «cat score», …, «horse score»).]
The hidden-layer nonlinearity is not needed here, since there are no layers following.
Rmk: we can also ignore the softmax in the output, since this would not change the order of the scores (it would just normalize them).
Input layer → Output layer.
A 1-layer NN to Classify Images
W ∈ ℝ^(L×d)

        [ -8.1 ...  2.7   9.5 ... -9.0  -5.4 ...  4.8 ]       [ -2 ]   [ -4 ]  s_1, dog score
Wx + b = [  9.0 ...  5.4   4.8 ...  1.2   9.5 ... -8.0 ] x + [ 32 ] = [ 22 ]  s_2, cat score
        [  1.2 ...  9.5  -8.0 ...  8.1  -2.7 ...  9.5 ]       [ -1 ]   [ 33 ]  s_3, horse score

with x = [23, 21, 34, …, 12, 34, …, 23]^T, b the bias vector, and s = K(x; W, b).
Rmk: in the slide, colors indicate to which color plane of the image these weights refer.
Unroll the image column-wise.
This simple layer is a linear classifier
In linear classification, K is a linear function:
K(x) = W x + b
where W ∈ ℝ^(L×d) and b ∈ ℝ^L are the parameters of the classifier K.
W are referred to as the weights, b as the bias vector.
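The forward pass K(x) = Wx + b is one matrix-vector product. A minimal sketch with toy random parameters (not trained weights) for L = 3 classes and d = 3072 inputs:

```python
import numpy as np

# Forward pass of the linear classifier K(x) = W x + b.
rng = np.random.default_rng(0)
L, d = 3, 3072
W = rng.standard_normal((L, d))   # toy weights (untrained)
b = rng.standard_normal(L)        # toy bias vector
x = rng.standard_normal(d)        # toy unrolled image

scores = W @ x + b                # s in R^L: one score per class
predicted = int(np.argmax(scores))  # the class with the largest score

print(scores.shape)  # (3,)
```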
Training the Linear Classifier
Training a Classifier
Given a training set TR and a loss function, define the parameters that minimize the loss function over the whole TR.
In the case of the linear classifier:
[W, b] = argmin_{W ∈ ℝ^(L×d), b ∈ ℝ^L} Σ_{(x_i, y_i) ∈ TR} ℒ_{W,b}(x_i, y_i)
Loss Function
Loss function: a function ℒ that measures our unhappiness with the scores assigned to training images.
The loss ℒ will be high on a training image that is not correctly classified, and low otherwise.
Loss Function Minimization
The loss function can be minimized by gradient descent and all its variants (see Prof. Matteucci's classes).
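A minimal sketch of this minimization, assuming softmax cross-entropy as the loss (one common choice; the slides do not fix a specific loss) and a toy two-class dataset of separable Gaussian blobs, all hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy training set TR: two Gaussian blobs in R^4, labels 0/1.
X = np.vstack([rng.standard_normal((50, 4)) + 2,
               rng.standard_normal((50, 4)) - 2])
y = np.array([0] * 50 + [1] * 50)
L_classes, d, n = 2, 4, len(y)

W = np.zeros((L_classes, d))
b = np.zeros(L_classes)
lr = 0.1  # learning rate

def loss_and_grads(W, b):
    s = X @ W.T + b                     # scores, shape n x L
    s -= s.max(axis=1, keepdims=True)   # numerical stability
    p = np.exp(s) / np.exp(s).sum(axis=1, keepdims=True)
    loss = -np.log(p[np.arange(n), y]).mean()
    dp = p.copy()
    dp[np.arange(n), y] -= 1            # softmax cross-entropy gradient
    dp /= n
    return loss, dp.T @ X, dp.sum(axis=0)

for step in range(100):                 # plain gradient descent
    loss, dW, db = loss_and_grads(W, b)
    W -= lr * dW
    b -= lr * db
```

On this separable toy set the loss drops quickly from log(2) toward zero; in practice one would use stochastic variants and a validation set.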
… Once Trained
The training data is used to learn the parameters 𝑊𝑊, 𝒃𝒃
The classifier is expected to assign to the correct class a score that is
larger than that assigned to the incorrect classes.
Once the training is completed, it is possible to discard the whole
training set and keep only the learned parameters.
Geometric Interpretation of a Linear Classifier
W[i, :] is a d-dimensional vector containing the weights of the score function for the i-th class.
Computing the score function for the i-th class corresponds to computing the inner product (and summing the bias):
⟨W[i, :], x⟩ + b_i
Thus, the NN computes the inner products against L different weight vectors, and selects the one yielding the largest score (up to bias correction).
Rmk: these "inner product classifiers" operate independently; the output of the i-th row is not influenced by the weights at a different row.
Rmk: this would not be the case if the network had a hidden layer that would mix the outputs of intermediate layers.
Geometric Interpretation of a Linear Classifier
In Python notation:
In Python, * denotes the element-wise product; here I mean the inner product of vectors:
np.inner(W[i, :], x) + b_i
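A tiny sketch of that distinction, with toy weights and inputs (L = 2, d = 3):

```python
import numpy as np

W = np.array([[1., 2, 3],
              [0., -1, 1]])  # toy weights: L = 2 classes, d = 3
x = np.array([1., 1, 1])
b = np.array([0.5, -0.5])

elementwise = W[0, :] * x              # array([1., 2., 3.]): NOT a score
score_0 = np.inner(W[0, :], x) + b[0]  # 1 + 2 + 3 + 0.5 = 6.5: the class-0 score

print(elementwise, score_0)
```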
Geometric Interpretation
Interpret each image as a point in ℝ^d.
Each classifier is a weighted sum of pixels, which corresponds to a linear function in ℝ^d.
In ℝ^2 these would be
f(x_1, x_2) = w_1 x_1 + w_2 x_2 + b
Then, the points (x_1, x_2) yielding
f(x_1, x_2) = 0
form a line.
Thus, in ℝ^2 the boundary that separates positive from negative scores for each class is a line. This boundary becomes a hyperplane in ℝ^d.
CS231n: Convolutional Neural Networks for Visual Recognition http://cs231n.github.io/
Template Matching Interpretation
In Python notation:
• W[i, :] is a d-dimensional vector containing the weights of the score function for the i-th class
• Computing the score function for the i-th class corresponds to computing the inner product
⟨W[i, :], x⟩ + b_i
Then, W[i, :] can be seen as a template used in matching (the output of correlation at the central pixel).
The template W[i, :] is learned to best match images belonging to the i-th class.
Bring the classifier weights back to images
d = R × C × 3
The row W[i, :] ∈ ℝ^d (e.g., the car classifier) can be reshaped back into an R × C × 3 image.
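A sketch of bringing a row of W back to image form (toy random weights; the unrolling order must match the one used to flatten the inputs, here 'F' for column-wise):

```python
import numpy as np

# Reshape the i-th row of W back into an R x C x 3 "template" image.
R, C = 32, 32
d = R * C * 3
W = np.random.default_rng(0).standard_normal((10, d))  # 10 toy class rows

template = W[3, :].reshape((R, C, 3), order='F')  # e.g. one class template

print(template.shape)  # (32, 32, 3): visualizable as an image
```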
Templates Learned on the CIFAR-10 dataset
The Class Score
The classification score is then computed as the correlation between each input image and the «template» of the corresponding class.
Linear Classifier as a Template Matching
What has the classifier learned?
• That the background of birds and frogs is green (and of planes and boats is blue)
• Cars are typically red
• Horses have two heads!
The model was definitely too simple / the data were not enough for achieving higher performance and better templates.
However:
• Linear classifiers are among the most important layers of NNs
• Such a simple model can be interpreted (with more sophisticated models you typically can't)
Do it yourself!
https://colab.research.google.com/drive/1kflPH3CDgnvk1JptU0Cbp2LK-0w0h6R3?usp=sharing
First challenge: dimensionality
CIFAR-10 dataset
The CIFAR-10 dataset contains 60000 images:
• Each image is 32×32 RGB
• Images are in 10 classes
• 6000 images per class
d = 3072
Former standard repository for ML research
Second challenge: label ambiguity
Man?
Beer?
Dinner?
Restaurant?
Sausages?
….
Third challenge: transformations
Changes in the Illumination Conditions
Deformations
View Point Change
… and many others
Fourth challenge: inter-class variability
Inter-class variability
Fifth problem: perceptual similarity
Nearest Neighbor Classifiers for Images
Assign to each test image the label of the closest image in the training set:
ŷ_j = y_{j*},  where j* = argmin_{i=1…N} d(x_j, x_i)
using, for instance, the pixel-wise distance
d(x_j, x_i) = ‖x_j − x_i‖ = Σ_k |x_j(k) − x_i(k)|
Pixel-wise distance among images
K-Nearest Neighbor Classifiers for Images
Assign to each test image the most frequent label among the K closest images in the training set:
ŷ_j = the mode of the labels in U_K(x_j)
where U_K(x_j) contains the K closest training images to x_j.
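The two definitions above fit in a few lines of NumPy. A sketch with toy "images" (random 3072-dimensional vectors in two well-separated clusters; function name hypothetical):

```python
import numpy as np

def knn_predict(X_train, y_train, x, K=3):
    """Label x with the most frequent label among its K nearest
    training images, using the L1 pixel-wise distance."""
    dists = np.abs(X_train - x).sum(axis=1)   # d(x, x_i) for every training i
    nearest = np.argsort(dists)[:K]           # indices of U_K(x)
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]          # the mode of the labels

# Toy training set: two well-separated clusters of 3072-dim vectors.
rng = np.random.default_rng(0)
X_train = np.vstack([rng.random((5, 3072)), rng.random((5, 3072)) + 10])
y_train = np.array([0] * 5 + [1] * 5)

print(knn_predict(X_train, y_train, X_train[0] + 0.01, K=3))  # 0
```

With K = 1 this reduces to the nearest neighbor classifier of the previous slide.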
Nearest Neighbor Classifier (k-NN) for Images
[Figure: decision regions of the 1-NN and k-NN classifiers on a 2D toy dataset, for different values of k.]
𝑘𝑘-NN for Images
Pros:
• Easy to understand and implement
• It takes no training time
Cons:
• Computationally demanding at test time, when 𝑇𝑇𝑇𝑇 is large
and 𝑑𝑑 is also large.
• Large training sets must be stored in memory.
• Rarely practical on images: distances on high-dimensional
objects are difficult to interpret.
Perceptual Similarity vs Pixel Similarity
The three images have the same pixel-wise distance from the original
one...
…but perceptually they are very different
Perceptual Similarity vs Pixel Similarity
[Figure: the original image x_0 and three modified images x_1, x_2, x_3, all at the same pixel-wise distance from x_0.]
Let’s see what happens on the whole CIFAR10 using t-SNE
On CIFAR10 we see exactly this problem
[Figure: t-SNE embedding of CIFAR-10 images based on pixel-wise distances: images that are close in pixel space are not necessarily of the same class.]