ML Questions 2021


Questions of Machine Learning 2021

1. Which one describes the overfitting case?

Solution A. When a predictor has excellent performance on the training set but its
performance on real-world (test) data is very poor, then we are overfitting.

2. How to avoid overfitting (regularization methods)?


One of the solutions for avoiding overfitting is to use cross-validation.
Given the observations (our data), to be safe we do not use all the samples for
learning: we divide them into a training set and a test set, so that the test part of
the data is not involved in the training procedure. We can then further split the
training subset and perform cross-validation on it. At the end of the cross-validation
procedure, the model that we obtain is applied to the test set.

If the training set is large enough, we can further divide it into a real training part
and a validation part. The validation set plays a role similar to the test set, but it is
part of the original training sample and we have access to it during training.

Besides cross-validation, there is another important aspect that we have to take into
consideration to avoid overfitting: the complexity of the function that we are
defining. When we learn something, a good practice is to control complexity by
penalizing complex models, and this procedure is called regularization.
Basically, when we learn something, in practice we are minimizing the Empirical
Risk, but we have to be careful and avoid overly complex solutions. Applying
regularization means including in the minimization an additional term that measures
the complexity of the hypothesis h, weighted by a coefficient λ. The parameter λ is a
positive number that is used as a conversion rate between the loss and the
hypothesis complexity.
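A compact way of writing this regularized objective (standard notation, a sketch rather than a formula taken verbatim from the course material) is:

\min_{h \in \mathcal{H}} \; \frac{1}{n} \sum_{i=1}^{n} \ell\big(h(x_i), y_i\big) \; + \; \lambda\, \Omega(h), \qquad \lambda > 0,

where the first term is the Empirical Risk and Ω(h) measures the complexity of the hypothesis h (for example the squared norm of its weights, as in ridge regression).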

3. Types of SVMs used for other purposes besides linear classification (pairwise
ranking SVM, outlier detection SVMs)
Two variants of the SVM, named Support Vector Data Description (SVDD) and One-Class
SVM (OCSVM), are strategies to identify the outliers in a distribution of data.

SVDD: An intuitive way of doing outlier detection consists of estimating the density of the
data points in the feature space. The intuition is that samples in regions of high
density are inliers, while samples in low-density regions are outliers. In this SVM
variation the objective is to find the minimal circumscribing sphere around the data in
the feature space, i.e. the enclosing sphere with minimal radius.

OCSVM: One-Class SVM is nothing more than the standard SVM with just one class.
It corresponds exactly to an SVM where we want to maximize the margin of the
positive samples from the origin.
There is actually a case in which SVDD and OCSVM are equivalent. This
happens when the data points are mapped from the input space to a feature space in
which all the points have the same distance from the origin and can be
separated linearly from the origin.
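As a rough illustration (not from the original notes), this is how one-class outlier detection can be done with scikit-learn's OneClassSVM; the kernel, gamma and nu values here are arbitrary example choices:

import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 2))                      # mostly "normal" data (inliers)
X_test = np.vstack([rng.normal(size=(5, 2)),             # new inliers
                    rng.uniform(4.0, 6.0, size=(3, 2))])  # points far from the training cloud

# nu upper-bounds the fraction of training errors and lower-bounds the fraction of support vectors
clf = OneClassSVM(kernel="rbf", gamma="scale", nu=0.05).fit(X_train)
print(clf.predict(X_test))                               # +1 = inlier, -1 = outlier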

Pairwise Ranking SVM: The supervision in the Pairwise Ranking SVM is composed
of pairs of images, and the label of each pair is the rank of one image with respect to
the other. This set of pairs defines a set of constraints. For each attribute
it is possible to learn a scoring function that, given an image x and a set of
parameters, returns the rank of that image. The scoring function can be learned by
finding the best parameters that satisfy the constraints built from the pairs in the
training set. Since these constraints can be too strict and very complicated to
enforce in a simple optimization problem, they are relaxed into max-margin
relations using slack variables to tolerate errors. Solving this max-margin problem
under these constraints gives the best ranking function.
An example is the use of the Pairwise Ranking SVM to generate more precise textual
information about an image. For instance, given a novel image, a plain attribute
classifier would only give a positive or a negative answer about the fact
that the image is dense.

In another example, the novel image is evaluated using several ranking functions
trained with the Pairwise Ranking SVM.

Given a novel image, all the learned ranking functions are evaluated on it, and for each
attribute two reference images are identified. We can then say that the novel image is
more dense than one reference image and less dense than the other, instead of just
saying that it is dense or not dense.

4. Explain the difference between domain adaptation and transfer learning?


Domain Adaptation: if the test images are drawn from a completely different
distribution with respect to that of the training data, then the performance will drop
significantly.

In practice, domain adaptation studies exactly this kind of problem and tries to find a
way to close the domain gap between different kinds of images, or different kinds
of data, such that a model trained on one domain can be reused efficiently also
on a different but related domain.

Transfer Learning: the idea behind transfer learning is to include some prior
knowledge inside a deep model, which also helps when we have a
limited amount of training samples. Basically, we are transferring knowledge from a
previously trained model to a new model. Nowadays this is practically the most used
technique: it is really the standard, not an exception, and we have to think about
the application of transfer learning not only to classification but to many other tasks,
like for instance detection or image captioning. Transfer learning can also be thought
of as a sort of regularization, because it improves the generalization of the model by
transferring knowledge from another pre-existing model. So, it is a sort of
initialization and helps to generalize.

5. How is the freezing of the layers done in transfer learning?
Layer freezing means that the weights of a trained model are not changed when the model
is reused in a subsequent downstream task: they remain frozen. Essentially, when
backpropagation is performed during training, these layers' weights are left untouched.
E.g. when we perform transfer learning from a neural network trained on a
(usually bigger) dataset to make it classify a new dataset (usually smaller).
(E.g. Homework 2)

6. Given noisy examples that I want to classify with SVM, how best to set the
value of the C parameter? Explain slack variables.
SVM: we might have multiple separating lines

Some of these solutions might be the ones found by the perceptron. In SVM it is as if
we take our central separating hyperplane and generate two other hyperplanes
on the margin.

Choosing this hyperplane, rather than the others, our expectation is that if new data
arrive there will of course be some misclassifications, but it should do better than all
the other ones.

● w can be expressed as a weighted linear combination of only a subset of the
instances, and that is a key feature of support vector machines.
● Only the points that are on the margin determine the solution. So if, for
example, we have 300,000 points for spam and 300,000 points for ham, but
instead we had only one point for ham and two points for spam (the support
vectors), we would get exactly the same solution. Only the points on the margin matter.
● Whenever the problem is linearly separable, all the points that are not on the
margin are farther away and do not contribute to the solution.

If the training data includes such a point, linear separation is simply not possible.
With real data we might have some noise, so we give ourselves the permission to
make some mistakes. If we soften the margin a little bit, we find that we still have a
convex optimization problem, so we can still solve it with exactly the same techniques
we have seen so far. What we need to do is minimize the amount of slack.
What changes is that we are limiting the values of α. The constant C, which weighs
the slack penalty, affects the values that the α_i can take (0 ≤ α_i ≤ C). We also know
that the α_i are the coefficients that determine, among all the training data, which
points end up in the solution and which do not. Given that the slack variables are the
mistakes that we allow ourselves to make, C controls how much those mistakes are
penalized, and this coefficient tells us how soft or hard our margin is. The hard margin
is when we cannot make any mistakes (very large C); the smaller C is, the more mistakes
we can tolerate. This is related to the values that the α_i can take, because the values
of α determine how sparse or non-sparse our solution is. A sparse solution is a solution
with very few support vectors. With a soft-margin classifier we expect to have a lot of
support vectors, because we are allowing a lot of mistakes. For noisy data, therefore, it
is best not to set C too large: a moderately small C gives a softer margin that is less
influenced by the noisy points.

7. Explain K-means clustering, unsupervised learning algorithms.


Clustering is one of the most used unsupervised methods. It is the process of
creating groups of objects such that, within each group, the objects are similar, and
each group represents a class. The goal is to create groups characterized by
high intra-class similarity and low inter-class similarity.
One algorithm that can be used to cluster a set of data is k-means. This
algorithm can be applied under the assumption that the points are in a Euclidean
space, so that similarity is computed using the Euclidean distance.
As a first step, it is important to choose the number of clusters k that we want to
create and then initialize one centroid for each cluster.
Let's see an example to understand this concept better.

In Figure 1 we have the data points, which are in a Euclidean space, so the k-means
algorithm is applicable. The number of clusters k is chosen to be 2. The first
allocation of the centroids is done randomly (Figure 2).

For each data point, the Euclidean distance between the point and the
two centroids is computed, and each point is assigned to the nearest centroid (Figure 3).

Once all the data points are assigned, the centroids are recomputed: each
centroid becomes the average of all the points assigned to it (Figure 4).

With the new clusters (Figure 5), the distances between each data point and the
centroids are recomputed and the assignment is repeated. The centroids are
updated until we reach a reasonably good clustering solution, in which the data points
are assigned to the same cluster over several consecutive iterations.

How to choose k?
One evaluation consists in keeping track of the average distance between the data
points and their centroid while increasing k, and then plotting the resulting curve.
Note that this is not a real evaluation of the error because, as we said, in clustering
there are no labels, so we cannot really say what is right or wrong.

How do we select the k points?


1. At random.
2. At random but multiple times: re-initialize k-means multiple times,
compare the distance curves and choose the one with the lowest elbow.
3. Dispersed set of points: this option should guarantee a good solution over all
the datasets. It consists in:
● pick the first centroid at random
● pick the second centroid as far as possible from the first one
● pick the third as far as possible from the previous ones, and so on…
● iteratively: choose each centroid so that its minimum distance from the
previous ones is as large as possible

Convergence
The algorithm terminates when there are only small oscillations around a local
minimum which, most of the time, corresponds to a good clustering solution. However,
it can happen that a particular initialization (seed) results in a suboptimal clustering
solution. So, it is important to select good seeds, using heuristics or trying multiple
starting points.
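A minimal scikit-learn sketch of this procedure (the data and the range of k are invented for illustration): we run k-means for several values of k, record the total squared distance to the assigned centroids (sklearn's inertia), and look for the elbow.

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# two Gaussian blobs in the Euclidean plane
X = np.vstack([rng.normal(loc=(0, 0), scale=0.5, size=(100, 2)),
               rng.normal(loc=(4, 4), scale=0.5, size=(100, 2))])

for k in range(1, 6):
    # n_init re-runs k-means with different random seeds and keeps the best solution
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, km.inertia_)   # inertia = sum of squared distances to the closest centroid; look for the "elbow"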

8. Explain CNNs, why non-linearities are important?


CNN: the idea is to replace the original linear operator, the big W matrix that
fully connects the input nodes to the output nodes, with small filters: small windows
moving over the image, i.e. small templates that contain local information and move
over the image applying the convolution operator, which is nothing more than a
product between the values within the kernel and the corresponding portion of the input.

This allows us to reduce the number of connections. So, at this point we will not have
a huge matrix W, but just a small kernel that is repeated and moved around over the image.

The effect of this reduction in terms of parameters of course will become even more
clear when we combine all the layers together, because each convolutional layer
takes as input the feature map of the previous layer and since we have reduced
somehow the amount of parameters of the previous layers then the effect will be
extended over the whole network. This really helped to decrease the number of
parameters with which we need to deal.
There is a further element that allows us to reduce the amount of parameters of the
network, which is pooling. The pooling layers are generally introduced between
one convolutional layer and the next.

The effect of pooling is that of aggregating information, and one way to do that
is, for instance, by computing a norm (or a maximum).

The convolution operation is actually a linear combination of a filter matrix and
elements of an input matrix. In the context of a convolutional neural
network, you have a filter (which can be of different sizes) that subjects the
input matrix to this convolution operation.
If you do not have an activation function in your convolutional neural network,
you effectively have a linear activation function (y = x), and you end up
computing a simple linear combination of the inputs, regardless of how many
layers your convolutional (or other deep) network has. This is why non-linearities
are essential: without them, stacking layers adds no expressive power.
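A tiny PyTorch sketch (illustrative only, not from the notes) of why the non-linearity matters: without the ReLU, two stacked convolutions collapse into a single linear operation.

import torch
import torch.nn as nn

x = torch.randn(1, 3, 32, 32)            # a fake RGB image

linear_stack = nn.Sequential(             # two convolutions, no activation: still one linear map overall
    nn.Conv2d(3, 8, kernel_size=3, padding=1),
    nn.Conv2d(8, 8, kernel_size=3, padding=1),
)
nonlinear_stack = nn.Sequential(           # a ReLU in between makes the composition non-linear
    nn.Conv2d(3, 8, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.Conv2d(8, 8, kernel_size=3, padding=1),
)
print(linear_stack(x).shape, nonlinear_stack(x).shape)  # same shape, very different expressive power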

9. Difference between classification and regression tasks


Classification is the task of predicting a discrete class label.
Regression is the task of predicting a continuous quantity.

10. Logistic Regression: explain what it is and why it is called "regression" even though it is
a classifier
Logistic regression is a generalized linear model using the same basic linear
formula as linear regression, but it regresses the probability of a categorical
outcome (through the logistic/sigmoid function); the class label is then obtained by
thresholding this probability, which is why it is used as a classifier.
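A minimal sketch of the idea with scikit-learn (the data here are invented): the model outputs a probability via the sigmoid, and the class label is obtained by thresholding it.

import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[0.5], [1.0], [1.5], [3.0], [3.5], [4.0]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = LogisticRegression().fit(X, y)
print(clf.predict_proba([[2.0]]))   # regressed probability of each class
print(clf.predict([[2.0]]))         # class label obtained by thresholding the probability at 0.5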

11. About Deep Learning and Neural Networks, answer the following question (true or
false, and explain): can a Neural Network be sensitive to the exact
location of the subject in the image (upper-right corner, center, etc.)?
No. If we use convolutional layers, these extract low-level patterns independently of
their location in the image.

12. What is validation and why do we use it?

In learning procedures, we have our training set and we can learn a model over it.
There might be different variants of the classifier/model for our task, and by
using the validation set we choose one of them, i.e. we choose it by looking at the
predictions on the validation set. Once this is done, we take the model and apply it
to the test set. In some cases it is useful, once the model is trained and the
parameters are chosen, to put the training and the validation set together again,
just to give the model a little bit of extra samples.

13. What is k-means clustering and why is it guaranteed to converge after a certain
number of steps?
K-means minimizes an L2 objective (we use the Euclidean, or more generally a Minkowski,
distance, which is convex): both the assignment step and the centroid-update step can only
decrease this objective, and there is only a finite number of possible assignments of points
to clusters, so the algorithm is guaranteed to converge after a finite number of steps. The
solution it converges to also depends on the initialization of the centroids.

14. explain KNN and Perceptron and their differences


kNN: Let's say that we have a set of data, red dots and blue triangles, representing
two different classes. Then someone gives us the black square and asks which is the
correct class to assign to this sample.

In practice, we make a prediction just on the basis of the distance between the
square and the closest sample, by looking at the label of that closest sample. This
is how the nearest neighbor classifier works. To characterize the prediction we can
consider the possible pairs of training samples, the segment that connects them, and
the orthogonal line that divides this segment into two equal parts.

The resulting cell is known as the Voronoi cell (Figure A). This procedure can be repeated
over the whole training set to define what is called the Voronoi tessellation (Figure B). At
the end, what we get is a very precise separation curve between the blue triangles
and the red dots.

If, for instance, we have a mislabeled sample in our training set, then we need
to introduce an additional boundary to separate that single sample from the
other class. For this reason, very small mistakes can change a lot the way in which
we end up dividing the two classes.
One way to mitigate these mistakes due to outliers is, instead of considering just
the single closest sample to each test instance, to consider the set of the k most similar
training samples. So we pass from what is called one nearest neighbor to k nearest
neighbors.

How to choose k?
The value of k makes a substantial difference in the final prediction: when we
change k, we change the decision boundary.
A small k means that the method is more subject to outliers; it tends to be
more precise in the description of the boundary but also more sensitive to the
outliers. When, instead, we consider a larger k, the boundaries will be smoother
and the prediction will also be less sensitive to outliers.
Selecting the value of k consists in applying a cross-validation procedure.
This means that k is a parameter that we have to tune depending on the predictions
that we get on a validation set. We validate our choice and then take the k
value that gives us the best validation accuracy; this gives us a good indication of
how the method will perform in terms of generalization on new test samples.

Choosing an odd value of k unbalances the vote between the two classes in a binary
prediction, so ties are avoided.

Pros:
● kNN is a very simple classifier
● everything is based on distances and we make the hypothesis that nearby
regions of space have the same class. This is the only basic assumption on
which kNN is built.
● It is not a parametric approach. Unlike SVM or the perceptron, we are not taking
the data, building a model based on some parameters and then forgetting about
the data, relying only on the model. For kNN it is different: we need to keep the
data, because everything is based on the distance between our test samples and
the training samples. The positive aspect is that we do not have to infer any
parameters, except for the choice of k; we just need to keep the data, and if the
training samples change, for instance if we plan to add new categories to our
training set, we do not need to retrain our model.
● Moreover, kNN has a very good generalization guarantee: it can be
shown that when n goes to infinity, i.e. when the number of samples goes to
infinity, the kNN model has an error which is less than twice the optimal
Bayes error. So we also have a good generalization guarantee for this
model when we have a very large number of training samples.

Cons:
● The time complexity of the method can be quite high, depending on
the cost of computing each distance. If this is costly, and most often it is,
then we have to multiply this cost by the number of training examples,
because we have to find the closest sample to our test instance, so the
overall procedure is quite expensive.
● We do not have a training procedure; we only have a testing part. We just have
to take our test sample, calculate the distances and make a decision. So the
cost is exactly at test time, while in general a parametric model tends to be
more costly during training and then quite effective and fast at test time.
Moreover, the larger the dimensionality we are dealing with, the more
expensive the distance computation becomes, and this always scales linearly
with the number of samples. So this tends to be quite expensive.
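A minimal scikit-learn sketch of the classifier described above (toy data; the choice k = 3 is arbitrary and should in practice be cross-validated):

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

X_train = np.array([[0, 0], [0, 1], [1, 0],      # class 0 ("red dots")
                    [4, 4], [4, 5], [5, 4]])     # class 1 ("blue triangles")
y_train = np.array([0, 0, 0, 1, 1, 1])

knn = KNeighborsClassifier(n_neighbors=3)        # k = 3, odd to avoid ties
knn.fit(X_train, y_train)                        # "training" just stores the data
print(knn.predict([[1, 1], [5, 5]]))             # predicted labels for the "black squares"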

Perceptron:

The decision function is f(x) = σ(⟨w, x⟩ + b), where w are our synaptic weights, x our
inputs, b the bias (in order to be invariant to translations) and σ a threshold function.
We want a decision function that is non-linear, and that is why we put a threshold and
a bias. The function σ is any non-linear function that makes sure that f(x) is in
(0, 1) or (-1, +1). When we have a decision function like this, what we are
really building is a classifier that finds a linear separating hyperplane.

In practice, what we learn with a perceptron is an estimate of the parameters w and b.
Once we know those, we have our classifier. What we learn corresponds to just
one of the lines depicted in the figure on the left; the reason for showing all of these
lines is that all of them are possible and legitimate solutions.

How do we build algorithms able to find these lines?

When the prediction that we make is different from the true label y_i, the w
that we have in memory does not return the right prediction, so we need to update it.
We take the current w and add y_i x_i to it; the same is done for the b value (we add y_i).
The idea is that, if we substitute the updated w and b, the product between the
decision function and the true label moves towards being positive.

How does this algorithm work?

We take our training data, and we have to find the equation of a plane in an n-dimensional
space: we want to find the vector that defines the slope (w) and the constant that decides
where we intersect the axis (b), because that is what appears in the equation of a plane or
of a hyperplane, such that it separates correctly all our training data; where "correctly"
means that all the samples with y_i = +1 are on one side of the hyperplane while all the
others are on the other side.
w is a linear combination of all the mistakes that we make.

What really matters in a perceptron are the mistakes: when something good happens
we do nothing, when something bad happens we store something. In practice,
we store the errors. This also means that if you give me 100,000 data points and my
perceptron makes mistakes on 10 of them, out of 100,000 points all I need to store is
10. We can see these terms as the memory of the algorithm, which is what we need
to remember about the training data. This also means that we can compress the
data.
It is not a good idea to use a perceptron with data that are noisy.
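A minimal NumPy sketch of the update rule described above (toy, linearly separable data; labels assumed to be in {-1, +1}):

import numpy as np

X = np.array([[2.0, 1.0], [1.5, 2.0], [-1.0, -1.5], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])

w = np.zeros(X.shape[1])
b = 0.0
for epoch in range(10):
    for xi, yi in zip(X, y):
        if yi * (np.dot(w, xi) + b) <= 0:   # mistake: the prediction disagrees with yi
            w += yi * xi                    # store the error: w is a sum of the mistaken samples
            b += yi
print(w, b)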

15. Are CNNs translation invariant?

Yes, because of the CNN structure: the same filters move around the image, looking
for the object wherever it is. A CNN can learn to recognize an object in an image no
matter how the object is translated (shifted horizontally and/or vertically), even if the
training set only includes that object in one position.

16. Are CNNs rotation invariant?
No, they are only translation invariant; to handle rotations we use Data Augmentation.
A CNN cannot, by itself, learn to recognize an object no matter how it is rotated (in the
image plane) if the training set only includes the object in one orientation.

17. Between the 4 options choose the best one

C was the best option because it optimized the hyperparameters on the validation set
and then actually evaluated the model on the test set.

18. Question about deep learning: how to find the best parameters?
Grid Search, Cross Validation

Grid Search: The traditional way of performing hyperparameter optimization is grid
search, which is an exhaustive search through a manually specified subset of
the hyperparameter space of a learning algorithm. A grid search algorithm must be
guided by some performance metric, typically measured by cross-validation.
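An illustrative scikit-learn sketch (the dataset, the model and the parameter grid are arbitrary example choices): grid search over SVM hyperparameters, with each configuration scored by cross-validation.

from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
param_grid = {"C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1]}   # manually specified subset of the space

search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)  # 5-fold cross-validation as the metric
search.fit(X, y)
print(search.best_params_, search.best_score_)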

19. (backpropagation) & CNN architecture


Backpropagation is a computational technique that we use for our optimization
problem. Even if we often say that we apply backpropagation through the network,
the truth is that we are applying backpropagation through the computational graph of
the loss. So, we take our loss function, decompose it to obtain a computational graph
of simple sub-functions, and then differentiate over it (applying the chain rule from the
loss back to the parameters).

20. Layer freezing: what is it and when it’s applied


(frozen layers have LR=0)
It is used for transfer learning, when we have a network trained on another
dataset (usually big) and we want to exploit it for e.g. classification on another
dataset (usually smaller). We freeze all the layers but the classifier head (FC
layers)
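A minimal PyTorch sketch of the idea (a torchvision ResNet is used purely as an example backbone; API details may differ across versions): all layers are frozen by setting requires_grad to False, and only a new classifier head is left trainable.

import torch.nn as nn
from torchvision import models

model = models.resnet18(weights="DEFAULT")       # pre-trained on a big dataset (ImageNet)

for param in model.parameters():                 # freeze everything: no gradients, weights untouched
    param.requires_grad = False

model.fc = nn.Linear(model.fc.in_features, 10)   # new classifier head for the smaller dataset (10 classes)
# only model.fc.parameters() will be updated during fine-tuning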
21. shallow learning: what is K-Means clustering & its convergence
22. Kernel trick
If we live in a space where the data we are given are not linearly separable, we can
always think of mapping them into a higher-dimensional space with a non-linear mapping
(changing the metric too), and the hope is that, in the space where we now
represent the data, our problem becomes linearly separable.

If we know how to define non-linear scalar products (kernels), we can make non-linear
any algorithm that only uses scalar products.

We can always build a mapping ϕ(x) for any polynomial of order d via the kernel 〈x,x'〉^d,
even though we don't define ϕ explicitly.
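A small NumPy check of this fact for d = 2 in two dimensions (a sketch, not from the notes): the kernel 〈x,x'〉^2 equals the ordinary scalar product between the explicit degree-2 feature maps ϕ(x) = (x1², √2·x1·x2, x2²), which we never need in practice.

import numpy as np

def phi(x):
    # explicit degree-2 polynomial feature map for 2D inputs
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

print(np.dot(x, z) ** 2)          # kernel value <x, z>^2, computed in the input space
print(np.dot(phi(x), phi(z)))     # same value, computed via the explicit mapping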

23. QUIZ
a. One colleague reduces the number of feature maps in the first layer of
the network (convolutional layer);
b. another colleague reduces the number of feature maps in the last layer of
the network (fully connected layer).
What is the best choice?
Acting on the fully connected layer, since that is where most of the parameters are.

The fully-connected layer is such that each element of the input connects with each
element of the output, and so contributes to each element of the output.

We have a huge amount of weights here: the edges are fully connected, and we need
a weight for each of these edges. Most probably we do not strictly need all these
connections; we might be able to reduce the number of connections and possibly
share weights.

The possibility of using a convolutional layer rather than only fully connected layers
comes from the nature of the data themselves, for instance the fact that a local
structure inside an image is repeated. So, instead of having a very large W matrix
containing a huge amount of weights, we can take a smaller matrix W that is repeated
multiple times. This is exactly the idea of how to pass from a fully connected layer to
a convolutional layer using matrices which are called filters and which are small with
respect to the huge W matrix.

24. linear regression vs ridge regression


Linear Regression:

In Linear Regression we have to minimize the mean squared error MSE
between the predicted y_hat and the ground truth y.

Ridge Regression:
In Ridge Regression we have to minimize the MSE, like in Linear Regression,
but in this case the optimization objective also includes an L2-norm penalty
on the weights. This leads to weight shrinkage and has a regularization effect.

EXTRA (i)
Lasso Regression is similar to Ridge Regression but applies the L1 norm. In this
case, besides weight shrinkage, we also get feature selection, since the L1
regularization sets some weights exactly to 0.
EXTRA (ii)
ElasticNet is the combination of the L1 and L2 weight-norm penalties.
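A short scikit-learn sketch of the variants just described (the alpha values play the role of λ and are arbitrary here; the data are synthetic):

import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.array([1.0, 0.0, 0.0, 2.0, 0.0]) + 0.1 * rng.normal(size=100)

print(LinearRegression().fit(X, y).coef_)          # plain MSE minimization
print(Ridge(alpha=1.0).fit(X, y).coef_)            # L2 penalty: weights shrink towards zero
print(Lasso(alpha=0.1).fit(X, y).coef_)            # L1 penalty: some weights become exactly zero
print(ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y).coef_)  # combination of L1 and L2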

25. Given an input img 32x32x3 and a filter 5x5 with stride 1. What is the
dimensionality of the output of this convolutional filter?
(32 - 5)/1 + 1 = 28 => 28x28
Formula => (N-F)/Stride + 1
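This can be checked with a quick PyTorch sketch (one 5x5 filter, stride 1, no padding):

import torch
import torch.nn as nn

x = torch.randn(1, 3, 32, 32)                                   # one 32x32 RGB image
conv = nn.Conv2d(in_channels=3, out_channels=1, kernel_size=5, stride=1)
print(conv(x).shape)   # torch.Size([1, 1, 28, 28]): (32 - 5)/1 + 1 = 28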

26. PCA how it works & some of its applications (Denoising)


27. LR very low: what we can observe; LR very high: what happens?
LR Very Low => Slow convergence
LR Very High => Rapid decrease of the loss and the overshooting of the
minimum

28. SVM: linear classifier, what happens when we remove one of the point very
close to decision boundaries? Also what happens if non-linear classification.
If it is a linear classifier: a point very close to the decision boundary is typically a support
vector, so removing it changes the solution (the margin can widen and the hyperplane can
move); removing a point far from the margin, instead, leaves the solution unchanged.

If the classification is non-linear (kernel SVM): the same reasoning applies in the feature
space induced by the kernel; removing a support vector changes the decision boundary,
removing a non-support vector does not.


29. Batch normalization & its use for domain adaptation
30. Difference between KNN and KMeans
K-Nearest Neighbor is a CLASSIFICATION algorithm, while K-Means is a
CLUSTERING algorithm.

31. K fold cross validation & Leave one out cross validation
In some cases the division into a training and a validation set might not be a good
choice, because with this separation we are reducing the amount of data on which we
can really train the model.

One thing that we can do is to divide the training data into multiple
equal parts and use only one of the parts as validation samples while the remaining
parts form the real training set. So instead of having a fixed validation set, we have
multiple validation sets, because in turn we can choose each of these subsets as the
validation part of the training set. The approach is called k-fold because we are
dividing the set of data into k parts, where only one is used as validation while the
remaining k - 1 work as training. This is done in turn for all the k folds and then, to
choose the model, we take the average of the outputs of these k predictions. In
practice, we can decide what is the best model by taking exactly the average of the
output over all the folds (green part of the previous figure).

In the case in which k is exactly equal to the cardinality S of our training sample,
we have what is called Leave-One-Out Cross-Validation. In practice, every
single sample is in turn kept out as the validation set, while the remaining S - 1 samples
are used to train the model. The Leave-One-Out error is an unbiased estimator of the
generalization error, and this is important because all we are interested in is the ability
of our model to be a good predictor on future samples: we want our model to generalize
well. By using the Leave-One-Out cross-validation procedure we get a hint of how well
our model generalizes.
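An illustrative scikit-learn sketch (the classifier and dataset are arbitrary example choices): k-fold cross-validation with k = 5, and Leave-One-Out as the special case where k equals the number of samples.

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, LeaveOneOut, cross_val_score

X, y = load_iris(return_X_y=True)
clf = LogisticRegression(max_iter=1000)

kfold_scores = cross_val_score(clf, X, y, cv=KFold(n_splits=5, shuffle=True, random_state=0))
print(kfold_scores.mean())                                  # average over the 5 validation folds

loo_scores = cross_val_score(clf, X, y, cv=LeaveOneOut())   # one fold per sample
print(loo_scores.mean())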

32. Does the number of parameters of a convolutional layer depend on the stride we apply?

No: the parameters are the filter weights (and biases), so their number depends on the
filter size, the number of input channels and the number of filters, not on the stride. The
stride only affects the spatial size of the output.

33. perceptron vs svm


The main differences between SVM and the perceptron algorithm are that SVM is trained
in batch (it solves an optimization problem over the whole training set, while the perceptron
is an online, mistake-driven algorithm) and that SVM includes a large-margin condition.

34. Do you think that adding more layers in a CNN would help to prevent
overfitting?
No: adding layers adds parameters, hence more complexity, so if anything it increases the risk of overfitting.

35. Why do we initialize weights in a nn?


To avoid vanishing gradients and a flat loss at the beginning of training: a careful initialization keeps activations and gradients in a reasonable range so that learning can actually start.

36. Describe the learning process for a NN

hint: justify the previous answer
37. If I put one new sample for SVM what happens?
38. logistic regression
a. what is it?
b. if I give you a new sample, completely different and far from the others,
how does it influence the classifier?
The classifier changes because of that (in logistic regression every sample
contributes to the loss).
c. what about if I had an SVM? What would happen?
The classifier in that case does not change, because a point far from the margin
is not a support vector.
39. tell me what’s going on in these three situations

Solution:
Model A: high learning rate
Model B: overfitting
Model C: underfitting / low learning rate

40. Classification vs Regression difference


Regression algorithms are used to predict continuous values such as price,
salary, age, etc., while classification algorithms are used to predict/classify
discrete values such as Male or Female, True or False, Spam or Not Spam, etc.

41. What is Ridge Regression? Talk about it as a regularization method


42. AutoEncoders (they used MSE loss) and VAE
An autoencoder is an unsupervised artificial neural network that learns how to
efficiently compress and encode data, and then learns how to reconstruct the data back
from the reduced encoded representation to a representation that is as close as possible
to the original input.
An autoencoder, by design, reduces data dimensions by learning how to ignore the
noise in the data.
● autoencoders learn a “compressed representation” of input (could be
image,text sequence etc.) automatically by first compressing the input
(encoder) and decompressing it back (decoder) to match the original
input. The learning is aided by using distance function that quantifies
the information loss that occurs from the lossy compression. So
learning in an autoencoder is a form of unsupervised learning (or self-
supervised as some refer to it) - there is no labeled data.
● Instead of just learning a function representing the data ( a compressed
representation) like autoencoders, variational autoencoders learn the
parameters of a probability distribution representing the data. Since it
learns to model the data, we can sample from the distribution and
generate new input data samples. So it is a generative model like, for
instance, GANs.
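A minimal PyTorch sketch of an autoencoder trained with an MSE reconstruction loss (dimensions and data are invented for illustration):

import torch
import torch.nn as nn

autoencoder = nn.Sequential(
    nn.Linear(784, 64), nn.ReLU(),   # encoder: compress a 784-dim input into a 64-dim code
    nn.Linear(64, 784),              # decoder: reconstruct the input from the code
)
criterion = nn.MSELoss()
optimizer = torch.optim.Adam(autoencoder.parameters(), lr=1e-3)

x = torch.rand(32, 784)              # a fake batch of flattened images; no labels are needed
for _ in range(5):
    recon = autoencoder(x)
    loss = criterion(recon, x)       # self-supervised target: the input itself
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
print(loss.item())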

43. PCA vs Autoencoder


PCA essentially learns a linear transformation that projects the data into another
space, where vectors of projections are defined by variance of the data. By restricting
the dimensionality to a certain number of components that account for most of the
variance of the data set, we can achieve dimensionality reduction.

An autoencoder is a neural network that can be used to reduce the data into a low-dimensional
latent space by stacking multiple non-linear transformations (layers).
It has an encoder-decoder architecture: the encoder maps the input to the latent
space and the decoder reconstructs the input.

Comparison
1. PCA is essentially a linear transformation, while autoencoders are capable of
modelling complex non-linear functions.
2. PCA features are totally linearly uncorrelated with each other since features
are projections onto the orthogonal basis. But autoencoded features might
have correlations since they are just trained for accurate reconstruction.
3. PCA is faster and computationally cheaper than autoencoders.
4. A single layered autoencoder with a linear activation function is very similar to
PCA.
5. Autoencoder is prone to overfitting due to high number of parameters.
(though regularization and careful design can avoid this)

44. Can autoencoders be used as input to a CNN?


It depends on whether the learning is supervised or unsupervised.

45. Curse of dimensionality, kNN tricks


There are some characteristics of kNN that make it less than ideal, and the problem lies
in the space itself and in the definition of distance. The problem is known as the
curse of dimensionality, because it is related to the fact that we encounter
more and more problems as the dimension of the space we are dealing
with grows. In a larger space, distance is no longer a very meaningful criterion for
assigning labels.

In kNN we don't really have a training procedure, we only have a testing part. We
just have to take our test sample, calculate the distances and make a decision. So the
cost is exactly at test time, while in general a parametric model tends to be more
costly during training and then quite effective and fast during testing. For this
reason the problem of kNN is that the expensive part is exactly at test time, and this
is most often a problem, because it means that we cannot quickly apply the model at
test time: we always have to calculate the distances to be able to make a decision.
There are some tricks to make kNN faster. In some cases we can make the procedure
a bit cheaper if, for instance, we have some knowledge about the dimensionality
of our space and we know that, out of all the d dimensions, only a subset of r of them
is really important. In that case we can reduce the distance computation to those r
elements rather than the whole set of d elements.
Another possibility is that we know some structure of our training data, so we might
be able to organize the training data into trees, for instance, or maybe we know that
some samples of a certain class are more representative than others, so they behave
like prototypes, and we might refer only to those rather than computing all the distances
to the other examples. But in general having this knowledge is not simple, because
building the tree or extracting the knowledge about prototypes is itself costly.
One other thing that we can consider is editing or pruning the training samples. For
instance, if we know that there is a sample that happens to be in a region where all
the closest samples belong to the same class, then we might decide to remove
this sample. The result will be that the Voronoi cell of the closest sample will enlarge
a little bit. So we are editing, we are pruning, we are removing some information, but
it can still work well. We cannot have any guarantee on which is the correct number
of elements to remove, and we might end up having some problems if, for instance, our
training sample changes and we want to add new categories.
There are also many practical tools to make kNN faster, for instance FLANN, a
fast library for approximate nearest neighbors, or ANN.

46. Multi-task learning? Example of models (remember classification and segmentation);
segmentation is regression because we need to find 4 points for each segment.

47. Perceptron: what happens if we shuffle the dataset?

The resulting separating hyperplane may be different, but convergence is still guaranteed
if the data are linearly separable.
48. What is the batch size and why do we set it in a neural network?
The batch size is a hyperparameter of gradient descent that controls the number of
training samples to work through before the model's internal parameters are updated.
a. What do we pass to the network during every epoch?
The entire dataset (split into mini-batches).
49. How to pass from a linear (binary) classifier to a multi-class classifier in SVM? Discuss
the C parameter in SVM: how does it change the decision boundaries?
The C parameter tells the SVM optimization how much you want to avoid
misclassifying each training example. For large values of C, the optimization will
choose a smaller-margin hyperplane if that hyperplane does a better job of
getting all the training points classified correctly. Conversely, a very small value
of C will cause the optimizer to look for a larger-margin separating hyperplane,
even if that hyperplane misclassifies more points. For very tiny values of C, you
should get misclassified examples, often even if your training data is linearly
separable.
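A quick scikit-learn sketch of the effect of C (toy data with one noisy point; the values of C are arbitrary): a small C tolerates the misclassification and keeps a wide margin, while a large C tries hard to classify every training point.

import numpy as np
from sklearn.svm import SVC

# linearly separable data plus one noisy point inside the other class
X = np.array([[0, 0], [0, 1], [1, 0], [4, 4], [4, 5], [5, 4], [0.5, 0.5]])
y = np.array([0, 0, 0, 1, 1, 1, 1])   # the last point is labelled 1 but lies among the 0s

for C in (0.01, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    print(C, clf.n_support_, clf.score(X, y))   # support vectors per class and training accuracy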

50. structure of a neural network (answer: fully connected and convolutional) and
(broadly) how they work.
a. Which one has more parameters (FC).
51. Ranking SVM: what does it try to predict?
The order between data points: more specifically, +1 if x should be ranked above y,
-1 if x should be ranked below y.
52. description of GAN, difference between GAN an VAE
53. difference between Logistic Regression and Linear Regression
First of all
- Linear Regression => it is a regression algorithm
- Logistic Regression => it is a (binary) classification algorithm
Linear regression's objective is to fit a model that minimizes the error in
predicting a continuous value for each sample under test, by fitting a line
(hyperplane) in the feature space.
Logistic regression's objective, instead, consists in assigning a binary class
label {0,1} or {-1,+1} to each of the samples under test. It is based on the
sigmoid function applied to z = ⟨w, x⟩ + b: x is assigned to class 1 if σ(z) > 0.5
(equivalently, if z > 0) and to class 0 otherwise.

54. How to solve Logistic Regression


The loss is convex, but the equation ∇Loss = 0 does not have an analytical
(closed-form) solution => we solve logistic regression through iterative non-linear
optimization (e.g. gradient descent).

55. RNN
HIGH-LEVEL DESCRIPTION
Recurrent Neural Networks are a special class of NN characterized by internal
self-connections.
As an RNN processes sequential information, it performs the same operation
on every element of the input sequence. Its output, at each time step, depends
on the previous input and past computations. This allows the network to
develop a memory of previous events, which is encoded in its hidden state
variables.
BACKPROPAGATION THROUGH TIME
Gradient-based learning requires a closed-form relation between the model parameters
and the loss function.
In order to find a direct relation between the loss function and the network
weights, the RNN has to be represented as a DAG (directed acyclic graph). This
procedure is called unfolding and consists of replicating the network's hidden-layer
structure for each time interval, obtaining a particular kind of feed-forward
neural network.

TRAINING

The training of an RNN does not differ from the one employed for standard
NNs (e.g. SGD with momentum, weight decay, dropout, regularization terms...).

VANISHING AND EXPLODING GRADIENT


Early RNNs suffer from these two problems. How to solve them?
Exploding G. => clip the norm of the gradient if it is above a certain threshold
Vanishing G. =>
- use of ReLU
- careful weight initialization
- dropout
- GRU, LSTM
-----------------------------> GRU & LSTM on the "RNN
Summary" on the Telegram group

56. Difference between shallow learning and deep learning


One of the main differences between these two worlds regards the FEATURE
representation. In the shallow learning domain we have to perform a feature
engineering/learning process before feeding the features appropriately into the
model, while deep learning models are able to learn an appropriate
feature representation by themselves.

57. Deep Learning architecture components


If we think of a "standard" DNN, like e.g. AlexNet, it is composed in the
following way.
At the beginning we have some convolutional layers, interleaved with pooling
layers. The former extract feature maps from the images (going from low
level to high level as we go deeper in the architecture), while the latter,
besides helping this process, reduce dimensionality.
The second half of the network is composed of a stack of fully connected
layers. Dropout can be added.
At the end of the network there is a linear layer with as many outputs as the
number of classes, which produces probabilities for each of the classes through
a softmax operation.

58. FC layer vs Conv. layer


In an FC layer there is a dense connection between two layers (each neuron of
layer i is connected to every neuron of layer i+1), and all edges have different
weights (high number of parameters).
In Conv. layers the outgoing edges have the same weights for each input
variable (weight sharing). Each neuron is connected only to the adjacent
neurons in the subsequent layer (its neighborhood): fewer parameters to train.

59. Kernel trick for SVM. What it is and why we use it?
we use it when we have non-linear problems.

60. What can be done if data to separate are not linear using SVM?
Apply the kernel trick for SVM

61. Is SVM a binary or a multiclass classifier?
In its most simple form, SVM doesn't support multiclass classification natively. It
supports binary classification, separating data points into two classes.

62. How do we go from binary to multiclass?

For multiclass classification, the same principle is used after breaking
down the multiclass problem into multiple binary classification
problems.
The idea is to map data points to a high-dimensional space to gain mutual
linear separation between every two classes. This is called the One-vs-One
approach, which breaks down the multiclass problem into multiple binary
classification problems: one binary classifier per each pair of classes.
Another approach one can use is One-vs-Rest. In that approach, the
breakdown is set to one binary classifier per class.
63. For which kind of model does sparsity become an issue?
For models based on distances, like clustering algorithms.

64. Does increasing the number of layers in a network help us reduce overfitting?

No, it worsens the situation, because it adds more parameters, hence more
complexity.

65. How to reduce the number of parameters in a network?
a. Act on an FC layer or on a Conv layer?
Act on the FC layers, since they have many more connections (hence weights) to be
trained, i.e. more parameters. The way to go is using dropout.

66. Self-supervised learning


The problem with current deep learning models is that they are extremely effective
as long as we have a large amount of annotated data, and annotating data can be
extremely costly (e.g. manual annotation is extremely expensive).
It is good to find a way to also exploit unsupervised data especially because there are
some situations where obtaining annotated data is not only costly but it is also
practically impossible (e.g.: medical data).
We need to find a way, for instance, to use all this large amount of unlabeled data
which might be provided to us for instance from online resources. And this is what
self-supervised learning does, it tries to extract structural knowledge from the data
independently from the labels, so it does not need a human labelling, and then it finds
a way to reuse this information.
What we need to do, in practice, is to start just from the data, not annotated, and we
remove part of the inherent information of the data and then ask a network to predict
it back. The idea is that, by removing these typical characteristics of the
data, we are somehow forcing the network to find relevant information that, at the end
of the day, is also useful in terms of semantics. This provides us with cues about
the structure of the world and how objects are defined in terms of shape and part
relations, and this kind of information can then be transferred to a second
task.

67. Gradient descent vs stochastic gradient descent
Gradient descent is a first-order iterative algorithm:
1) Initialize a weight vector θ
2) Iteratively compute θ(t+1) = θ(t) − α ∇Loss(θ(t))
3) Stop when a minimum is reached
Its computational bottlenecks are the number of samples w.r.t. which the partial
derivatives are computed and the number of parameters to be updated at each step.
STOCHASTIC Gradient Descent: compute ∇Loss on a small representative
subsample (mini-batch) of m << n samples
- the mini-batch is drawn uniformly
- the true gradient is only approximated, but there is a significant speedup
REMARK
- SGD does not stop exactly at the minimum, due to the noise induced by the random
sampling
- SGD generalizes better than standard GD because it does not depend on the
number of samples
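A minimal NumPy sketch of mini-batch SGD for linear regression (sizes, learning rate and batch size are arbitrary): at each step the gradient is computed on a small random subsample instead of the full dataset.

import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 5
X = rng.normal(size=(n, d))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + 0.1 * rng.normal(size=n)

theta = np.zeros(d)
alpha, m = 0.1, 32                              # learning rate and mini-batch size (m << n)
for step in range(500):
    idx = rng.choice(n, size=m, replace=False)  # mini-batch drawn uniformly
    Xb, yb = X[idx], y[idx]
    grad = 2 * Xb.T @ (Xb @ theta - yb) / m     # gradient of the MSE on the mini-batch only
    theta = theta - alpha * grad                # θ(t+1) = θ(t) − α ∇Loss
print(theta)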

68. Learning rate


The learning rate is a scalar coefficient that, multiplied by the norm of the
gradient, gives the amplitude of the step of the gradient descent.
It is important to carefully choose this value, and it is typically done by a line
search (or by validation).
The higher the learning rate, the larger the update steps, i.e. the faster the model
moves in parameter space.
Indeed, if we want, as an extreme example, to freeze a layer => LR = 0
Too low LR => slow convergence
Too high LR => overshooting the minimum (never reached)

69. Universal approximation theorem


The Universal Approximation Theorem, in its loose form, states that a feed-forward
network with a single hidden layer containing a finite number of neurons can
approximate any continuous function.
By "approximate" we mean that, by using enough hidden neurons, we can always find a
neural network whose output g(x) satisfies |g(x) − f(x)| < ϵ for all inputs x. In other
words, the approximation will be good to within the desired accuracy for every
possible input.
If a function is discontinuous, i.e. it makes sudden, sharp jumps, then it won't, in
general, be possible to approximate it using a neural net.

70. Padding and zero padding in conv layer


The choice of how many pixels we move from one application of the filter to the
next is called the stride.

In this example we cannot apply a 3x3 filter on a 7x7 input with stride 3, because the
filter would exit from the image. If N is the number of components, vertically and
horizontally, of our input, F is the dimension of the filter and we know the stride,
then to get the output size we just need to apply the following rule:
(N - F)/stride + 1
Let's apply this rule to the previous examples with N = 7, F = 3:
Stride 1 => (7 - 3)/1 + 1 = 5
Stride 2 => (7 - 3)/2 + 1 = 3
Stride 3 => (7 - 3)/3 + 1 = 2.33
When we try stride 3, we see immediately that the ratio does not give us an
integer number and, for this reason, we don't have a good fit. In these cases, one
thing that we can do is to include zero padding on the border.

This might introduce a bit of noise, because we are arbitrarily choosing a
value to extend the input map, but it does not change the results too much. It is
applied because it solves the problem of fitting the filter on the image and, with the
right amount of padding, it allows us to maintain the original spatial dimension of the
input: the output can have the same size as the input, so we do not lose too much
information when passing from one convolutional layer to the next. With padding P
the output size becomes (N - F + 2P)/stride + 1.

71. Stride
It is the number of pixels by which we slide the filter over the image between two
subsequent applications of the convolution.

72. Max pooling layer, why useful?


Pooling is a layer that is interposed between multiple convolutional layers; it can be,
for example, a norm (such as the max) that reduces the amount of numbers we have to deal with.

The idea is that we have our filter, W, which is what we need to learn: the values
inside this filter are the numbers that the network has to learn during the training
process. Let's say that we have learned them, so we found these values. Then, the
filter passes over the image (input data) and we get another set of values (the feature
map). To calculate the feature map, we just take the filter, move it over
the input data and calculate these separate values.
Once we have our feature map, we apply max pooling.

Now, for instance, we can divide the image into parts and search for the highest
value in each of these regions. This is exactly the effect of max pooling. It is
important to underline that max pooling, besides reducing the amount of numbers we
need to deal with, has the effect of connecting regions of the input data that were
originally far away from each other. By picking just a subpart of the image, it
allows us in practice to extend the connection between local regions, putting close
together information that was possibly far apart in the original input data. So, on one
side it gives us the effect of sparsification, because, as we saw, we reduce the
amount of samples, and at the same time it has the effect of zooming out the image
itself. This means that, in the following layer, the information that will be exploited will
be at a higher scale level.
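A small PyTorch sketch of this effect (sizes are illustrative): a 2x2 max pooling with stride 2 halves the spatial resolution of the feature map, keeping only the strongest activation in each region.

import torch
import torch.nn as nn

feature_map = torch.randn(1, 8, 28, 28)       # e.g. the output of a convolutional layer
pool = nn.MaxPool2d(kernel_size=2, stride=2)
pooled = pool(feature_map)
print(feature_map.shape, "->", pooled.shape)  # (1, 8, 28, 28) -> (1, 8, 14, 14)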

73. High level / Low level features, which layers capture which?
Early layers => Low level features
Deep layers => High level features

74. knn is linear or non-linear classifier?


Non-linear

75. knn and images


The reason why we can use kNN for images is that, although the curse of
dimensionality is still there, the data we are dealing with have an inner
structure. So, although they are described by a very high-dimensional vector, the data do
not occupy the whole high-dimensional space: they occupy just a subpart of it, which can
be, for instance, a very simple hyperplane or a more complex manifold surface.
A manifold, in practice, is just a particular surface on which the Euclidean
distance (Euclidean metric) holds locally. So the Euclidean metric does not hold globally,
but it holds locally, which means that we can still apply kNN.

76. Dropout

The idea here is that, in each forward pass, we randomly set some neurons to zero.
So at each forward pass we have a different set of nodes that are activated or turned
off, and we have a parameter, generally set to 0.5, which is the probability with which
a node is turned on or off. The dropout solution is generally applied to
fully-connected layers, because they are the layers with the largest
number of parameters; we want to reduce them and in this way dropout is really
beneficial. It can also be applied in a convolutional layer, but in that case, instead of
dropping random elements, one thing that we can do is to drop entire
feature maps, i.e. entire channels rather than random elements. But, as we said, in
most cases dropout is applied to the fully connected layers.
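A small PyTorch sketch (probability 0.5, as in the notes): dropout is active in training mode and automatically disabled at evaluation time.

import torch
import torch.nn as nn

fc = nn.Sequential(
    nn.Linear(128, 64), nn.ReLU(),
    nn.Dropout(p=0.5),            # each unit is zeroed with probability 0.5 during training
    nn.Linear(64, 10),
)
x = torch.randn(4, 128)

fc.train()                        # dropout active: different units dropped at every forward pass
print(fc(x)[0, :3])
fc.eval()                         # dropout disabled: deterministic output at test time
print(fc(x)[0, :3])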

77. difference between deep model and shallow model


[Duplicated]
78. using max pooling and stride might have the same effect?
They both reduce dimensionality of the feature map, but with strided
convolution we are not assured to extract a value equal to the maximum of the
convolved area, since we extract the convolution product of such area with the
filter
79. unsupervised deep learning models (GAN, AutoEncoders and Variational
AutoEncoders)
They are generative models whose goal is to generate new samples from the
training distribution, i.e. new samples having the same probability distribution as the
training data.
APPLICATIONS
- super resolution
- image colorization and inpainting

GANs

They do not work with an explicit density function, but rather sample from a
simple distribution (e.g. random noise) and learn how to transform it into the
training distribution.
The training consists of a two-player game between two actors called the
generator and the discriminator. The generator generates new (fake) samples
from noise and the discriminator has to distinguish fake from actual samples.
The training is a min-max optimization: we alternate a gradient step on the
discriminator (to maximize its ability to tell real samples from fake ones) and a
gradient step on the generator (to maximize the likelihood of the discriminator
being wrong, i.e. to produce high-quality samples).

AUTOENCODERS
Unsupervised approach for learning a lower dimensional feature
representation from unlabeled training data, passing through a lower
dimensional latent space.
After training the decoder can be thrown away, and the encoder can be
fine-tuned together with a classifier and later used to initialize a supervised
task.

VARIATIONAL AUTOENCODERS
The difference w.r.t. autoencoders is that their latent space is continuous,
allowing easy random sampling. Sampling introduce stochasticity, so the
“decoded” image is always different from the encoded one.

80. difference at high level between GAN and AutoEncoders


GANs work with an implicit density function, while (Variational) AutoEncoders deal with an
approximation of the explicit density function.
From the conceptual point of view, a GAN is a two-player game solved by an
alternated min-max optimization, where images are generated starting from a simple
distribution like noise, while a Variational AutoEncoder involves a
(probabilistic) encoder/decoder architecture and random sampling from the
continuous distribution of the latent feature space.
