ML Questions 2021
Solution A. When a predictor has an excellent performance on the training set but its
performance in the real world is very poor, then we are overfitting.
If the training sample is large enough, we can further divide it into the
real training part and the validation part. The validation part has a role similar to
the test set, but it is part of the original training sample, so we have access to it
during training.
Besides cross validation, there is another important aspect that we have to take into
consideration to avoid overfitting: the complexity of the function that we are
learning. So when we learn something, a good thing is to try to control complexity by
penalizing complex models; this procedure is called regularization.
Basically, when we learn something, in practice we are minimizing the Empirical
Risk, but we have to be careful and avoid complex solutions. Applying regularization
means including in the minimization a term that measures the complexity of the
function h, weighted by λ. The lambda parameter is a positive number that is used as
a conversion rate between the loss and the hypothesis complexity.
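As a sketch, the regularized objective described above can be written as follows. The symbols are assumptions introduced here for illustration: n is the number of training samples, ℓ the loss, Ω(h) the complexity measure of h, and H the hypothesis class; only h and λ come from the notes.

```latex
\hat{h} = \arg\min_{h \in \mathcal{H}} \;
  \underbrace{\frac{1}{n}\sum_{i=1}^{n} \ell\big(h(x_i), y_i\big)}_{\text{empirical risk}}
  \; + \; \lambda \, \Omega(h), \qquad \lambda > 0
```

With λ close to 0 we go back to plain empirical risk minimization; a larger λ pushes towards simpler functions.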
3. Types of SVMs used for other purposes besides linear classification (pairwise
ranking SVM, outlier detection SVMs)
Two variants of the SVM, Support Vector Data Description (SVDD) and One-Class
SVM (OCSVM), are two strategies to identify the outliers in a distribution of
data.
SVDD: An intuitive way to detect outliers consists in estimating the density of the
data points in the feature space. The intuition is that samples in regions of high
density are inliers, while samples in low density regions are outliers. In this SVM
variation the objective is to find the minimal sphere enclosing the data in
the feature space, i.e. the one with the smallest radius.
OCSVM: One-Class SVM is nothing more than the standard SVM with just one class.
It corresponds exactly to an SVM where we want to maximize the margin of the
positive samples from the origin.
Actually there is a case in which the SVDD and the OCSVM are equivalent. This
happens when the data points are mapped from the input space to a feature space in
which all the points have the same distance from the origin and can be
separated linearly from the origin.
Pairwise Ranking SVM: The supervision in the Pairwise Ranking SVM is composed
of pairs of images, and the label of each pair is the rank of one image with respect to
the other. These pairs define a set of constraints. For each attribute
it's possible to learn a scoring function that, given the image x and a set of
parameters, returns the rank of that image. The scoring function can be learned by
finding the parameters that best satisfy the constraints built from the pairs in the
training set. Since these constraints can be too strict and very complicated to
implement in a simple optimization problem, they are relaxed into max-margin
relations using slack variables to tolerate errors. Maximizing the margin under these
constraints gives us the best ranking function.
An example might be the one where the Pairwise Ranking SVM is used to generate
precise textual information about an image. So for instance, given the following novel
image the algorithm will associate just a positive or a negative answer about the fact
that the image is dense.
In another example, the novel image is evaluated using several ranking functions
trained using the Pairwise Ranking SVM.
Given a novel image, all the learned ranking functions are evaluated on it, and for each
attribute two reference images are identified. We can then say that the novel image is
more dense than one reference image and less dense than another, and not just that it
is dense or not dense.
In practice, domain adaptation studies exactly this kind of problem and tries to find a
way to close the domain gap among the different kinds of images, or different kinds
of data, such that a model trained on one domain can then be reused efficiently
on a different but related domain.
Transfer Learning: the idea behind transfer learning is to include some prior
knowledge inside a deep model, which also helps when we have a
limited amount of training samples. Basically, we are transferring knowledge from a
previously trained model to a new model. Nowadays this is practically the most used
technique. It's really the standard, not an exception, and we have to think about
the application of transfer learning not only for classification but for many other tasks,
like for instance detection or image captioning. Transfer learning can also be thought of
as a sort of regularization, because it improves the
generalization of the model by transferring knowledge from another pre-existing
model. So, it is a sort of initialization and helps to generalize.
5. How is the freezing of the layers done in transfer learning?
Layer freezing means that the layer weights of a trained model are not changed when they
are reused in a subsequent downstream task - they remain frozen. Essentially, when
backpropagation is done during training, the weights of these layers are left untouched.
E.g. when we perform transfer learning from a neural network trained on a
(usually bigger) dataset to make it classify a new dataset (usually smaller).
(E.g. Homework 2)
6. Given noisy examples that I want to classify with SVM, how best to set the
value of the C parameter? Explain slack variables.
SVM: we might have multiple separating lines.
Probably we would get some of these solutions with the perceptron. In SVM it's as if
we take our central separating hyperplane and we generate two other hyperplanes
on the margin.
Choosing this hyperplane, rather than others, our expectation is that if new data
arrives, of course there will be some misclassifications, but it should do better than all
the other ones.
If the training data includes this point, linear separation is simply not possible.
Having real data, we might have some noise, so we give ourselves the permission to
make some mistakes. Softening the margin a little bit, we find that we still have a convex
optimization problem. So, even with a softer margin, we can still solve
the problem with exactly the same techniques we have seen so far. What we need to
do is minimize the amount of slack.
What changes is that we are limiting the values of α. The effect of C, the
constant that determines how heavily the slack variables are penalized, is to bound
the values that the α_i can take. We also know that the α_i are the coefficients that
determine, among all the training data, which points are going to be in the solution
and which are not. So, given that the slack variables are the mistakes that we allow
ourselves to make, C is the price we pay for those mistakes, and this coefficient
tells us how soft or how hard our margin is. We say that the margin is hard
when we cannot make any mistakes. The smaller C, the more mistakes we can make;
the larger C, the closer we get to the hard margin. This is
related to the values that alpha can take, because the values of alpha determine how
sparse or non-sparse our solution is. A sparse solution is a solution with very few
support vectors. What we expect is that when we have a soft margin classifier we
will have a lot of support vectors, because we are allowing a lot of mistakes. With
noisy examples, then, C should not be too large, and in practice its value is chosen
by cross validation.
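For reference, this is the standard soft-margin formulation that the discussion of C, the slack variables and the α_i refers to. The notation (w, b, ξ_i for the slack variables, n samples) is the usual textbook one and is an assumption here, since the notes do not write the problem out.

```latex
\min_{w,\, b,\, \xi} \;\; \frac{1}{2}\lVert w \rVert^2 + C \sum_{i=1}^{n} \xi_i
\qquad \text{s.t.} \quad y_i\big(\langle w, x_i \rangle + b\big) \ge 1 - \xi_i, \quad \xi_i \ge 0

% In the dual, C appears only as an upper bound on the coefficients:
0 \le \alpha_i \le C, \qquad \sum_{i=1}^{n} \alpha_i y_i = 0
```

The hard margin corresponds to forcing every ξ_i to 0, which is what we approach when C grows very large.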
In Figure 1 we have data points in Euclidean space, so the k-means
algorithm is applicable. The number of clusters, k, is chosen to be 2. The first
allocation of the centroids is done randomly (Figure 2).
For each data point, the Euclidean distance between the point and the
two centroids is computed, and each point is assigned to the nearest centroid (Figure 3).
Once all the data points are assigned, the centroids are recomputed: each
centroid becomes the average of all the points assigned to it (Figure 4).
With the new clusters (Figure 5), the distances between the data points and the
centroids are recomputed and the point assignment is repeated. The centroids are
updated until we arrive at a good clustering solution, in which the data points are
assigned to the same cluster over successive iterations.
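The assignment and update steps described above can be sketched in a few lines of plain Python. This is only an illustrative sketch: the function name `kmeans`, the toy points and the stopping rule (stop when no point changes cluster) are assumptions made for the example.

```python
import random

def kmeans(points, k, iters=100, seed=0):
    """Plain k-means on a list of equal-length tuples, Euclidean distance."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # random initial centroids taken from the data
    assignment = None
    for _ in range(iters):
        # Assignment step: index of the nearest centroid for every point.
        new_assignment = [
            min(range(k),
                key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centroids[j])))
            for p in points
        ]
        if new_assignment == assignment:  # no point changed cluster: converged
            break
        assignment = new_assignment
        # Update step: each centroid becomes the mean of its assigned points.
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, assignment

# Toy data: two well separated groups of three points each.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (9.0, 9.0), (9.0, 10.0), (10.0, 9.0)]
centroids, assignment = kmeans(points, k=2)
```

Running it with different values of `seed` is a simple way to try multiple starting points.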
How to choose k?
One evaluation consists in keeping track of the average distance between the data
points and their centroids as k increases.
When plotting the computed values, we should keep in mind that this is not
a real evaluation of the error because, as we said, in clustering there are no labels,
so we cannot really say what is wrong or not.
Convergence
The algorithm terminates when there are only small oscillations around a local
minimum that, most of the time, is a good clustering solution. But it can happen that
a particular initialization, or seed, results in a suboptimal clustering solution. So it is
important to select good seeds using a heuristic or by trying multiple starting points.
This allows us to reduce the number of connections. So, at this point we will not have a
huge matrix W, but just a small kernel that is repeated and
moved around the image.
The effect of this reduction in terms of parameters of course will become even more
clear when we combine all the layers together, because each convolutional layer
takes as input the feature map of the previous layer and since we have reduced
somehow the amount of parameters of the previous layers then the effect will be
extended over the whole network. This really helped to decrease the number of
parameters with which we need to deal.
There is a further element that allows us to reduce the number of parameters of the
network itself, which is pooling. The pooling layers are generally introduced between
one convolutional layer and the next.
The effect of pooling is just that of aggregating information, and one way to do that
is, for instance, by calculating a norm.
The convolution operation is actually a linear combination of a filter matrix and
elements of an input matrix. Taken in the context of a convolutional neural
network, you have a filter (which could be of different sizes) that is used to
subject the input matrix to this convolution operation.
If you don’t have an activation function in your convolutional neural network,
you have what’s essentially a linear activation function (y=x), and you end up
computing a simple linear combination of the inputs (regardless of how many
layers you have in your convolutional or other deep network).
10. Logistic Regression, explain what it is and why it’s called “regression” but is a
classifier
Logistic regression is a generalized linear model using the same basic
formula of linear regression but it is regressing for the probability of a
categorical outcome.
11. About Deep Learning and Neural Networks, answer the following question
(true or false, and explain): can a neural network be sensitive to the exact
location in the image (top-right corner, center, etc.) of the subject?
No. If we use convolutional layers, these extract low-level patterns independently of
their location in the image.
13. what is K-means clustering and why is it guaranteed to converge after a certain
number of steps?
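Each step of the algorithm (assigning the points to the nearest centroid, then
recomputing the centroids as means) can only decrease, or leave unchanged, the L2
loss, i.e. the sum of squared Euclidean distances between the points and their
centroids. Since there are finitely many ways to assign the points to k clusters, it is
guaranteed to converge (to a local minimum) after a finite number of steps, which
depends also on the initialization of the centroids.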
In practice, we make a prediction just on the basis of the distance between the
square and the closest sample, by looking at the label of the closest sample. This
is how the nearest neighbor classifier works. To make the prediction we can simply
consider the possible pairs of training samples and the segment that connects them, and
then consider the orthogonal line that divides this segment into two equal parts.
This cell is known as the Voronoi cell (Figure A). This procedure can be repeated
over all our training set to define what is called the Voronoi tessellation (Figure B). At
the end what we get is a very precise separation curve between the blue triangles
and the red dots.
If, for instance, we have a mislabeled sample in our training set, then we need
to introduce a second boundary to separate this single sample from the
other class. For this reason, very small mistakes can change a lot the way in which
we end up dividing the two classes.
One way to avoid these mistakes due to outliers is that, instead of considering just
the single closest sample to each test instance, we consider a set of k most similar
training samples. So we pass from what is called one nearest neighbor to k nearest
neighbor.
How to choose k?
The value of k makes a substantial difference in the final prediction. When we
change k, what we do is change the boundary of the prediction.
A small k means that the method tends to be
more precise in the description of the boundary but also more sensitive to
outliers. When, instead, we consider a larger k, we see that the
boundaries are smoother and the prediction is less sensitive to outliers.
So, selecting the value of k consists in just applying a cross validation procedure.
This means that k is a parameter that we have to tune depending on the predictions
that we get on a validation set. We validate our choice and take the k
value that gives the best validation performance; this gives us
a good indication of how the method will perform in terms of generalization on
new test samples.
Choosing an odd value of k, we avoid ties in the vote between the two classes in
binary prediction.
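A minimal sketch of the k nearest neighbor rule discussed above, in plain Python. The function name `knn_predict` and the toy training set (which contains one deliberately mislabeled sample) are assumptions made for illustration.

```python
from collections import Counter
import math

def knn_predict(train, query, k=3):
    """train: list of (point, label); majority vote among the k nearest points."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

train = [((0, 0), "red"), ((0, 1), "red"), ((1, 0), "red"),
         ((5, 5), "blue"), ((5, 6), "blue"), ((6, 5), "blue"),
         ((0.5, 0.5), "blue")]  # a mislabeled sample inside the red region
query = (0.4, 0.4)

pred_1nn = knn_predict(train, query, k=1)  # follows the single closest sample
pred_3nn = knn_predict(train, query, k=3)  # majority vote over three neighbors
```

Here the single nearest neighbor is the mislabeled sample, while with k = 3 the two red neighbors outvote it, which is the smoothing effect of a larger k.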
Pros:
● kNN is a very simple classifier
● everything is based on distances and we make the hypothesis that nearby
regions of space have the same class. This is the only basic assumption on
which kNN is built.
● It's not a parametric approach. With SVM or the
perceptron, we take the data, build a model on the basis of some
parameters and then we can even forget about the data, just relying on the
model. For kNN it is different. We need to keep the data because everything is
based on the distance between our test samples and the training samples.
The positive aspect is that we don't have to infer any parameters, except for
the choice of k; we just need to keep the data, and if the training samples
change, for example if we add new categories to our training samples,
we don’t need to retrain our model.
● Moreover kNN has a very good generalization guarantee: it can be
shown that when n, the number of training samples, goes to
infinity, the kNN model has an error which is less than twice the optimal
Bayes error.
Cons:
● The time complexity of the method can be quite high, depending on
the cost of computing each distance. If this is costly, and most often it is,
then we have to multiply this cost by the number of training examples,
because we have to find the closest sample to our test
instance, so the overall procedure is quite expensive.
● We don't have a training procedure, we only have a testing part. We just have
to take our test sample, calculate the distances and make a decision. So the
cost is all at test time, while in general a parametric model tends to be
more costly during training and then quite fast at
test time. Moreover, the larger the dimensionality we are dealing
with, the more expensive the computation becomes, and this always scales linearly
with the number of samples. So of course this tends to be
quite expensive.
Perceptron:
where w are our synaptic weights, x our inputs, b the bias, needed to be invariant to
translations, and the sigma function a threshold.
We want a decision function that is non-linear and that's why we put a threshold and
a bias. The sigma function is any nonlinear function that makes sure that f(x) is
in (0, 1) or (-1, +1). When we have a decision function like this, what we are
really building is a classifier that finds a linear separating hyperplane.
In practice, what we learn with a perceptron are the parameters w and b.
Once we know those, we have our classifier. What we are going to learn is just
one of the lines depicted in the figure on the left; there are many lines because
all of them are possible and legitimate solutions.
When the prediction that we made is different from the true label (yi), the w
that we have in memory doesn’t return the right prediction, so we need to update it.
So, we take the original w and we add yi xi; similarly, we add yi to b. The
idea is that with the updated w and b, the product between the
decision function and the true label moves towards a positive value.
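The update rule described above can be sketched as follows. This is a minimal illustration under some assumptions: labels are in {-1, +1}, the decision function is sign(<w, x> + b), and the function name `train_perceptron` and the toy data are made up for the example.

```python
def train_perceptron(samples, epochs=100):
    """samples: list of (x, y) with x a tuple of floats and y in {-1, +1}."""
    w = [0.0] * len(samples[0][0])
    b = 0.0
    for _ in range(epochs):
        mistakes = 0
        for x, y in samples:
            score = sum(wi * xi for wi, xi in zip(w, x)) + b
            if y * score <= 0:  # wrong (or undecided) prediction
                w = [wi + y * xi for wi, xi in zip(w, x)]  # w <- w + y_i x_i
                b += y                                      # b <- b + y_i
                mistakes += 1
        if mistakes == 0:  # a full pass without errors: the data is separated
            break
    return w, b

# Linearly separable toy data.
samples = [((2.0, 1.0), 1), ((3.0, 2.0), 1), ((-1.0, -1.0), -1), ((-2.0, 0.0), -1)]
w, b = train_perceptron(samples)
```

Note that w and b change only when a mistake is made, so the final solution is built entirely from the misclassified points.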
What really matters in a perceptron is the mistakes: when something good happens
we don't do anything, when something bad happens we store something. In practice,
we store the errors. That also means that if you give me 100,000 data points and my
perceptron makes mistakes on 10 of them, all I need to store out of 100,000 points is
those 10. We can see these terms as the memory of the algorithm, which is what we need
to remember about the training data. This also means that we can compress the
data.
It's not a good idea to use a perceptron with data that are noisy.
16. Are CNNs rotation invariant?
No, just translation invariant; to handle rotations we use data augmentation. A CNN
cannot be expected to recognize an object in an image no matter how the object is
rotated (in the image plane) if the training set only includes the object in one
orientation.
C. was the best because it optimized on validation set but actually tested it on test
set
18. Question about deep learning: how to find the best parameters?
Grid Search, Cross Validation
(changing the metric too), and the hope is that, by changing the space where we
represent the data, our problem will become linearly separable in that space.
If we know how to define non-linear scalar products we can make any algorithm
non-linear.
We can always build a mapping ϕ(x) for any polynomial of order d via ⟨x, x'⟩^d, even though
we don't define it explicitly.
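To make the last point concrete, here is a small check for d = 2 and two-dimensional inputs. The explicit map `phi` below is written out only for comparison (it is the standard degree-2 feature map, not something given in the notes); the kernel itself never needs it.

```python
import math

def poly_kernel(x, z, d=2):
    """Homogeneous polynomial kernel <x, z>^d, computed in the input space."""
    return sum(a * b for a, b in zip(x, z)) ** d

def phi(x):
    """Explicit degree-2 feature map for 2-D inputs (for comparison only)."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)

x, z = (1.0, 2.0), (3.0, -1.0)
lhs = poly_kernel(x, z)                           # scalar product raised to d
rhs = sum(a * b for a, b in zip(phi(x), phi(z)))  # scalar product in feature space
```

The two values coincide (up to floating point error), which is why the mapping can stay implicit.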
23. QUIZ
a. One colleague reduces the number of feature maps on the first layer in
the network (Convolutional layer);
b. one colleague reduces the number of feature maps on the last layer in
the network (Fully connected layer).
What is the best choice?
In the fully connected layer
The fully-connected layer tells us that each element of the input connects with each
element of the output and so contributes to each element of the output.
We have this huge amount of weights here. The edges are fully connected, so
we need a weight for each of these edges. Most
probably we don't strictly need all these connections; we might be able to
reduce the number of connections and possibly share weights.
The possibility of using a convolutional layer rather than only fully connected
layers comes from the nature of the data themselves, for instance the fact that a local
structure inside an image is repeated. So instead of having a huge W
matrix containing a large amount of weights, we can take a smaller matrix W
that is applied multiple times. This is exactly the idea of how to pass from a
fully connected layer to a convolutional layer using matrices which are called filters
and are much smaller than the huge W matrix.
In Linear Regression we have to minimize the mean squared error MSE
between the predicted y_hat and the ground truth y.
Ridge Regression:
In Ridge Regression we have to minimize the MSE, like in Linear Regression,
but in this case the optimization objective also includes an L2 norm
penalty on the weights. This leads to weight shrinkage and has a
regularization effect.
EXTRA (i)
Lasso Regression is similar to Ridge Regression but applies the L1 norm. In this
case, besides weight shrinkage, we also have feature selection, because the L1
regularization sets some weights to 0.
EXTRA (ii)
ElasticNet is the combination of L1 and L2 weight norm penalization.
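Written out, the objectives above look as follows. This is a sketch in assumed notation: n samples, weight vector w, predictions ŷ_i, and regularization strengths λ, λ_1, λ_2 (the notes do not name these symbols).

```latex
\text{Linear:}\quad     \min_{w} \; \frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2
\text{Ridge:}\quad      \min_{w} \; \frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2 + \lambda \lVert w \rVert_2^2
\text{Lasso:}\quad      \min_{w} \; \frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2 + \lambda \lVert w \rVert_1
\text{ElasticNet:}\quad \min_{w} \; \frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2 + \lambda_1 \lVert w \rVert_1 + \lambda_2 \lVert w \rVert_2^2
```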
25. Given an input img 32x32x3 and a filter 5x5 with stride 1. What is the
dimensionality of the output of this convolutional filter?
(32 - 5)/1 + 1 = 28 => 28x28
Formula => (N-F)/Stride + 1
28. SVM: linear classifier, what happens when we remove one of the points very
close to the decision boundary? Also, what happens with non-linear classification?
if linear classifier, ????????
31. K fold cross validation & Leave one out cross validation
In some cases the division into a training and a validation set might not be a good
choice, because with this separation we are reducing the amount of data on which we
can really train the model itself.
One thing that we can do is divide the training set into multiple
equal parts and use only one of the parts as validation samples, while the remaining
parts are the real training set. So instead of having a fixed validation set, now we have
multiple validation sets, because in turn we can choose each of these subsets as the
validation part of the training set. The approach is called k-fold because we are
dividing the set of data into k parts, where only one is used as validation, while the
remaining k - 1 work as training. This is done in turn for all the k folds and then, to
assess the model, we take the average of the outputs of these k runs. In
practice, we can decide what is the best model just by taking the average of the
output over all the folds (green part of the previous figure).
In the case in which k is exactly equal to the cardinality, S, of our training sample,
we have what is called Leave-One-Out Cross Validation. In practice, every
single sample is kept out in turn as the validation set, while the remaining S - 1
samples are used to train the model. The Leave-One-Out error is an
unbiased estimator of the generalization error and this is important because
what we are interested in is the ability of our model to be a good predictor on
future samples. So, we want our model to generalize well. By using the Leave-One-Out
cross validation procedure we have a hint of how well our model
generalizes.
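The splitting scheme behind k-fold and Leave-One-Out can be sketched like this. The generator name `k_fold_indices` is hypothetical, and the sketch only produces index splits; training and scoring a model on each split is left out.

```python
def k_fold_indices(n, k):
    """Yield (train_idx, val_idx) pairs for k roughly equal folds over range(n)."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = list(range(start, start + size))  # one fold for validation
        train_idx = [i for i in range(n) if i < start or i >= start + size]
        yield train_idx, val_idx
        start += size

folds = list(k_fold_indices(10, 5))  # 5-fold split of 10 samples
loo = list(k_fold_indices(4, 4))     # k equal to the sample size: Leave-One-Out
```

In a real run one would shuffle the samples first, train on `train_idx`, evaluate on `val_idx`, and average the k validation scores.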
34. Do you think that adding more layers in a CNN would help to prevent
overfitting?
No
hint: justify the previous answer
37. If I put one new sample for SVM what happens?
38. logistic regression
a. what it is?
b. if I give you a new sample, completely different and far from the others,
how does it influence the classifier?
The classifier changes because of that
c. what about if I had a SVM? What would happen?
The classifier in that case does not change
39. tell me what’s going on in these three situations
Solution:
Model A: high learning rate
Model B: overfitting
Model C: underfitting / low learning rate
learns to model the data, we can sample from the distribution and
generate new input data samples. So it is a generative model like, for
instance, GANs.
An autoencoder is a neural network that can be used to reduce the data into a low
dimensional latent space by stacking multiple non-linear transformations (layers).
It has an encoder-decoder architecture. The encoder maps the input to the latent
space and the decoder reconstructs the input.
Comparison
1. PCA is essentially a linear transformation but Auto-encoders are capable of
modelling complex non linear functions.
2. PCA features are totally linearly uncorrelated with each other since features
are projections onto the orthogonal basis. But autoencoded features might
have correlations since they are just trained for accurate reconstruction.
3. PCA is faster and computationally cheaper than autoencoders.
4. A single layered autoencoder with a linear activation function is very similar to
PCA.
5. Autoencoder is prone to overfitting due to high number of parameters.
(though regularization and careful design can avoid this)
In kNN we don't really have a training procedure, we only have a testing part. We
just have to take our test sample, calculate the distances and make a decision. So the
cost is all at test time, while in general a parametric model tends to be more
costly during training and then quite fast at test time. For this
reason the problem with kNN is that the expensive part is exactly at test time, and this
is often a problem because it means that we cannot quickly apply the model at
test time: we always have to calculate the distances to be able to make a decision.
There are some tricks to make kNN faster. In some cases we can make the procedure
a little bit cheaper if, for instance, we have some knowledge about the dimensionality
of our space and we know that out of all d dimensions only a subset of r
is really important. In this case we can
compute the distance relying only on r elements rather than the whole
set of d elements.
Another possibility is that we know some structure of our training data, so we might
be able to organize our training data into trees for instance, or maybe we know that
there are some samples, of a certain class, that are more representative than others,
so they behave like prototypes and we might only refer to them rather than
going over all the distances to the other examples. But in general obtaining this
knowledge is not simple, because building the tree or
extracting the prototypes is costly.
One other thing that we can consider is editing or pruning the training samples. For
instance, if we know that there is a sample in a region where all
the closest samples belong to the same class, then we might decide to remove
this sample. The result will be that the Voronoi cells of the closest samples will enlarge
a little bit. So, we are editing, we are pruning, we are removing some information but
the result might still be good. We have no guarantee on the correct number
of elements to remove, and we might end up having some problems if for instance our
training sample changes and we want to add new categories.
And there are many practical tools to make a kNN faster. For instance FLANN is a
fast library for approximate Nearest Neighbour or also ANN.
even if that hyperplane misclassifies more points. For very tiny values of C, you
should get misclassified examples, often even if your training data is linearly
separable.
50. structure of a neural network (answer: fully connected and convolutional) and
(broadly) how they work.
a. Which one has more parameters (FC).
51. ranking SVM, what it tries to predict?
order between data points, more specifically +1 if x>y, -1 if x<y
52. description of GAN, difference between GAN and VAE
53. difference between Logistic Regression and Linear Regression
First of all
- Linear Regression => it is a regression algorithm
- Logistic Regression => it is a (binary) classification algorithm
The objective of linear regression is to learn a model that minimizes the error in
predicting a continuous value for each sample under test, by fitting a line in
the feature space.
Logistic regression’s objective, instead, consists in assigning a binary class
label {0,1} or {-1,+1} to each of the samples under test. It is based on the
sigmoid function σ, which takes as input z=<w,x>+b, and assigns x to 1 if σ(z)>0.5
(i.e. z>0) or 0 otherwise.
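A minimal sketch of the prediction step of logistic regression, assuming the parameters have already been learned. The values of `w` and `b` below are hypothetical and only serve the example.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(w, b, x):
    """Return (probability of class 1, predicted label in {0, 1})."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b  # linear part, as in linear regression
    p = sigmoid(z)                                # squashed into (0, 1)
    return p, 1 if p > 0.5 else 0

w, b = (1.0, -2.0), 0.5  # hypothetical, already-learned parameters
p_pos, label_pos = predict(w, b, (3.0, 1.0))  # z = 1.5
p_neg, label_neg = predict(w, b, (0.0, 1.0))  # z = -1.5
```

The threshold 0.5 is applied to the probability σ(z), which is the same as checking the sign of z.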
55. RNN
HIGH-LEVEL DESCRIPTION
Recurrent Neural Networks are a special class of NN characterized by internal
self-connections.
As an RNN processes sequential information, it performs the same operation
on every element of the input sequence. Its output, at each time step, depends
on the previous input and past computations. This allows the network to
develop a memory of previous events, which is encoded in its hidden state
variables.
BACKPROPAGATION THROUGH TIME
Gradient-based learning requires a closed-form relation between the model parameters
and the loss function.
In order to find a direct relation between the loss function and the network
weights, the RNN has to be represented as a DAG (directed acyclic graph). This
procedure is called unfolding and consists of replicating the network’s hidden
layer structure for each time step, obtaining a particular kind of feed-forward
neural network.
TRAINING
The training of an RNN does not differ from the one employed for standard
NNs (e.g. SGD with momentum, weight decay, dropout, regularization terms...).
59. Kernel trick for SVM. What it is and why we use it?
we use it when we have non-linear problems.
60. What can be done if data to separate are not linear using SVM?
Apply the kernel trick for SVM
61. Is SVM a binary or a multiclass classifier?
In its simplest form, SVM doesn’t support multiclass classification natively. It
supports binary classification, separating data points into two classes.
67. Gradient descent vs stochastic gradient descent
Gradient descent is a first order iterative algorithm
1) Initialize a weight vector θ
2) Iteratively compute θ(t+1) = θ(t) − α Gradient(Loss)
3) Stop when a minimum is reached
The computational bottlenecks are the number of samples over which the partial
derivatives are computed and the number of parameters to be updated at each step.
STOCHASTIC Gradient Descent: compute Gradient(Loss) for a small
representative subsample m<<n of the samples
- the minibatch is drawn uniformly
- the true gradient is approximated, but significant speedup
REMARK
- SGD does not stop at the minimum due to noise induced by the random
sampling
- SGD generalizes better than standard GD because it does not depend on the
number of samples
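The difference between the two update schemes can be sketched on a toy one-parameter least squares problem. Everything here is an assumption made for illustration: the model y = theta * x, the learning rate, the minibatch size m = 4 and the noiseless data (with noiseless data both methods reach the same value; with noisy data SGD would keep oscillating around the minimum).

```python
import random

def grad(theta, batch):
    """Gradient of the mean squared error of the model y = theta * x over a batch."""
    return sum(2 * (theta * x - y) * x for x, y in batch) / len(batch)

def gradient_descent(data, alpha=0.001, steps=200):
    theta = 0.0
    for _ in range(steps):
        theta -= alpha * grad(theta, data)  # gradient over all n samples
    return theta

def sgd(data, alpha=0.001, steps=200, m=4, seed=0):
    rng = random.Random(seed)
    theta = 0.0
    for _ in range(steps):
        minibatch = rng.sample(data, m)          # m << n samples drawn uniformly
        theta -= alpha * grad(theta, minibatch)  # approximate gradient, cheaper step
    return theta

data = [(float(x), 3.0 * x) for x in range(1, 21)]  # toy data, true theta = 3
theta_gd = gradient_descent(data)
theta_sgd = sgd(data)
```

Each SGD step touches only m samples instead of n, which is where the speedup comes from.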
In this example we cannot apply a 3x3 filter on a 7x7 input with stride 3 because the
filter would exit the image. If N is the number of components, vertically and horizontally,
of our input, F is the dimension of the filter and we know the stride, then to get
the output size we just need to apply the following rule:
(N-F)/stride+1
Let’s apply this rule to the previous examples with N = 7, F = 3:
Stride 1 => (7 - 3)/1+1=5
Stride 2 => (7 - 3)/2+1=3
Stride 3 => (7 - 3)/3+1=2.33
When we try stride 3, we see immediately that the ratio does not give us an
integer number and, for this reason, the filter does not fit. In these cases, one
thing that we can do is add zero padding on the border.
This might introduce a bit of noise because we are arbitrarily choosing a
value to extend the input map, but it does not change the results too much. It is
applied because it solves the problem of fitting the filter on the
image and, with the right amount of padding, it lets us maintain the original
dimensionality of the input. So the output can
have the same dimension as the input and in this way we don't lose too much
information when passing from one convolutional layer to the next.
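The rule above, extended with zero padding, can be checked with a small helper. The function name `conv_output_size` is hypothetical, and the padded version of the formula, (N - F + 2P)/stride + 1 with P pixels of padding on each side, is the standard one rather than something stated in these notes.

```python
def conv_output_size(n, f, stride=1, padding=0):
    """Spatial output size of a convolution; raises if the filter does not fit evenly."""
    span = n - f + 2 * padding
    if span < 0 or span % stride != 0:
        raise ValueError("filter does not tile the input with this stride/padding")
    return span // stride + 1

size_s1 = conv_output_size(7, 3, stride=1)                 # (7 - 3)/1 + 1
size_s2 = conv_output_size(7, 3, stride=2)                 # (7 - 3)/2 + 1
size_same = conv_output_size(7, 3, stride=1, padding=1)    # padding keeps the input size
size_s3_pad = conv_output_size(7, 3, stride=3, padding=1)  # padding makes stride 3 fit
```

Calling `conv_output_size(7, 3, stride=3)` without padding raises an error, matching the non-integer result 2.33 in the example.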
71. Stride
It is the number of pixels by which we slide a filter over an image between two
subsequent convolutions.
The idea is that we have our filter, W, which is what we need to learn. The values
inside this filter are the numbers that the network has to learn during the training
process. Let's say here that we learned them, so we found these values. Then, the
filter passes over the image (input data) and we get another set of values (feature
map). To calculate the feature map, we just have to take the filter, move it over
the input data and calculate these values one position at a time.
Once we have our feature map then we apply max pooling.
Now for instance we can divide the image in parts and search for the highest
value in each of these regions. This is exactly the effect of max pooling. Now, it is
important to underline that max pooling, besides reducing the amount of numbers we
need to deal with, has the effect of connecting regions of the input data that were
originally far away from each other. By picking just a subpart of the image,
it extends in practice the connection between local regions, putting close
together information that was possibly far apart in the original input data. So, on one
side it gives us the effect of sparsification because, as we see, we reduce the
amount of values, and at the same time it has the effect of zooming out of the image.
This means that in the following layer the information that will be exploited will
be at a higher scale level.
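A small sketch of max pooling on a feature map stored as a list of lists. The function name `max_pool2d`, the 2x2 non-overlapping regions and the toy values are assumptions made for the example.

```python
def max_pool2d(fmap, size=2):
    """Non-overlapping max pooling over a 2-D list whose sides are multiples of `size`."""
    rows, cols = len(fmap), len(fmap[0])
    return [
        [max(fmap[r + i][c + j] for i in range(size) for j in range(size))
         for c in range(0, cols, size)]
        for r in range(0, rows, size)
    ]

fmap = [[1, 3, 2, 0],
        [4, 6, 1, 1],
        [0, 2, 9, 5],
        [1, 1, 3, 7]]
pooled = max_pool2d(fmap)  # 4x4 feature map reduced to 2x2
```

Each output value summarizes a whole 2x2 region, so neighboring outputs correspond to inputs that were two pixels apart.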
73. High level / Low level features, which layers capture which?
Early layers => Low level features
Deep layers => High level features
not occupy the whole high dimensional space, they occupy just a subpart of it, which can
be for instance a very simple hyperplane or a more complex manifold surface.
The manifold, in practice, is just a particular surface on which the Euclidean
metric holds locally. So globally the Euclidean metric does not hold but
locally it does, which means that we can still apply kNN.
76. Dropout
The idea here is that in each forward pass, we randomly set some neurons to zero.
So at each forward pass we have a different set of nodes that are activated or turned
off. And we have a parameter which is in general set to 0.5, which is the probability
with which the node is turned on or off. The dropout solution is generally applied for
fully-connected layers because they are actually the layers which have the largest
number of parameters. So we want to reduce them and in this way dropout is really
beneficial. It can also be applied in a convolutional layer, but in that case, instead of
dropping random elements, one thing that we can do is drop entire
feature maps, so entire channels rather than random elements. But as we said, in
most cases dropout is applied to the fully connected layers.
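A minimal sketch of dropout on a list of activations. It uses the common "inverted dropout" variant, which rescales the surviving units during training so that nothing has to change at test time; this rescaling is an assumption of the sketch and is not described in the notes.

```python
import random

def dropout(activations, p_drop=0.5, training=True, rng=random):
    """Inverted dropout: zero each unit with probability p_drop, rescale the survivors."""
    if not training or p_drop == 0.0:
        return list(activations)  # at test time all units are kept
    keep = 1.0 - p_drop
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

rng = random.Random(0)
acts = [1.0] * 10
train_out = dropout(acts, p_drop=0.5, rng=rng)  # a different mask at every forward pass
test_out = dropout(acts, training=False)
```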
GANs
They do not work with an explicit density function, but rather sample from a
simple distribution (e.g. random noise) and learn how to transform it into the
training distribution.
The training consists in a two-player game between two actors called the
generator and the discriminator. The generator generates new (fake) samples
from noise and the discriminator has to distinguish fake from actual samples.
The training consists in a min-max optimization: alternation between a step of
gradient ascent on the discriminator (maximize the likelihood of telling real
samples from fake ones) and a step of gradient ascent on the generator (maximize
the likelihood of the discriminator being wrong, which is low if the samples are of
low quality).
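For reference, the min-max game is usually written as below. The notation is the standard one and is assumed here: G is the generator, D the discriminator (its output is the probability that a sample is real), p_data the training distribution and p(z) the simple noise distribution.

```latex
\min_{G} \max_{D} \;\;
  \mathbb{E}_{x \sim p_{\text{data}}}\big[\log D(x)\big]
  + \mathbb{E}_{z \sim p(z)}\big[\log\big(1 - D(G(z))\big)\big]

% Generator step used in practice (ascent on the likelihood of D being wrong):
\max_{G} \;\; \mathbb{E}_{z \sim p(z)}\big[\log D(G(z))\big]
```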
AUTOENCODERS
Unsupervised approach for learning a lower dimensional feature
representation from unlabeled training data, passing through a lower
dimensional latent space.
After training the decoder can be thrown away, and the encoder can be
fine-tuned together with a classifier and later used to initialize a supervised
task.
VARIATIONAL AUTOENCODERS
The difference w.r.t. autoencoders is that their latent space is continuous,
allowing easy random sampling. Sampling introduces stochasticity, so the
“decoded” image is always different from the encoded one.