Unit IV
Encoder - Decoder
Introduction to Autoencoders
Autoencoders are a specific type of feedforward neural network in which the input is the same
as the output. They compress the input into a lower-dimensional code and then reconstruct the
output from this representation. The code is a compact “summary” or “compression” of the
input, also called the latent-space representation.
To build an autoencoder we need 3 things: an encoding method, a decoding method, and a loss
function to compare the output with the target.
Data-specific: Autoencoders can only meaningfully compress data similar to what
they have been trained on. Since they learn features specific to the given training data,
they are different from a standard data-compression algorithm like gzip. So we can’t
expect an autoencoder trained on handwritten digits to compress landscape photos.
Lossy: The output of the autoencoder will not be exactly the same as the input; it will be
a close but degraded reconstruction. If you want lossless compression, autoencoders are not the
way to go.
Architecture of an Autoencoder
Let’s explore the details of the encoder, code, and decoder. Both the encoder and decoder are
fully connected feedforward neural networks. The code is a single layer of the network with a
dimensionality of our choice. The number of nodes in the code layer (the code size) is a
hyperparameter that we set before training the autoencoder.
There are 4 hyperparameters that we need to set before training an autoencoder (a minimal
code sketch of these choices follows this list):
Code size: the number of nodes in the middle layer. A smaller size results in more
compression.
Number of layers: the autoencoder can be as deep as we like. For example, we might use
2 layers in both the encoder and decoder, not counting the input and output.
Number of nodes per layer: the architecture we’re working with is called
a stacked autoencoder since the layers are stacked one after another. Stacked
autoencoders usually look like a “sandwich”: the number of nodes per layer decreases with each
subsequent layer of the encoder and increases back in the decoder, and the decoder is
symmetric to the encoder in terms of layer structure. This symmetry is not required, and we
have total control over these parameters.
Loss function: we use either mean squared error (MSE) or binary cross-entropy. If the
input values are in the range [0, 1] we typically use cross-entropy; otherwise we use
mean squared error.
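As a concrete illustration of these four choices, here is a minimal Keras sketch (assuming TensorFlow 2.x; the sizes, activations, and input dimension are illustrative, not values from this text):

```python
import tensorflow as tf
from tensorflow.keras import layers, models

# The four hyperparameters discussed above (illustrative values).
input_dim = 784             # e.g. flattened 28x28 images
code_size = 32              # number of nodes in the middle (code) layer
hidden_sizes = [128, 64]    # nodes per encoder layer; the decoder mirrors it

inputs = tf.keras.Input(shape=(input_dim,))
x = inputs
for units in hidden_sizes:            # encoder: layer sizes decrease
    x = layers.Dense(units, activation="relu")(x)
code = layers.Dense(code_size, activation="relu")(x)

x = code
for units in reversed(hidden_sizes):  # decoder: symmetric to the encoder
    x = layers.Dense(units, activation="relu")(x)
outputs = layers.Dense(input_dim, activation="sigmoid")(x)

autoencoder = models.Model(inputs, outputs)
# Inputs scaled to [0, 1], so binary cross-entropy is a natural loss;
# "mse" would be the choice for unbounded inputs.
autoencoder.compile(optimizer="adam", loss="binary_crossentropy")
```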
Undercomplete autoencoders, whose code dimension is less than the input dimension, can learn
the most salient features of the data distribution. We have seen that these autoencoders fail to
learn anything useful if the encoder and decoder are given too much capacity.
Stochastic encoder
• We can also generalize the notion of an encoding function f(x) to an encoding distribution
p_encoder(h | x).
• The encoder and decoder are then not simple deterministic functions but define distributions.
• The code is sampled from the distribution p_encoder(h | x) for the encoder, and the output is
sampled from p_decoder(x | h) for the decoder.
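One common way to realize such a stochastic encoder is the Gaussian construction used in variational autoencoders. The sketch below is illustrative only: the layer sizes and the diagonal-Gaussian form are assumptions, not something fixed by this text.

```python
import tensorflow as tf
from tensorflow.keras import layers

# The encoder outputs the parameters of p_encoder(h | x) as a diagonal Gaussian.
inputs = tf.keras.Input(shape=(784,))
hidden = layers.Dense(128, activation="relu")(inputs)
h_mean = layers.Dense(32)(hidden)     # mean of the code distribution
h_logvar = layers.Dense(32)(hidden)   # log-variance of the code distribution

# Sample h ~ p_encoder(h | x) using the reparameterization trick.
def sample(args):
    mean, logvar = args
    eps = tf.random.normal(shape=tf.shape(mean))
    return mean + tf.exp(0.5 * logvar) * eps

h = layers.Lambda(sample)([h_mean, h_logvar])
```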
3. Denoising Autoencoders :
The denoising autoencoder gets rid of noise by learning a representation of the input from which
the noise can be filtered out easily.
While removing noise directly from the image seems difficult, the autoencoder performs this by
mapping the input data into a lower-dimensional manifold (like in undercomplete autoencoders),
where filtering of noise becomes much easier.
Essentially, denoising autoencoders work with the help of non-linear dimensionality reduction.
The loss function generally used in these types of networks is L2 or L1 loss.
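A minimal training sketch of this idea, assuming a compiled Keras model named autoencoder (such as the one sketched earlier) and training data x_train scaled to [0, 1]; the noise level is an illustrative choice:

```python
import numpy as np

# Corrupt the inputs with Gaussian noise; the clean data stays the target.
noise_factor = 0.3                     # illustrative noise level
x_noisy = x_train + noise_factor * np.random.normal(size=x_train.shape)
x_noisy = np.clip(x_noisy, 0.0, 1.0)   # keep values in [0, 1]

# Train to map noisy inputs back to the clean originals (L2 loss here).
autoencoder.compile(optimizer="adam", loss="mse")
autoencoder.fit(x_noisy, x_train, epochs=20, batch_size=128)
```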
4. Contractive Autoencoders :
Similar to other autoencoders, contractive autoencoders perform the task of learning a representation
of the image while passing it through a bottleneck and reconstructing it in the decoder.
The contractive autoencoder also has a regularization term to prevent the network from learning
the identity function and simply mapping the input to the output.
Contractive autoencoders work on the basis that similar inputs should have similar encodings
and a similar latent space representation. It means that the latent space should not vary by a huge
amount for minor variations in the input.
The penalty is the Frobenius norm of the Jacobian of the encoder activations with respect to the
input, summed over all training samples.
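Written out in the standard formulation (the symbols below follow the usual convention for contractive autoencoders rather than notation defined earlier in this text: f is the encoder, g the decoder, h = f(x) the code, and λ the regularization weight), the loss adds the squared Frobenius norm of the Jacobian of the encoder with respect to the input:

```latex
\mathcal{L} = \sum_{x \in \mathcal{D}} \Big( \lVert x - g(f(x)) \rVert^{2}
    + \lambda \, \lVert J_f(x) \rVert_F^{2} \Big),
\qquad
\lVert J_f(x) \rVert_F^{2} = \sum_{i,j}
    \left( \frac{\partial h_i(x)}{\partial x_j} \right)^{2}
```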
Applications of autoencoders
Now that you understand various types of autoencoders, let’s summarize some of their most
common use cases.
1. Dimensionality reduction
Undercomplete autoencoders are those that are used for dimensionality reduction.
Suppose the input and output each have 3000 dimensions, and the desired reduced dimension
is 200. We can develop a 5-layer network where the encoder has layers of 3000 and 1500 neurons
followed by the 200-neuron code, and the decoder mirrors the encoder.
The vector embeddings of the compressed input layer can be considered as a reduced
dimensional embedding of the input layer.
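A sketch of that 5-layer network in Keras; the 3000, 1500, and 200 sizes come from the description above, while the activations and loss are illustrative assumptions:

```python
from tensorflow.keras import layers, models

# Encoder: 3000 -> 1500 -> 200 (code); decoder mirrors it back to 3000.
reducer = models.Sequential([
    layers.Dense(1500, activation="relu", input_shape=(3000,)),
    layers.Dense(200, activation="relu", name="code"),  # reduced dimension
    layers.Dense(1500, activation="relu"),
    layers.Dense(3000, activation="sigmoid"),
])
reducer.compile(optimizer="adam", loss="mse")
```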
2. Image denoising
Autoencoders like the denoising autoencoder can be used for performing efficient and highly
accurate image denoising.
The noisy image is fed into the autoencoder as input, and the noise-free output is
reconstructed by minimizing the reconstruction loss against the original (noiseless) target.
Once the autoencoder weights are trained, they can be further used to denoise the raw image.
4. Anomaly Detection
Undercomplete autoencoders can also be used for anomaly detection.
5. Missing Value Imputation
Denoising autoencoders can be used to impute missing values in a dataset. The idea is to
train the autoencoder by randomly placing missing values in the input data and trying
to reconstruct the original raw data by minimizing the reconstruction loss.
Once the autoencoder weights are trained, records with missing values can be passed
through the network to reconstruct the input data, now with the missing features imputed.
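A rough sketch of that training setup, assuming a compiled Keras model autoencoder and a NumPy matrix X scaled to [0, 1]; the masking fraction and the zero fill value are illustrative choices:

```python
import numpy as np

# Randomly knock out ~10% of the entries and replace them with 0.
mask = np.random.rand(*X.shape) < 0.10
X_missing = np.where(mask, 0.0, X)

# Train to reconstruct the original data from the corrupted copy.
autoencoder.fit(X_missing, X, epochs=20, batch_size=128)

# At prediction time, records with genuinely missing values (filled with 0)
# are passed through the network to obtain imputed features.
X_imputed = autoencoder.predict(X_missing)
```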
Dimensionality Reduction is the process of reducing the number of dimensions in the data,
either by excluding less useful features (Feature Selection) or by transforming the data into lower
dimensions (Feature Extraction). Dimensionality reduction helps prevent overfitting. Overfitting is
a phenomenon in which the model learns too well from the training dataset and fails to
generalize to unseen real-world data.
Common dimensionality reduction techniques include:
AutoEncoders
Principal Component Analysis (PCA)
Linear Discriminant Analysis (LDA)
In this post, let us look in detail at AutoEncoders for dimensionality reduction.
AutoEncoders
AutoEncoder is an unsupervised Artificial Neural Network that attempts to encode the data
by compressing it into lower dimensions (the bottleneck layer or code) and then decoding the
data to reconstruct the original input. The bottleneck layer (or code) holds the compressed
representation of the input data.
In an AutoEncoder the number of output units must be equal to the number of input units, since
we’re attempting to reconstruct the input data. AutoEncoders usually consist of an encoder
and a decoder. The encoder encodes the provided data into a lower dimension which is the
size of the bottleneck layer and the decoder decodes the compressed data into its original
form.
The number of neurons in the layers of the encoder decreases as we move through further
layers, whereas the number of neurons in the layers of the decoder increases as we move
through further layers. Three layers are used in each of the encoder and decoder in the
following example: the encoder contains 32, 16, and 7 units in each layer respectively,
and the decoder contains 7, 16, and 32 units in each layer respectively. The code size, i.e.
the number of neurons in the bottleneck, must be less than the number of features in the data.
Before feeding the data into the AutoEncoder, the data must be scaled between 0
and 1 using MinMaxScaler, since we are going to use a sigmoid activation function in the
output layer, which outputs values between 0 and 1.
When we use AutoEncoders for dimensionality reduction, we extract the bottleneck layer
and use it to reduce the dimensions. This process can be viewed as feature extraction,
as in the sketch below.
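Putting the preceding paragraphs together, here is a minimal sketch (assuming TensorFlow 2.x and scikit-learn; the 32/16/7 sizes come from the example above, and everything else, including the data matrix X, is an illustrative assumption):

```python
import tensorflow as tf
from tensorflow.keras import layers, models
from sklearn.preprocessing import MinMaxScaler

# Scale features to [0, 1] to match the sigmoid output layer.
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)          # X: (n_samples, n_features)
n_features = X_scaled.shape[1]

inputs = tf.keras.Input(shape=(n_features,))
x = layers.Dense(32, activation="relu")(inputs)
x = layers.Dense(16, activation="relu")(x)
bottleneck = layers.Dense(7, activation="relu", name="bottleneck")(x)
x = layers.Dense(16, activation="relu")(bottleneck)
x = layers.Dense(32, activation="relu")(x)
outputs = layers.Dense(n_features, activation="sigmoid")(x)

autoencoder = models.Model(inputs, outputs)
autoencoder.compile(optimizer="adam", loss="binary_crossentropy")
autoencoder.fit(X_scaled, X_scaled, epochs=50, batch_size=32)

# For dimensionality reduction, keep only the encoder up to the bottleneck.
encoder = models.Model(inputs, bottleneck)
X_reduced = encoder.predict(X_scaled)       # shape: (n_samples, 7)
```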
Some common variants of AutoEncoders include:
Deep Autoencoder
Sparse Autoencoder
Undercomplete Autoencoder
Variational Autoencoder
LSTM Autoencoder
Hyperparameters of an AutoEncoder
Optimizers in Deep Learning
Deep learning is the subfield of machine learning used to perform complex
tasks such as speech recognition, text classification, etc. A deep learning model consists of an
activation function, input, output, hidden layers, a loss function, and so on. Any deep learning model
tries to generalize from the data using an algorithm and to make predictions on unseen
data. We need an algorithm that maps examples of inputs to outputs, and an
optimization algorithm. An optimization algorithm finds the values of the parameters (weights)
that minimize the error when mapping inputs to outputs.
While training a deep learning model, we modify the weights in each epoch to
minimize the loss function. An optimizer is a function or algorithm that modifies the
attributes of the neural network, such as the weights and the learning rate. Thus, it helps reduce
the overall loss and improve accuracy. The problem of choosing the right weights for the
model is a daunting task, as a deep learning model generally consists of millions of
parameters. It raises the need to choose a suitable optimization algorithm for your
application. Hence understanding these algorithms is necessary before having a deep dive
into the field.
You can use different optimizers to make changes in your weights and learning rate.
However, choosing the best optimizer depends upon the application. As a beginner, a tempting
thought is to try all the possibilities and choose the one that shows
the best results. This might not be a problem initially, but when dealing with hundreds of
gigabytes of data, even a single epoch can take a considerable amount of time. So randomly
choosing an algorithm is no less than gambling with your precious time, as you will realize
sooner or later in your journey.
In this guide, we will learn about different optimizers used in building a deep learning model
and the factors that could make you choose an optimizer instead of others for your
application.
We are going to learn the following optimizers in this guide and look at the pros and cons of
each algorithm, so that at the end of the article you will be able to compare various
optimizers and the procedures they are based upon.
1. Gradient Descent
2. Stochastic Gradient Descent
3. Stochastic Gradient descent with momentum
4. Mini-Batch Gradient Descent
5. Adagrad
6. RMSProp
7. AdaDelta
8. Adam
Before proceeding, there are a few terms that you should be familiar with.
Epoch – the number of times the algorithm runs over the whole training dataset.
Batch – the number of samples used to update the model parameters.
Cost Function/Loss Function – a function used to calculate the cost, i.e. the
difference between the predicted value and the actual value.
Weights/Bias – the learnable parameters in a model that control the signal between two
neurons.
Gradient Descent
Gradient Descent can be considered the popular kid among the class of optimizers. This
optimization algorithm uses calculus to modify the values consistently and reach the
local minimum. Before moving ahead, you might ask: what is a gradient?
In simple terms, imagine holding a ball at the top of a bowl. When you release
the ball, it rolls along the steepest direction and eventually settles at the bottom of the bowl.
The gradient points in the steepest direction, and moving against it takes the ball to the local
minimum, that is, the bottom of the bowl.
The gradient descent update rule is

w(t+1) = w(t) − α · ∇L(w(t))

Here alpha (α) is the step size that determines how far to move against the gradient in each
iteration. Gradient descent works as follows:
1. It starts with some coefficients, sees their cost, and searches for a cost value lower than
the current one.
2. It moves towards the lower cost and updates the values of the coefficients.
3. The process repeats until the local minimum is reached, a point beyond which it cannot
proceed.
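Those three steps, as a plain-Python sketch on a simple one-dimensional loss (the loss function, starting point, and step size are illustrative):

```python
# Minimize L(w) = (w - 3)^2 with plain gradient descent.
def grad(w):
    return 2.0 * (w - 3.0)       # dL/dw

w = 0.0                          # initial coefficient
alpha = 0.1                      # step size
for _ in range(100):
    w = w - alpha * grad(w)      # move against the gradient
print(w)                         # approaches the minimum at w = 3
```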
Mini-Batch Gradient Descent
In this variant of gradient descent, instead of taking all the training data, only a subset of the
dataset is used to calculate the loss function. Since we use a batch of data instead of
the whole dataset, fewer iterations are needed per pass. That is why the mini-batch gradient
descent algorithm is faster than both stochastic gradient descent and batch gradient descent.
This algorithm is more efficient and robust than the earlier variants of gradient
descent. Because the algorithm uses batching, not all of the training data needs to be loaded into
memory, which makes the process more efficient. Moreover, the cost function in
mini-batch gradient descent is noisier than in batch gradient descent but smoother
than in stochastic gradient descent. Because of this, mini-batch gradient
descent provides a good balance between speed and accuracy.
Despite all that, the mini-batch gradient descent algorithm has some downsides too. It
introduces a hyperparameter, the mini-batch size, which needs to be tuned to achieve the required
accuracy, although a batch size of 32 is considered appropriate for almost every
case. Also, in some cases it results in poor final accuracy, which gives rise to the need to
look for other alternatives.
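A sketch of the mini-batch loop itself; the arrays, the grad function, and all constants are illustrative assumptions:

```python
import numpy as np

# Assumes arrays X, y, initial weights w, and a grad(w, Xb, yb) function
# returning the gradient on one batch.
batch_size = 32                 # the "mini-batch-size" hyperparameter
alpha = 0.01                    # learning rate
n_epochs = 10

for epoch in range(n_epochs):
    idx = np.random.permutation(len(X))       # reshuffle every epoch
    for start in range(0, len(X), batch_size):
        batch = idx[start:start + batch_size]
        w = w - alpha * grad(w, X[batch], y[batch])
```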
Adagrad (Adaptive Gradient Descent)
The adaptive gradient descent algorithm is slightly different from other gradient descent
algorithms, because it uses a different learning rate for each iteration. The change in
learning rate depends on how much the parameters change during training: the more a
parameter has been updated, the smaller its learning rate becomes. This modification is highly
beneficial because real-world datasets contain sparse as well as dense features, so it is unfair
to use the same learning rate for all the features. The Adagrad algorithm uses the
following formula to update the weights:

w(t+1) = w(t) − α(t) · g(t), with α(t) = η / sqrt(Σ_{i=1..t} g(i)^2 + ε)

Here α(t) denotes the learning rate at iteration t, g(t) is the gradient, η is a constant, and ε is
a small positive value to avoid division by zero.
One downside of AdaGrad optimizer is that it decreases the learning rate aggressively and
monotonically. There might be a point when the learning rate becomes extremely small. This
is because the squared gradients in the denominator keep accumulating, and thus the
denominator part keeps on increasing. Due to small learning rates, the model eventually
becomes unable to acquire more knowledge, and hence the accuracy of the model is
compromised.
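The accumulation just described, as a NumPy sketch (the weights w, the grad function, and the constants are illustrative assumptions):

```python
import numpy as np

eta, eps = 0.1, 1e-8
G = np.zeros_like(w)                    # accumulated squared gradients
for _ in range(1000):
    g = grad(w)
    G = G + g ** 2                      # denominator only ever grows...
    w = w - eta / np.sqrt(G + eps) * g  # ...so each weight's step shrinks
```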
Stochastic Gradient Descent
At the end of the previous section, you learned why using gradient descent on massive data
might not be the best option. To tackle this problem, we have stochastic gradient descent. The
term stochastic refers to the randomness on which the algorithm is based. In stochastic
gradient descent, instead of taking the whole dataset for each iteration, we randomly select
batches of data; that is, we take only a few samples from the dataset.
The procedure is first to select the initial parameters w and the learning rate η, and then to
randomly shuffle the data at each iteration, reaching an approximate minimum.
Since we are not using the whole dataset but only batches of it for each iteration, the path taken
by the algorithm is noisy compared with the gradient descent algorithm. Thus, SGD
needs a higher number of iterations to reach the local minimum, and with more iterations the
overall computation time increases. But even with the increased number of iterations, the
computational cost is still less than that of the full gradient descent optimizer. So the
conclusion is: if the data is enormous and computation time is an essential factor, stochastic
gradient descent should be preferred over batch gradient descent.
SGD with Momentum
Comparing the convergence plot of plain stochastic gradient descent with that of SGD with
momentum, one can see from the paths chosen by the two algorithms that using momentum
helps reach convergence in less time. You might be tempted to use a large momentum and
learning rate to make the process even faster, but remember that as the momentum increases,
so does the possibility of overshooting the optimal minimum. This can result in poor accuracy
and even more oscillations.
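The momentum update itself is a small change to plain SGD, sketched below; the velocity formulation shown is the common one, and all names and constants are illustrative assumptions:

```python
import numpy as np

# Assumes initial weights w and a grad(w) function.
gamma, alpha = 0.9, 0.01        # momentum coefficient and learning rate
v = np.zeros_like(w)            # "velocity": accumulated past gradients
for _ in range(1000):
    v = gamma * v + alpha * grad(w)   # momentum smooths the noisy SGD path
    w = w - v                         # too large a gamma can overshoot
```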
RMSProp
RMSProp is one of the most popular optimizers among deep learning enthusiasts, perhaps
because, although it was never formally published, it is still very well known in the
community. RMSProp is essentially an extension of Rprop (resilient propagation). Rprop
addresses the problem of gradients that vary widely in magnitude: some are small while others
may be huge, so defining a single learning rate might not be the best idea. Rprop uses the sign
of the gradient, adapting the step size individually for each weight. In this algorithm, the two
most recent gradients are first compared by sign. If they have the same sign, we are going in
the right direction, and the step size is increased by a small fraction; if they have opposite
signs, the step size is decreased. The step size is then bounded, and the weight update is
applied.
The problem with Rprop is that it does not work well with large datasets or when we want
to perform mini-batch updates. Achieving the robustness of Rprop and the efficiency of
mini-batches at the same time was the main motivation behind RMSProp. RMSProp can also
be considered an advancement over the AdaGrad optimizer, as it avoids AdaGrad's
monotonically decreasing learning rate.
RMSProp keeps a running average of the squared gradients,

E[g^2](t) = γ · E[g^2](t−1) + (1 − γ) · g(t)^2

where gamma (γ) is the forgetting factor. The weights are then updated by

w(t+1) = w(t) − η / sqrt(E[g^2](t) + ε) · g(t)
In simpler terms, if there exists a parameter due to which the cost function oscillates a lot, we
want to penalize the updates of that parameter. Suppose you built a model to classify a variety
of fishes, and the model relies mainly on the feature ‘color’ to differentiate between them,
causing a lot of errors. What RMSProp does is penalize the updates of the parameters
associated with ‘color’ so that the model can rely on other features too. This prevents the
algorithm from adapting too quickly to changes in the parameter ‘color’ compared with other
parameters. This algorithm has several benefits compared with earlier versions of gradient
descent: it converges quickly and requires less tuning.
The problem with RMS Prop is that the learning rate has to be defined manually and the
suggested value doesn’t work for every application.
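Both formulas combined into a NumPy sketch; the weights w, the grad function, and the constants are common defaults rather than prescriptions:

```python
import numpy as np

eta, gamma, eps = 0.001, 0.9, 1e-8
Eg2 = np.zeros_like(w)                     # running average of squared grads
for _ in range(1000):
    g = grad(w)
    Eg2 = gamma * Eg2 + (1 - gamma) * g ** 2   # forget old gradients slowly
    w = w - eta / np.sqrt(Eg2 + eps) * g       # per-parameter step size
```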
Consider, for example, a dataset of images produced by rotating and scaling a single picture of
the letter A. Although each image contains many pixels, the dataset's true dimensionality is
only two, because only two variables (rotation and scale) have been used to produce the data.
Successful Dimensionality Reduction will discard the correlated information (the letter A
itself) and recover only the variable information (rotation and scale).
Dimensionality reduction reduces the number of dimensions (also called features and
attributes) of a dataset. It is used to remove redundancy and help both data scientists and
machines extract useful patterns.
Figure 2: "...as the dimensionality of the data increases you need a much higher number of
samples to sufficiently represent the feature space"
...
A dataset of 1000 binary features can take 2^1000, or roughly 10^301, distinct values. By
comparison, there are only about 10^82 atoms in the entire universe (according to Universe
Today). In other words, our data set contains far more possible values than there are atoms in
the universe. Now consider that 1000 features are not much by today's standards and that
binary features are an extreme oversimplification.
This combinatorial explosion gives two problems for data science:
1. Quality of models. Traditional data science methods do not work well in such vast search
spaces.
2. Computability. It requires huge amounts of computation to find patterns in high dimensions.
High dimensional data points are all far away from each other.
When you have hundreds or thousands of dimensions, points end up near the edges. It is
counterintuitive in 2 or 3 dimensions where our brains live most of the time, but in a high-
dimensional space, the idea of closeness as a measure of similarity breaks down.
Let's gain this intuition! Think about the probability of a point ending up at the edge of a d-
dimensional unit hypercube, that is, the cube with d dimensions with each edge having unit
length (length = 1). We’ll use the letter e to represent the degree of error, i.e., how close a point
must be to a face of the cube to count as being "at the edge". Take e = 0.001. Along a single
dimension, the probability that a uniformly random coordinate stays more than e away from
both ends is 1 − 2e = 0.998. Across d independent dimensions, the probability that a point lies
in the interior is therefore 0.998^d:

d        probability in the interior (0.998^d)
1        0.998
2        0.996
100      0.819
1000     0.135
2000     0.018

So in 2000 dimensions, almost every point is at an edge.
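The numbers in this table can be checked with a few lines of Python:

```python
e = 0.001                        # degree of error
for d in (1, 2, 100, 1000, 2000):
    interior = (1 - 2 * e) ** d  # chance of staying clear of every edge
    print(d, round(interior, 3)) # 0.998, 0.996, 0.819, 0.135, 0.018
```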
Guided by the problems resulting from high dimensionality, we find the following
reasons to use Dimensionality Reduction:
1. Interpretability. It is not easy for a human to interpret results when dealing with
many dimensions. Imagine looking at a spreadsheet with a thousand columns to find
differences between rows. In medical or governmental applications, it can be a
requirement for the results to be interpretable.
2. Visualization. Somewhat related to the previous point, you might want to project your data
to 2D or 3D to produce a visual representation. After all, an image is worth a thousand words.
3. Reduce overfitting. A machine learning model can use each one of the dimensions as a
feature to distinguish between examples, which encourages overfitting. Reducing the
dimensionality is a remedy against the Hughes phenomenon.