
CVDL CAE 2

What are Autoencoders? Explain the architectural design.

Autoencoders are a specific type of feedforward neural network where the input is the same as the output. They
compress the input into a lower-dimensional code and then reconstruct the output from this representation. The code
is a compact “summary” or “compression” of the input, also called the latent-space representation.

An autoencoder consists of 3 components: encoder, code and decoder. The encoder compresses the input and produces
the code; the decoder then reconstructs the input using only this code.

To build an autoencoder we need 3 things: an encoding method, a decoding method, and a loss function to compare the
output with the target. We will explore these in the next section.

Autoencoders are mainly a dimensionality reduction (or compression) algorithm with a couple of important properties:

• Data-specific: Autoencoders are only able to meaningfully compress data similar to what they have been
trained on. Since they learn features specific to the given training data, they are different from a standard
data compression algorithm like gzip. So we can’t expect an autoencoder trained on handwritten digits to
compress landscape photos.

• Lossy: The output of the autoencoder will not be exactly the same as the input; it will be a close but
degraded representation. If you want lossless compression, autoencoders are not the way to go.

• Unsupervised: To train an autoencoder we don’t need to do anything fancy, just throw the raw input data
at it. Autoencoders are considered an unsupervised learning technique since they don’t need explicit
labels to train on. But to be more precise, they are self-supervised because they generate their own labels
from the training data.

2. Architecture

Let’s explore the details of the encoder, code and decoder. Both the encoder and decoder are fully-connected
feedforward neural networks, essentially the ANNs we covered in Part 1. The code is a single layer of an ANN with the
dimensionality of our choice. The number of nodes in the code layer (code size) is a hyperparameter that we set before
training the autoencoder.
First the input passes through the encoder, which is a fully-connected ANN, to produce the code. The decoder, which
has a similar ANN structure, then produces the output using only the code. The goal is to get an output identical to the
input. Note that the decoder architecture is the mirror image of the encoder. This is not a requirement but it’s typically
the case. The only requirement is that the dimensionality of the input and output be the same. Anything in the middle
can be played with.

There are 4 hyperparameters that we need to set before training an autoencoder:

• Code size: number of nodes in the middle layer. Smaller size results in more compression.

• Number of layers: the autoencoder can be as deep as we like. For example, we might have 2 layers in both
the encoder and decoder, not counting the input and output.

• Number of nodes per layer: the autoencoder architecture we’re working on is called a stacked
autoencoder since the layers are stacked one after another. Stacked autoencoders usually look like a
“sandwich”: the number of nodes per layer decreases with each subsequent layer of the encoder, and
increases back in the decoder. The decoder is also symmetric to the encoder in terms of layer structure.
As noted above this is not necessary, and we have total control over these parameters.

• Loss function: we either use mean squared error (MSE) or binary crossentropy. If the input values are in
the range [0, 1] then we typically use binary crossentropy; otherwise we use mean squared error.

Autoencoders are trained the same way as ANNs, via backpropagation. Check out the introduction of Part 1 for more
details on how neural networks are trained; it applies directly to autoencoders.
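As a concrete illustration, here is a minimal sketch of a stacked autoencoder in Keras. The 784-dimensional input (flattened 28x28 images), the 128-unit hidden layers, and the code size of 32 are illustrative assumptions, not fixed requirements.

```python
import tensorflow as tf

input_dim, code_size = 784, 32  # illustrative choices

inputs = tf.keras.Input(shape=(input_dim,))
# Encoder: fully-connected layers shrinking toward the code
h = tf.keras.layers.Dense(128, activation="relu")(inputs)
code = tf.keras.layers.Dense(code_size, activation="relu")(h)
# Decoder: mirror image of the encoder, expanding back to the input size
h = tf.keras.layers.Dense(128, activation="relu")(code)
outputs = tf.keras.layers.Dense(input_dim, activation="sigmoid")(h)

autoencoder = tf.keras.Model(inputs, outputs)
# Inputs scaled to [0, 1], so binary crossentropy is a typical choice
autoencoder.compile(optimizer="adam", loss="binary_crossentropy")

# Self-supervised training: the input is its own target, e.g.
# autoencoder.fit(x_train, x_train, epochs=10, batch_size=256)
```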

Explain Regularization: Bias Variance Tradeoff


Regularization is one of the most important concepts of machine learning. It is a technique that prevents the model
from overfitting by adding extra information to it. Sometimes a machine learning model performs well on the training
data but does not perform well on the test data: it fails to predict the output for unseen data because it has also fit
the noise in the training data, and such a model is called overfitted. This problem can be dealt with using a
regularization technique. Regularization lets us keep all variables or features in the model while reducing the
magnitude of their coefficients, and hence it maintains accuracy as well as the generalization of the model.
It mainly regularizes or shrinks the coefficients of features toward zero. In simple words, “in a regularization
technique, we reduce the magnitude of the features’ coefficients while keeping the same number of features.”
Bias–Variance Tradeoff:
The bias–variance tradeoff is a central problem in supervised learning. Ideally, one wants to choose a model that both
accurately captures the regularities in its training data and generalizes well to unseen data. Unfortunately, it is
typically impossible to do both simultaneously. Understanding the two sources of prediction error, bias and variance,
is important for accuracy in any machine-learning algorithm. There is a tradeoff between a model’s ability to minimize
bias and its ability to minimize variance, and finding the right balance is what guides the choice of the regularization
constant. A proper understanding of these errors helps avoid overfitting and underfitting a data set while training the
algorithm.
Bias Variance Tradeoff
If the algorithm is too simple (a hypothesis with a linear equation), it may have high bias and low variance and thus be
error-prone. If the algorithm fits too complex a hypothesis (a high-degree equation), it may have high variance and low
bias, and the model will not perform well on new entries. There is a sweet spot between these two conditions, known
as the Trade-off, or Bias Variance Trade-off. This tradeoff in complexity is why there is a tradeoff between bias and
variance: an algorithm can’t be more complex and less complex at the same time. The best tradeoff lies at the model
complexity where the total error is lowest.
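This balance is commonly summarized by the bias–variance decomposition of the expected test error. A standard form for squared-error loss, with \hat{f} the learned model and \sigma^2 the irreducible noise, is:

```latex
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
  = \mathrm{Bias}\big[\hat{f}(x)\big]^2
  + \mathrm{Var}\big[\hat{f}(x)\big]
  + \sigma^2
```

A too-simple model inflates the squared-bias term, a too-complex model inflates the variance term, and the best model minimizes their sum.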

Explain regularization methods and describe them in detail


There are mainly two types of regularization techniques, which are given below:
o Ridge Regression
o Lasso Regression
Ridge Regression (L2 regularization)
o Ridge regression is a type of linear regression in which a small amount of bias is introduced so that we can get
better long-term predictions.
o Ridge regression is a regularization technique used to reduce the complexity of the model. It is also called L2
regularization.
o In this technique, the cost function is altered by adding a penalty term to it. The amount of bias added to the model
is called the ridge regression penalty. It is calculated by multiplying lambda by the squared weight of each individual
feature.
o The equation for the cost function in ridge regression will be:

J(w) = \sum_{i=1}^{m} (y_i - \hat{y}_i)^2 + \lambda \sum_{j=1}^{n} w_j^2
In the above equation, the penalty term regularizes the coefficients of the model, and hence ridge regression reduces
the magnitudes of the coefficients, which decreases the complexity of the model.
o As we can see from the above equation, if the value of λ tends to zero, the equation becomes the cost function of
the plain linear regression model. Hence, for a very small value of λ, the model will resemble the linear regression
model.
o A general linear or polynomial regression will fail if there is high collinearity between the independent variables; to
solve such problems, ridge regression can be used.
o It also helps to solve problems where we have more parameters than samples.
Lasso Regression:
o Lasso regression is another regularization technique to reduce the complexity of the model. It stands for Least
Absolute Shrinkage and Selection Operator.
o It is similar to ridge regression except that the penalty term contains the absolute values of the weights instead of
their squares.
o Since it takes absolute values, it can shrink a slope all the way to 0, whereas ridge regression can only shrink it close
to 0.
o It is also called L1 regularization. The equation for the cost function of lasso regression will be:

J(w) = \sum_{i=1}^{m} (y_i - \hat{y}_i)^2 + \lambda \sum_{j=1}^{n} |w_j|
o Some of the features are completely neglected for model evaluation, since their coefficients become exactly zero.
o Hence, lasso regression can help us reduce overfitting in the model and also performs feature selection.
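A minimal sketch of both penalties with scikit-learn's Ridge and Lasso estimators; the toy dataset and the alpha value of 0.1 (alpha plays the role of λ in the equations above) are illustrative assumptions.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge, Lasso

# Toy regression data: 100 samples, 20 features
X, y = make_regression(n_samples=100, n_features=20, noise=10.0, random_state=0)

ridge = Ridge(alpha=0.1).fit(X, y)  # L2 penalty: shrinks coefficients toward 0
lasso = Lasso(alpha=0.1).fit(X, y)  # L1 penalty: can set coefficients exactly to 0

# Lasso's exact zeros act as built-in feature selection
selected = [i for i, w in enumerate(lasso.coef_) if w != 0.0]
print(len(selected), "of 20 features kept by lasso")
```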
Describe Early Stopping, Dataset Augmentation, and Parameter Sharing and Tying
In regularization by early stopping, we stop training the model when its performance on the validation set starts
getting worse: increasing loss, decreasing accuracy, or poorer scores on the scoring metric. If we plot the error on the
training dataset and the validation dataset together, both errors decrease with the number of iterations up to the point
where the model starts to overfit. After this point, the training error still decreases but the validation error increases.
So, even if training is continued after this point, early stopping essentially returns the set of parameters that were used
at this point, and so is equivalent to stopping training at that point. The final parameters returned will therefore enable
the model to have low variance and better generalization. The model at the time training is stopped will have better
generalization performance than the model with the least training error.
Early stopping can be thought of as implicit regularization, in contrast to regularization via weight decay. The method
is also efficient: it works with less training data, which is not always available in large amounts, and it requires less
training time than other regularization methods. However, repeating the early stopping process many times may result
in the model overfitting the validation dataset, just as overfitting occurs on the training data.
The number of iterations (i.e., epochs) taken to train the model can be considered a hyperparameter. An optimum
value for this hyperparameter is then found by hyperparameter tuning for the best performance of the learning
model.
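In Keras this is available as a built-in callback; a minimal sketch, assuming a compiled `model` and training arrays `x_train`, `y_train` already exist:

```python
import tensorflow as tf

early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss",          # stop when validation loss stops improving
    patience=5,                  # tolerate 5 epochs of no improvement
    restore_best_weights=True,   # return the parameters from the best epoch
)

# model.fit(x_train, y_train, validation_split=0.2,
#           epochs=100, callbacks=[early_stop])
```

Note that `restore_best_weights=True` implements exactly the behavior described above: the parameters from the point of lowest validation error are returned, regardless of how long training continued.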

Dataset Augmentation:
Data augmentation is a technique for artificially increasing the training set by creating modified copies of existing
data. It includes making minor changes to the dataset or using deep learning to generate new data points. Suppose
our model was effectively trained to classify the training data but did not generalize well to the validation data; data
augmentation is one more technique to fix this overfitting issue and improve the model training process. It is the
process by which we create new data for our model to use during training. This is done by taking our existing dataset
and transforming or altering the images in useful ways to create new images. The newly created images are known as
augmented images because they essentially allow us to augment our dataset with new data. The technique is useful
because it allows our model to look at each image in the dataset from a variety of different perspectives, which lets
the model extract relevant features more accurately and obtain more feature-related data from each training image.
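A minimal sketch of on-the-fly image augmentation using Keras preprocessing layers; the specific transforms and their ranges are illustrative choices.

```python
import tensorflow as tf

augment = tf.keras.Sequential([
    tf.keras.layers.RandomFlip("horizontal"),  # mirrored copies
    tf.keras.layers.RandomRotation(0.1),       # small random rotations
    tf.keras.layers.RandomZoom(0.1),           # slight zoom in or out
])

# Used as the first block of a model (or applied to batches directly),
# each training image is seen from a slightly different "perspective"
# on every epoch:
# augmented_images = augment(images, training=True)
```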
Parameter Sharing
Parameter sharing forces sets of parameters to be similar: we interpret various models or model components as
sharing a single set of parameters, so we only need to store a subset of them. Suppose two models A and B perform a
classification task on similar input and output distributions. In such a case, we'd expect the parameters of both
models to be close to each other. We could impose a norm penalty on the distance between the weights, but a more
popular method is to force sets of parameters to be equal. The idea behind parameter sharing is the essence of forcing
parameters to be similar. A significant benefit is that we need to store only a subset of the parameters (e.g., storing
only the parameters for model A instead of storing them for both A and B), which leads to significant memory savings.
Parameter Tying
Parameter tying is a regularization technique. We divide the parameters or weights of a machine learning model into
groups by leveraging prior knowledge, and all parameters in each group are constrained to take the same value. In
simple terms, we express that specific parameters should be close to each other.
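A minimal sketch of parameter sharing in Keras: a single Dense layer object is reused on two inputs, so both branches use (and jointly train) one set of weights. The layer sizes are illustrative.

```python
import tensorflow as tf

shared = tf.keras.layers.Dense(32, activation="relu")  # one set of weights

inp_a = tf.keras.Input(shape=(64,))
inp_b = tf.keras.Input(shape=(64,))
out_a = shared(inp_a)  # the same kernel and bias are used here...
out_b = shared(inp_b)  # ...and here: the parameters are shared (tied)

model = tf.keras.Model([inp_a, inp_b], [out_a, out_b])
# Only one copy of `shared`'s parameters is stored and updated,
# which is the memory saving described above.
```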
List and explain ensemble methods.
Ensemble learning is a machine learning technique that enhances accuracy and resilience in forecasting by merging
predictions from multiple models. It aims to mitigate errors or biases that may exist in individual models by leveraging
the collective intelligence of the ensemble.
Main Types of Ensemble Methods
1. Bagging
Bagging, short for bootstrap aggregating, is mainly applied in classification and regression. It increases the accuracy
of models, typically built on decision trees, by reducing variance to a large extent. The reduction in variance increases
accuracy and curbs overfitting, which is a challenge for many predictive models.
Bagging consists of two steps: bootstrapping and aggregation. Bootstrapping is a sampling technique where samples
are drawn from the whole population (set) with replacement. Sampling with replacement makes the selection
procedure randomized. The base learning algorithm is run on each of the samples to complete the procedure.
Aggregation in bagging combines the outputs of all the base models into the final prediction. Without aggregation,
predictions would not be accurate because not all outcomes would be taken into consideration; the aggregation is
therefore based on the bootstrapped models' outcome probabilities or on a vote over all outcomes of the predictive
models.
Bagging is advantageous since weak base learners are combined to form a single strong learner that is more stable
than the individual learners. It also reduces variance, thereby reducing the overfitting of models. One limitation of
bagging is that it is computationally expensive, and it can introduce more bias into models when the proper bagging
procedure is ignored.
2. Boosting

Boosting is an ensemble technique that learns from previous predictors' mistakes to make better predictions in the
future. The technique combines several weak base learners to form one strong learner, thus significantly improving
the predictability of models. Boosting works by arranging weak learners in a sequence, such that each learner learns
from the mistakes of the previous learner in the sequence, creating better predictive models.

Boosting takes many forms, including gradient boosting, Adaptive Boosting (AdaBoost), and XGBoost (Extreme
Gradient Boosting). AdaBoost uses weak learners in the form of decision trees that mostly include a single split,
popularly known as decision stumps. AdaBoost's first decision stump comprises observations carrying equal weights.

Gradient boosting adds predictors sequentially to the ensemble, with each new predictor correcting its predecessors,
thereby increasing the model's accuracy. New predictors are fit to counter the effects of errors in the previous
predictors. Gradient descent helps the gradient booster identify problems in the learners' predictions and counter
them accordingly.

XGBoost makes use of gradient-boosted decision trees, providing improved speed and performance. It relies heavily
on the computational speed and the performance of the target model. Model training must follow a sequence, which
makes the implementation of ordinary gradient boosted machines slow.

3. Stacking

Stacking, another ensemble method, is often referred to as stacked generalization. This technique works by training a
meta-learner to combine the predictions of several other learning algorithms. Stacking has been successfully
implemented in regression, density estimation, distance learning, and classification. It can also be used to measure
the error rate involved during bagging.
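A minimal sketch of all three methods with scikit-learn's ensemble module; the toy dataset and estimator counts are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (AdaBoostClassifier, BaggingClassifier,
                              StackingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)  # toy data

# Bagging: many trees on bootstrap samples, predictions aggregated by vote
bag = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50,
                        random_state=0).fit(X, y)

# Boosting (AdaBoost): decision stumps (max_depth=1) trained in sequence
# on re-weighted data, so each stump focuses on earlier mistakes
boost = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1),
                           n_estimators=50, random_state=0).fit(X, y)

# Stacking: a meta-learner is trained on the base models' predictions
stack = StackingClassifier(
    estimators=[("bag", bag), ("boost", boost)],
    final_estimator=LogisticRegression()).fit(X, y)
```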

Explain Batch Normalization in detail.


Batch normalization:
Batch normalization works by normalizing the output of a previous activation layer by subtracting the batch mean and
dividing by the batch standard deviation. After this step, the result is then scaled and shifted by two learnable
parameters, gamma and beta, which are unique to each layer.
This process allows the model to maintain the mean activation close to 0 and the activation standard deviation close
to 1.
The normalization step is as follows:
1. Calculate the mean and variance of the activations for each feature in a mini-batch.
2. Normalize the activations of each feature by subtracting the mini-batch mean and dividing by the mini-batch
standard deviation.
3. Scale and shift the normalized values using the learnable parameters gamma and beta, which allow the network to
undo the normalization if that is what the learned behavior requires.
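Written out, the steps over a mini-batch B = {x_1, ..., x_m} are as follows, with γ and β the learnable parameters and ε a small constant added for numerical stability:

```latex
\mu_B = \frac{1}{m}\sum_{i=1}^{m} x_i, \qquad
\sigma_B^2 = \frac{1}{m}\sum_{i=1}^{m}\left(x_i - \mu_B\right)^2, \qquad
\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}, \qquad
y_i = \gamma\,\hat{x}_i + \beta
```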
Batch normalization is typically applied before the activation function in a network layer, although some variations may
apply it after the activation function.
Benefits of Batch Normalization
Batch normalization offers several benefits to the training process of deep neural networks:
• Improved Optimization: It allows the use of higher learning rates, speeding up the training process by reducing the
need for careful tuning of parameters.
• Regularization: It adds a slight noise to the activations, similar to dropout. This can help to regularize the model and
reduce overfitting.
• Reduced Sensitivity to Initialization: It makes the network less sensitive to the initial starting weights.
• Allows Deeper Networks: By reducing internal covariate shift, batch normalization allows for the training of deeper
networks.
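A minimal sketch of the "before the activation" placement in Keras; the layer sizes and the 784-dimensional input are illustrative.

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(784,)),
    tf.keras.layers.Dense(256, use_bias=False),  # BN's beta makes a bias redundant
    tf.keras.layers.BatchNormalization(),        # normalize, then scale/shift (gamma, beta)
    tf.keras.layers.Activation("relu"),
    tf.keras.layers.Dense(10, activation="softmax"),
])
```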
Explain Dropout and Greedy Layer-wise Pre-training
