CVDL Cae 2
Autoencoders are a specific type of feedforward neural network in which the target output is the input itself. They
compress the input into a lower-dimensional code and then reconstruct the output from this representation. The code
is a compact “summary” or “compression” of the input, also called the latent-space representation.
An autoencoder consists of 3 components: encoder, code and decoder. The encoder compresses the input and produces
the code; the decoder then reconstructs the input using only this code.
To build an autoencoder we need 3 things: an encoding method, a decoding method, and a loss function to compare the
output with the target. We will explore these in the next section.
Autoencoders are mainly a dimensionality reduction (or compression) algorithm with a couple of important properties:
• Data-specific: Autoencoders are only able to meaningfully compress data similar to what they have been
trained on. Since they learn features specific to the given training data, they are different from a standard
data compression algorithm like gzip. So we can’t expect an autoencoder trained on handwritten digits to
compress landscape photos.
• Lossy: The output of the autoencoder will not be exactly the same as the input; it will be a close but
degraded representation. If you want lossless compression they are not the way to go.
• Unsupervised: To train an autoencoder we don’t need to do anything fancy, just throw the raw input data
at it. Autoencoders are considered an unsupervised learning technique since they don’t need explicit
labels to train on. But to be more precise, they are self-supervised, because they generate their own labels
from the training data.
2. Architecture
Let’s explore the details of the encoder, code and decoder. Both the encoder and decoder are fully-connected
feedforward neural networks, essentially the ANNs we covered in Part 1. The code is a single layer of an ANN with the
dimensionality of our choice. The number of nodes in the code layer (code size) is a hyperparameter that we set before
training the autoencoder.
This is a more detailed visualization of an autoencoder. First the input passes through the encoder, which is a fully-
connected ANN, to produce the code. The decoder, which has a similar ANN structure, then produces the output using
only the code. The goal is to get an output identical to the input. Note that the decoder architecture is the mirror
image of the encoder. This is not a requirement but it’s typically the case. The only requirement is that the
dimensionality of the input and output be the same. Anything in the middle can be played with.
There are 4 hyperparameters that we need to set before training an autoencoder:
• Code size: number of nodes in the middle layer. Smaller size results in more compression.
• Number of layers: the autoencoder can be as deep as we like. In the figure above we have 2 layers in both
the encoder and decoder, without considering the input and output.
• Number of nodes per layer: the autoencoder architecture we’re working on is called a stacked
autoencoder since the layers are stacked one after another. Usually stacked autoencoders look like a
“sandwich”. The number of nodes per layer decreases with each subsequent layer of the encoder, and
increases back in the decoder. Also the decoder is symmetric to the encoder in terms of layer structure.
As noted above this is not necessary and we have total control over these parameters.
• Loss function: we either use mean squared error (mse) or binary cross-entropy. If the input values are in
the range [0, 1] then we typically use cross-entropy, otherwise we use the mean squared error.
Autoencoders are trained the same way as ANNs, via backpropagation. Check out the introduction of Part 1 for more
details on how neural networks are trained; it applies directly to autoencoders.
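The encoder, code, decoder and loss described above can be sketched with a deliberately tiny example. This is a minimal illustration, not the architecture from the text: it assumes a linear autoencoder with a 2-number input, a 1-number code, no biases, mean squared error as the loss, and plain gradient descent. The toy data, the names `encode`, `decode`, `mse`, `train`, and the learning rate are all made up for illustration.

```python
# Toy data: 2-D points on the line x2 = 2 * x1, so one number is enough to describe each point.
data = [(x, 2.0 * x) for x in (-1.0, -0.5, 0.0, 0.5, 1.0)]

def encode(point, enc):
    """Encoder: compress a 2-D point into a single code value."""
    return enc[0] * point[0] + enc[1] * point[1]

def decode(code, dec):
    """Decoder: reconstruct a 2-D point using only the code."""
    return (dec[0] * code, dec[1] * code)

def mse(enc, dec):
    """Reconstruction loss: mean squared error between output and input."""
    total = 0.0
    for p in data:
        out = decode(encode(p, enc), dec)
        total += (out[0] - p[0]) ** 2 + (out[1] - p[1]) ** 2
    return total / len(data)

def train(steps=500, lr=0.05):
    """Full-batch gradient descent; gradients are written out by hand (backpropagation)."""
    enc, dec = [0.1, 0.1], [0.1, 0.1]
    n = len(data)
    for _ in range(steps):
        g_enc, g_dec = [0.0, 0.0], [0.0, 0.0]
        for p in data:
            code = encode(p, enc)
            out = decode(code, dec)
            err = [2.0 * (out[j] - p[j]) for j in range(2)]  # dLoss / dOutput
            g_code = err[0] * dec[0] + err[1] * dec[1]       # dLoss / dCode
            for j in range(2):
                g_dec[j] += err[j] * code / n
                g_enc[j] += g_code * p[j] / n
        enc = [enc[j] - lr * g_enc[j] for j in range(2)]
        dec = [dec[j] - lr * g_dec[j] for j in range(2)]
    return enc, dec

enc, dec = train()
print("loss before training:", mse([0.1, 0.1], [0.1, 0.1]))
print("loss after training: ", mse(enc, dec))
```

Note that the input serves as its own target, which is why no labels are needed. A practical autoencoder would add nonlinear activations and more layers, typically through a deep learning library.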
Ridge Regression:
In the ridge regression cost function, the penalty term regularizes the coefficients of the model: ridge regression
reduces the magnitudes of the coefficients, which decreases the complexity of the model.
o If λ tends to zero, the penalty vanishes and the cost function becomes that of the linear regression model. Hence,
for very small values of λ, the model will resemble the linear regression model.
o A general linear or polynomial regression will fail if there is high collinearity between the independent variables;
Ridge regression can be used to solve such problems.
o It helps to solve problems where we have more parameters than samples.
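The cost function this passage refers to is not reproduced here. Assuming the standard formulation (the symbols β, n and p are assumptions; only λ appears in the text), it reads:

$$J(\beta) = \sum_{i=1}^{n} \left(y_i - \hat{y}_i\right)^2 + \lambda \sum_{j=1}^{p} \beta_j^2$$

For a single feature and no intercept, setting the derivative to zero gives β = Σxy / (Σx² + λ). The sketch below uses that special case to show the effect of λ; the data and the name `ridge_slope` are made up for illustration.

```python
def ridge_slope(xs, ys, lam):
    """Minimizer of sum((y - b*x)^2) + lam * b^2 for one feature, no intercept."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]

# lam = 0 reproduces ordinary least squares; larger lam shrinks the slope toward 0.
for lam in (0.0, 1.0, 10.0, 100.0):
    print(lam, ridge_slope(xs, ys, lam))
```

The slope gets smaller as λ grows but never reaches exactly zero for any finite λ.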
Lasso Regression:
o Lasso regression is another regularization technique to reduce the complexity of the model. It stands for Least
Absolute Shrinkage and Selection Operator.
o It is similar to Ridge Regression except that the penalty term contains the absolute values of the weights instead
of their squares.
o Since it takes absolute values, it can shrink a slope exactly to 0, whereas Ridge Regression can only shrink it
close to 0.
o It is also called L1 regularization. Its cost function is the squared-error loss plus λ times the sum of the
absolute values of the coefficients.
o Some of the features in this technique are therefore completely neglected in the fitted model.
o Hence, Lasso regression can help us reduce overfitting in the model and also perform feature selection.
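In symbols, and assuming the standard formulation (β, n and p are assumed notation; only λ appears in the text), the Lasso cost function is:

$$J(\beta) = \sum_{i=1}^{n} \left(y_i - \hat{y}_i\right)^2 + \lambda \sum_{j=1}^{p} \left|\beta_j\right|$$

For a single feature and no intercept, the minimizer of this cost is a "soft-thresholded" least-squares slope: β = sign(Σxy) · max(|Σxy| − λ/2, 0) / Σx². The sketch below uses that special case to show a coefficient reaching exactly zero; the data and the name `lasso_slope` are made up for illustration.

```python
def lasso_slope(xs, ys, lam):
    """Minimizer of sum((y - b*x)^2) + lam * |b| for one feature, no intercept."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    shrunk = max(abs(sxy) - lam / 2.0, 0.0)  # soft-thresholding step
    return shrunk / sxx if sxy >= 0 else -(shrunk / sxx)

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]

# Once lam is large enough (lam >= 2 * |sum(x*y)|), the slope is exactly 0.
for lam in (0.0, 10.0, 100.0, 200.0):
    print(lam, lasso_slope(xs, ys, lam))
```

A coefficient of exactly zero means the corresponding feature is dropped from the model, which is how the feature selection arises.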
Describe Early stopping, Dataset augmentation, Parameter sharing and tying.
In regularization by early stopping, we stop training the model when its performance on the validation set starts
getting worse: increasing loss, decreasing accuracy, or poorer scores on the chosen metric. If we plot the error on the
training dataset and the validation dataset together, both errors decrease with the number of iterations until the point
where the model starts to overfit. After this point, the training error still decreases but the validation error increases.
So, even if training is continued after this point, early stopping returns the set of parameters that were in use
at this point, which is equivalent to stopping training there. The final parameters returned therefore give the
model lower variance and better generalization. The model at the time training is stopped will have better
generalization performance than the model with the least training error.
Early stopping can be thought of as implicit regularization, in contrast to explicit regularization via weight decay.
This method is also efficient: because training halts once validation performance stops improving, early stopping
usually requires less training time than other regularization methods, though it does need a held-out validation set.
Repeating the early stopping process many times may result in the model overfitting the validation dataset, just as
overfitting occurs on the training data.
The number of iterations (i.e. epochs) taken to train the model can be considered a hyperparameter. We then
have to find an optimum value for this hyperparameter (by hyperparameter tuning) for the best performance of the
learning model.
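The stopping rule described above can be sketched as a small function over per-epoch validation losses. This is a minimal illustration under assumptions: the `patience` parameter (how many non-improving epochs to tolerate before stopping) is a common convention that the text does not mention, and the function name and loss values are made up.

```python
def early_stopping(val_losses, patience=2):
    """Return (best_epoch, stop_epoch) for a sequence of per-epoch validation losses.

    Training stops once `patience` epochs have passed without a new best loss;
    the parameters saved at `best_epoch` are the ones that would be returned.
    """
    best_loss = float("inf")
    best_epoch = -1
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch   # new best: checkpoint here
        elif epoch - best_epoch >= patience:
            return best_epoch, epoch              # no improvement for `patience` epochs
    return best_epoch, len(val_losses) - 1        # ran out of epochs without triggering

# Validation loss falls, then rises as the model starts to overfit.
losses = [0.90, 0.70, 0.60, 0.65, 0.70, 0.80]
best_epoch, stop_epoch = early_stopping(losses, patience=2)
print("restore parameters from epoch", best_epoch, "after stopping at epoch", stop_epoch)
```

In a real training loop the losses would arrive one epoch at a time, and the model weights would be saved whenever a new best loss is seen.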
Dataset Augmentation
Data augmentation is a technique for artificially enlarging the training set by creating modified copies of existing
data. It includes making minor changes to the dataset or using deep learning to generate new data points. It is useful
when a model classifies the training data well but does not generalize to the validation data, that is, when it
overfits. For image data, we take our existing dataset and transform or alter each image in useful ways to create new
images. The newly created images are known as augmented images because they allow us to augment our dataset by
adding new data to it. The technique is useful because it lets our model look at each image in the dataset from a
variety of different perspectives. This allows the model to extract relevant features more accurately and to obtain
more feature-related information from each training image.
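As a rough illustration of creating augmented images, the sketch below applies two simple transformations to an image stored as a list of pixel rows. The specific transformations (horizontal flip, one-pixel shift), the 50% probabilities, and all function names are assumptions chosen for the example; the text does not prescribe any particular ones.

```python
import random

def horizontal_flip(image):
    """Mirror the image left to right."""
    return [row[::-1] for row in image]

def shift_right(image, pixels=1, fill=0):
    """Shift the image right by `pixels` columns (0 <= pixels <= width), padding with `fill`."""
    width = len(image[0])
    return [[fill] * pixels + row[:width - pixels] for row in image]

def augment(image, rng):
    """Return a randomly transformed copy; the original image is left untouched."""
    out = [row[:] for row in image]
    if rng.random() < 0.5:
        out = horizontal_flip(out)
    if rng.random() < 0.5:
        out = shift_right(out)
    return out

image = [[1, 2, 3],
         [4, 5, 6]]
rng = random.Random(0)  # fixed seed so the run is repeatable
augmented = [augment(image, rng) for _ in range(4)]
for img in augmented:
    print(img)
```

Each augmented copy keeps the original label, so the training set grows without any new labelling effort.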
Parameter Sharing
Parameter sharing forces sets of parameters to be equal: we interpret various models or model components as
sharing a single set of parameters. Suppose two models, A and B, perform a classification task on similar input and
output distributions. In such a case, we'd expect the parameters of the two models to be close to each other as well.
We could impose a norm penalty on the distance between the weights, but a more popular method is to force the
parameters to be equal. A significant benefit here is that we need to store only a subset of the
parameters (e.g., storing only the parameters for model A instead of storing those for both A and B), which leads to
significant memory savings.
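The two options mentioned above, a norm penalty on the distance between the weights versus forcing the parameters to be equal, can be contrasted in a short sketch. The squared L2 distance is one possible choice of penalty, assumed here; the weight values and names are made up for illustration.

```python
def distance_penalty(w_a, w_b):
    """Squared L2 distance between two weight vectors (one choice of norm penalty)."""
    return sum((a - b) ** 2 for a, b in zip(w_a, w_b))

# Soft constraint: A and B keep separate weights, and the penalty is added to the loss
# to pull them toward each other.
w_a = [0.5, -1.0, 2.0]
w_b = [0.4, -0.8, 2.5]
penalty = distance_penalty(w_a, w_b)
print("penalty:", penalty)

# Hard sharing: both models refer to the same list, so only one copy is stored
# and any update is seen by both.
shared = [0.5, -1.0, 2.0]
model_a = {"weights": shared}
model_b = {"weights": shared}
shared[0] = 0.7
print(model_a["weights"][0], model_b["weights"][0])
```

With hard sharing the stored parameter count is that of a single model, which is where the memory savings come from.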
Parameter Tying
Parameter tying is a regularization technique. We divide the parameters or weights of a machine learning model into
groups by leveraging prior knowledge, and all parameters in each group are constrained to take the same value. In
simple terms, we want to express that specific parameters should be close to each other.
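One way to picture tying is to store a single value per group and expand it to the full parameter list when needed. The sketch below assumes a hypothetical five-parameter model split into two groups. It also shows that, by the chain rule, the gradient for a tied value is the sum of the gradients of the parameters in its group. All names and numbers are illustrative.

```python
group_of = [0, 0, 1, 1, 1]   # parameter index -> group index (from prior knowledge)
group_value = [0.3, -1.2]    # one stored value per group

def expand(group_value, group_of):
    """Full parameter list in which every parameter takes its group's value."""
    return [group_value[g] for g in group_of]

def tied_gradient(grad, group_of, n_groups):
    """Gradient w.r.t. each group value: sum of the gradients of its members."""
    out = [0.0] * n_groups
    for g_i, grp in zip(grad, group_of):
        out[grp] += g_i
    return out

params = expand(group_value, group_of)
print(params)
print(tied_gradient([1.0, 2.0, 3.0, 4.0, 5.0], group_of, n_groups=2))
```

Only two numbers are learned here instead of five, which is the regularizing effect of the constraint.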
List and explain ensemble methods.
Ensemble learning is a machine learning technique that enhances accuracy and resilience in forecasting by merging
predictions from multiple models. It aims to mitigate errors or biases that may exist in individual models by leveraging
the collective intelligence of the ensemble.
Main Types of Ensemble Methods
1. Bagging
Bagging, the short form for bootstrap aggregating, is mainly applied in classification and regression. It increases the
accuracy of models, typically decision trees, by reducing variance to a large extent. The reduction of variance increases
accuracy and reduces overfitting, which is a challenge for many predictive models.
Bagging consists of two steps, i.e., bootstrapping and aggregation. Bootstrapping is a sampling technique where
samples are drawn from the whole population (set) with replacement. Sampling with replacement
helps randomize the selection procedure. The base learning algorithm is then run on each sample to complete
the procedure.
Aggregation in bagging combines the predictions of all the base models into a single outcome.
Without aggregation, predictions would be less accurate because not all outcomes are taken into consideration. The
aggregation is therefore based either on the predicted probabilities from the bootstrapped models or on all outcomes
of the predictive models, for example by voting or averaging.
Bagging is advantageous since weak base learners are combined to form a single strong learner that is more stable
than single learners. It also reduces variance, thereby reducing the overfitting of models. One limitation of
bagging is that it is computationally expensive. It can also lead to more bias in models when the proper bagging
procedure is ignored.
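The two steps, bootstrapping and aggregation, can be sketched for a small classification problem. This is an illustrative toy under assumptions: the base learner is a one-threshold decision stump on a single feature, aggregation is a majority vote, and the data, seed and function names are made up.

```python
import random

def fit_stump(xs, ys):
    """Base learner: best single-threshold classifier for 1-D inputs and 0/1 labels."""
    best = None
    for t in sorted(set(xs)):
        for above in (0, 1):  # label predicted when x > t
            errors = sum((above if x > t else 1 - above) != y for x, y in zip(xs, ys))
            if best is None or errors < best[0]:
                best = (errors, t, above)
    _, t, above = best
    return lambda x: above if x > t else 1 - above

def bagging_fit(xs, ys, n_models=25, seed=0):
    """Bootstrapping: train each base learner on a sample drawn with replacement."""
    rng = random.Random(seed)
    n = len(xs)
    models = []
    for _ in range(n_models):
        idx = rng.choices(range(n), k=n)  # n indices, sampled with replacement
        models.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return models

def bagging_predict(models, x):
    """Aggregation: majority vote over the base learners."""
    votes = sum(m(x) for m in models)
    return 1 if 2 * votes > len(models) else 0

xs = [1, 2, 3, 4, 6, 7, 8, 9]
ys = [0, 0, 0, 0, 1, 1, 1, 1]
models = bagging_fit(xs, ys)
print([bagging_predict(models, x) for x in (0, 5, 10)])
```

Each stump sees a slightly different sample and may pick a different threshold; the vote averages out those differences, which is the variance reduction described above.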
2. Boosting
Boosting is an ensemble technique that learns from previous predictor mistakes to make better predictions in the
future. The technique combines several weak base learners to form one strong learner, thus significantly improving
the predictability of models. Boosting works by arranging weak learners in a sequence, such that each weak learner
learns from the mistakes of the previous learner in the sequence to create better predictive models.
Boosting takes many forms, including gradient boosting, Adaptive Boosting (AdaBoost), and XGBoost (Extreme
Gradient Boosting). AdaBoost uses weak learners in the form of decision trees, mostly trees with a single split, which
are popularly known as decision stumps. AdaBoost’s first decision stump is trained on observations that all carry equal
weights; later stumps give more weight to the observations that earlier ones misclassified.
Gradient boosting adds predictors sequentially to the ensemble, where each new predictor corrects its predecessors,
thereby increasing the model’s accuracy. New predictors are fit to counter the effects of errors in the previous
predictors. Gradient descent helps the gradient booster identify problems in learners’ predictions and counter
them accordingly.
XGBoost makes use of gradient-boosted decision trees, providing improved speed and performance. Its design focuses
on computational speed and on the performance of the resulting model. Because model training must follow a
sequence, gradient boosted machines are otherwise slow to train.
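The sequential idea behind gradient boosting can be sketched for regression with squared error, where each new weak learner is fit to the residuals (the errors) left by the learners before it. This is a simplified illustration, not XGBoost or any library's implementation: the weak learner is a regression stump on one feature, and the learning rate `lr`, the data, and the function names are assumptions.

```python
def fit_regression_stump(xs, residuals):
    """Weak learner: one threshold, one constant prediction on each side (least squares)."""
    best = None
    for t in sorted(set(xs))[:-1]:  # thresholds that leave both sides non-empty
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    _, t, lm, rm = best
    return lambda x: lm if x <= t else rm

def gradient_boost(xs, ys, n_rounds=20, lr=0.5):
    """Add stumps one at a time, each fit to the current residuals."""
    base = sum(ys) / len(ys)  # start from the mean prediction
    preds = [base] * len(xs)
    stumps, history = [], []
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]  # errors of the ensemble so far
        stump = fit_regression_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + lr * stump(x) for p, x in zip(preds, xs)]
        history.append(sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys))
    return (lambda x: base + lr * sum(s(x) for s in stumps)), history

xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.0, 1.2, 0.9, 3.1, 3.0, 2.9, 5.2, 5.0]
predict, history = gradient_boost(xs, ys)
print("training MSE after first and last round:", history[0], history[-1])
```

For squared error the residuals are proportional to the negative gradient of the loss, which is where the method gets its name. Because each round depends on the predictions of the previous rounds, the rounds cannot be run in parallel.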
3. Stacking
Stacking, another ensemble method, is often referred to as stacked generalization. This technique works by training
an algorithm to combine the predictions of several other learning algorithms. Stacking has been successfully
implemented in regression, density estimation, distance learning, and classification. It can also be used to measure
the error rate involved during bagging.
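A minimal sketch of the idea, under assumptions: two simple base regressors (a least-squares line and a constant mean predictor) are trained on one part of the data, and a meta-learner then learns from held-out predictions how to weight them. The choice of base models, the single blending weight, the holdout split, and all names are illustrative; practical stacking usually uses cross-validated predictions and richer meta-learners.

```python
def fit_line(xs, ys):
    """Base model 1: least-squares straight line."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    return lambda x: my + slope * (x - mx)

def fit_mean(xs, ys):
    """Base model 2: always predicts the training mean."""
    m = sum(ys) / len(ys)
    return lambda x: m

def fit_blend(p1, p2, ys):
    """Meta-learner: least-squares weight w for the blend w * p1 + (1 - w) * p2."""
    num = sum((a - b) * (y - b) for a, b, y in zip(p1, p2, ys))
    den = sum((a - b) ** 2 for a, b in zip(p1, p2))
    return num / den if den else 0.5

def stack(train, holdout):
    """Fit base models on `train`, then fit the meta-learner on their `holdout` predictions."""
    xs, ys = zip(*train)
    line, mean = fit_line(xs, ys), fit_mean(xs, ys)
    hx, hy = zip(*holdout)
    w = fit_blend([line(x) for x in hx], [mean(x) for x in hx], hy)
    return lambda x: w * line(x) + (1 - w) * mean(x)

train = [(0, 1.1), (1, 2.9), (2, 5.2), (3, 6.8)]
holdout = [(4, 9.1), (5, 10.9)]
model = stack(train, holdout)
print(model(6))
```

The meta-learner sees only the base models' predictions, never the raw inputs, so it learns how much to trust each base model rather than relearning the task.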