
Unit: IV

Encoder - Decoder

 What is an encoder-decoder model?

An encoder-decoder network is an unsupervised artificial neural network that consists of an encoder component and a decoder component. The encoder takes the input and transforms it into a compressed encoding, which is handed over to the decoder. The decoder strives to reconstruct the original representation as closely as possible. Ultimately, the goal is to learn a representation (read: encoding) for a dataset. In a sense, we force the autoencoder (AE) to memorize the training data by devising a mnemonic rule of its own. As you can see in the following figure, such a network typically has a bottleneck-like shape. It starts wide, its units/connections are squeezed toward the middle, and then they fan out again. This architecture forces the AE to compress the training data’s informational content, embedding it into a low-dimensional space. By the way, you may encounter the term “latent space” for this intermediate representation space.

 Introduction to Auto-encoders

Autoencoders are a specific type of feedforward neural network where the input is the same
as the output. They compress the input into a lower-dimensional code and then reconstruct the
output from this representation. The code is a compact “summary” or “compression” of the
input, also called the latent-space representation.

An autoencoder consists of 3 components: encoder, code and decoder. The encoder compresses the input and produces the code; the decoder then reconstructs the input using only this code.
1. Encoder: A module that compresses the input data into an encoded representation that is
typically several orders of magnitude smaller than the input data.



2. Code: A module that contains the compressed knowledge representations and is therefore the
most important part of the network.
3. Decoder: A module that helps the network “decompress” the knowledge representations and reconstruct the data from its encoded form. The output is then compared with a ground truth.

To build an autoencoder we need 3 things: an encoding method, decoding method, and a loss
function to compare the output with the target.

Autoencoders are mainly a dimensionality reduction (or compression) algorithm with a couple of important properties:

 Data-specific: Autoencoders are only able to meaningfully compress data similar to what they have been trained on. Since they learn features specific to the given training data, they are different from a standard data compression algorithm like gzip. So we can’t expect an autoencoder trained on handwritten digits to compress landscape photos.

 Lossy: The output of the autoencoder will not be exactly the same as the input; it will be a close but degraded representation. If you want lossless compression, autoencoders are not the way to go.

 Unsupervised: To train an autoencoder we don’t need to do anything fancy; we just throw the raw input data at it. Autoencoders are considered an unsupervised learning technique since they don’t need explicit labels to train on. To be more precise, though, they are self-supervised, because they generate their own labels from the training data.

 Architecture of Auto-encoder

Let’s explore the details of the encoder, code and decoder. Both the encoder and decoder are fully-connected feedforward neural networks. The code is a single layer of an ANN with the dimensionality of our choice. The number of nodes in the code layer (the code size) is a hyperparameter that we set before training the autoencoder.



This is a more detailed visualization of an autoencoder. First the input passes through the encoder, which is a fully-connected ANN, to produce the code. The decoder, which has a similar ANN structure, then produces the output using only the code. The goal is to get an output identical to the input. Note that the decoder architecture is the mirror image of the encoder. This is not a requirement, but it’s typically the case. The only requirement is that the dimensionality of the input and output must be the same. Anything in the middle can be played with.

There are 4 hyperparameters that we need to set before training an autoencoder:

 Code size: number of nodes in the middle layer. Smaller size results in more
compression.

 Number of layers: the autoencoder can be as deep as we like. In the figure above we
have 2 layers in both the encoder and decoder, without considering the input and output.

 Number of nodes per layer: the autoencoder architecture we’re working on is called a stacked autoencoder, since the layers are stacked one after another. Usually stacked autoencoders look like a “sandwich”: the number of nodes per layer decreases with each subsequent layer of the encoder and increases back in the decoder. The decoder is also symmetric to the encoder in terms of layer structure. As noted above this is not necessary, and we have total control over these parameters.

 Loss function: we either use mean squared error (mse) or binary cross entropy. If the
input values are in the range [0, 1] then we typically use cross entropy, otherwise we use
the mean squared error.
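Putting these pieces together, here is a minimal sketch of such a stacked autoencoder in PyTorch. The layer sizes (784 → 128 → 32 and back) are illustrative assumptions, not values fixed by the text:

```python
import torch
import torch.nn as nn

class StackedAutoencoder(nn.Module):
    def __init__(self, input_dim=784, hidden_dim=128, code_size=32):
        super().__init__()
        # Encoder: wide input squeezed down to the code (bottleneck).
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, code_size), nn.ReLU(),
        )
        # Decoder: mirror image of the encoder, fanning back out.
        self.decoder = nn.Sequential(
            nn.Linear(code_size, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, input_dim), nn.Sigmoid(),  # inputs assumed in [0, 1]
        )

    def forward(self, x):
        code = self.encoder(x)      # compressed representation
        return self.decoder(code)   # reconstruction of the input

model = StackedAutoencoder()
criterion = nn.BCELoss()  # binary cross entropy, for inputs in [0, 1]
```

For inputs outside [0, 1], nn.MSELoss() with a linear output layer would be the usual choice instead, in line with the loss-function rule above.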



 Types of Autoencoders

1) Regularized Autoencoder

2) Stochastic Autoencoder

3) Denoising Autoencoder

4) Contractive Autoencoder

1) Regularized Autoencoder:

Undercomplete autoencoders, with code dimension less than the input dimension, can learn the most salient features of the data distribution. We have seen that these autoencoders fail to learn anything useful if the encoder and decoder are given too much capacity.

A similar problem occurs if the hidden code is allowed to have dimension equal to the input, and in the overcomplete case, in which the hidden code has dimension greater than the input. In these cases, even a linear encoder and a linear decoder can learn to copy the input to the output without learning anything useful about the data distribution.
Ideally, one could train any architecture of autoencoder successfully, choosing the code
dimension and the capacity of the encoder and decoder based on the complexity of
distribution to be modeled. Regularized autoencoders provide the ability to do so. Rather than
limiting the model capacity by keeping the encoder and decoder shallow and the code size
small, regularized autoencoders use a loss function that encourages the model to have other
properties besides the ability to copy its input to its output. These other properties include
sparsity of the representation, smallness of the derivative of the representation, and
robustness to noise or to missing inputs. A regularized autoencoder can be nonlinear and overcomplete but still learn something useful about the data distribution, even if the model capacity is great enough to learn a trivial identity function.

2) Stochastic Autoencoder:


The general strategy for designing the output units and loss function of a feedforward network is to:
• Define the output distribution p(y|x)
• Minimize the negative log-likelihood −log p(y|x)
• In this setting, y is a vector of targets such as class labels
• In an autoencoder, x is the target as well as the input
• Yet we can apply the same machinery as before, as we see next

 Loss function for Stochastic Decoder



• Given a hidden code h, we may think of the decoder as providing a conditional distribution p_decoder(x|h)
• We train the autoencoder by minimizing −log p_decoder(x|h)
• The exact form of this loss function will change depending on the form of p_decoder(x|h)
• As with feedforward networks, we use linear output units to parameterize the mean of a Gaussian distribution if x is real-valued
• In this case the negative log-likelihood yields the mean-squared error
• Binary x values correspond to a Bernoulli distribution with parameters given by a sigmoid
• Discrete x values correspond to a softmax
• The output variables are treated as being conditionally independent given h

 Stochastic encoder

• We can also generalize the notion of an encoding function f(x) to an encoding distribution p_encoder(h|x)

 Structure of stochastic autoencoder

• Both the encoder and decoder are not simple functions but involve a distribution

• The output is sampled from a distribution p_encoder(h|x) for the encoder and p_decoder(x|h) for the decoder
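A minimal sketch of the stochastic-encoder idea, where the encoder outputs the parameters (mean and log-variance) of a Gaussian p_encoder(h|x) and the code h is sampled from it rather than computed deterministically. The dimensions and the reparameterized sampling are illustrative assumptions:

```python
import torch
import torch.nn as nn

class StochasticEncoder(nn.Module):
    def __init__(self, input_dim=784, code_size=32):
        super().__init__()
        self.mu = nn.Linear(input_dim, code_size)       # mean of p_encoder(h|x)
        self.log_var = nn.Linear(input_dim, code_size)  # log-variance of p_encoder(h|x)

    def forward(self, x):
        mu, log_var = self.mu(x), self.log_var(x)
        std = torch.exp(0.5 * log_var)
        eps = torch.randn_like(std)   # standard Gaussian noise
        return mu + eps * std         # a sample h ~ p_encoder(h|x)
```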



3) Denoising Autoencoders:
Denoising autoencoders, as the name suggests, are autoencoders that remove noise from an image.
As opposed to the autoencoders we’ve already covered, this is the first of its kind that does not have the input image as its ground truth.
In denoising autoencoders, we feed in a noisy version of the image, where noise has been added via digital alterations. The noisy image is fed to the encoder-decoder architecture, and the output is compared with the clean ground truth image.

The denoising autoencoder gets rid of noise by learning a representation of the input where the
noise can be filtered out easily.
While removing noise directly from the image seems difficult, the autoencoder performs this by mapping the input data into a lower-dimensional manifold (as in undercomplete autoencoders), where filtering out the noise becomes much easier.
Essentially, denoising autoencoders work with the help of non-linear dimensionality reduction.
The loss function generally used in these types of networks is L2 or L1 loss.
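A minimal training-step sketch of this setup, reusing the StackedAutoencoder sketched earlier: Gaussian noise corrupts the input, but the loss is computed against the clean original. The noise level 0.2 is an illustrative assumption:

```python
import torch
import torch.nn as nn

def denoising_step(model, x_clean, optimizer, noise_std=0.2):
    # Corrupt the input, but keep the clean image as the ground truth.
    x_noisy = (x_clean + noise_std * torch.randn_like(x_clean)).clamp(0.0, 1.0)
    x_hat = model(x_noisy)
    loss = nn.functional.mse_loss(x_hat, x_clean)  # L2 reconstruction loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```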

4) Contractive Autoencoders:
Similar to other autoencoders, contractive autoencoders perform the task of learning a representation of the image while passing it through a bottleneck and reconstructing it in the decoder.
The contractive autoencoder additionally has a regularization term to prevent the network from learning the identity function and simply mapping the input to the output.
Contractive autoencoders work on the principle that similar inputs should have similar encodings and a similar latent-space representation. That is, the latent space should not vary by a large amount for minor variations in the input.



To train a model that respects this constraint, we have to ensure that the derivatives of the hidden layer activations are small with respect to the input.
An important thing to note about the loss function (formed from the norm of the derivatives and the reconstruction loss) is that the two terms contradict each other.
While the reconstruction loss wants the model to tell two inputs apart and observe variations in the data, the Frobenius norm of the derivatives says that the model should be able to ignore variations in the input data.
Putting these two contradictory conditions into one loss function enables us to train a network where the hidden layers now capture only the most essential information. This information is necessary to separate images and ignore information that is non-discriminatory in nature, and therefore, not important.
The total loss function can be expressed as L(x, x̂) = ||x − x̂||² + λ ||∂h/∂x||²_F.
The second term is the squared Frobenius norm of the Jacobian of the hidden activations h with respect to the input x: the gradient is summed over all training samples and its Frobenius norm is taken, with λ balancing the two terms.
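As a sketch, this penalty can be computed in closed form for a single sigmoid hidden layer h = sigmoid(Wx + b), whose Jacobian is diag(h(1−h))·W. The weighting lam is an illustrative assumption:

```python
import torch
import torch.nn as nn

def contractive_loss(x, x_hat, h, W, lam=1e-4):
    # Reconstruction term.
    recon = nn.functional.mse_loss(x_hat, x)
    # ||dh/dx||_F^2 for a sigmoid layer: sum_j h_j^2 (1 - h_j)^2 * sum_i W_ji^2.
    dh = h * (1 - h)                   # sigmoid derivative, shape (batch, hidden)
    w_sq = (W ** 2).sum(dim=1)         # per-hidden-unit row norms, shape (hidden,)
    penalty = (dh ** 2 * w_sq).sum(dim=1).mean()
    return recon + lam * penalty
```

Here W would be the weight matrix of the encoder’s hidden layer (e.g. a nn.Linear layer’s .weight).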

 Applications of autoencoders
Now that you understand various types of autoencoders, let’s summarize some of their most
common use cases.

1. Dimensionality reduction
Undercomplete autoencoders are those that are used for dimensionality reduction.



These can be used as a pre-processing step for dimensionality reduction as they can perform fast
and accurate dimensionality reductions without losing much information.
Furthermore, while dimensionality reduction procedures like PCA can only perform linear
dimensionality reductions, undercomplete autoencoders can perform large-scale non-linear
dimensionality reductions.

Suppose the input and output have 3000 dimensions and the desired reduced dimension is 200. We can build a 5-layer network where the encoder has layers of 3000 and 1500 neurons leading into a 200-unit code, with the decoder mirroring this structure.

The vector embeddings of the compressed input layer can be considered as a reduced
dimensional embedding of the input layer.


2. Image denoising
Autoencoders like the denoising autoencoder can be used for performing efficient and highly
accurate image denoising.



Unlike traditional methods of denoising, autoencoders do not search for noise; they extract the image from the noisy data fed to them by learning a representation of it. The representation is then decompressed to form a noise-free image.

The noisy image is fed into the autoencoder as input, and the noiseless output is reconstructed by minimizing the reconstruction loss against the original (noiseless) target. Once the autoencoder weights are trained, they can be used to denoise raw images.

3. Generation of image and time series data


Variational Autoencoders can be used to generate both image and time series data.
The parameterized distribution at the bottleneck of the autoencoder can be randomly sampled to generate discrete values for latent attributes, which can then be forwarded to the decoder, leading to the generation of image data. VAEs can also be used to model time series data like music.

4. Anomaly Detection
Undercomplete autoencoders can also be used for anomaly detection.



For example, consider an autoencoder that has been trained on a specific dataset P. For any image sampled from the training dataset, the autoencoder is bound to give a low reconstruction loss and is supposed to reconstruct the image as is.
For any image which is not present in the training dataset, however, the autoencoder cannot perform the reconstruction, as the latent attributes are not adapted for a specific image that has never been seen by the network.
As a result, the outlier image gives a very high reconstruction loss and can easily be identified as an anomaly with the help of a proper threshold.
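A minimal sketch of this thresholding, reusing a trained autoencoder model like the one sketched earlier; x_train and x_new are placeholder tensors, and the 99th-percentile threshold is an illustrative assumption:

```python
import torch

@torch.no_grad()
def reconstruction_errors(model, x):
    # Per-sample mean squared reconstruction error.
    x_hat = model(x)
    return ((x - x_hat) ** 2).mean(dim=1)

# Calibrate the threshold on (normal) training data, then flag outliers.
train_err = reconstruction_errors(model, x_train)
threshold = torch.quantile(train_err, 0.99)
is_anomaly = reconstruction_errors(model, x_new) > threshold
```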

5. Missing Value Imputation:

Denoising autoencoders can be used to impute missing values in a dataset. The idea is to train the autoencoder network by randomly placing missing values in the input data and trying to reconstruct the original raw data by minimizing the reconstruction loss.

Once the autoencoder weights are trained, records having missing values can be passed through the autoencoder network to reconstruct the input data, this time with the missing features imputed.
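A minimal sketch of this training scheme, reusing the autoencoder above. Zero-filling missing entries and the 20% drop probability are illustrative assumptions:

```python
import torch
import torch.nn as nn

def imputation_step(model, x_full, optimizer, drop_prob=0.2):
    # Randomly "delete" entries to simulate missing values (zero-filled here).
    mask = (torch.rand_like(x_full) > drop_prob).float()
    x_hat = model(x_full * mask)
    loss = nn.functional.mse_loss(x_hat, x_full)  # reconstruct the complete record
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# After training, a zero-filled incomplete record can be passed through:
# imputed = model(x_incomplete)
```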

 Dimensionality Reduction and Classification using Auto encoders

Dimensionality Reduction is the process of reducing the number of dimensions in the data, either by excluding less useful features (Feature Selection) or by transforming the data into lower dimensions (Feature Extraction). Dimensionality reduction helps prevent overfitting. Overfitting is a phenomenon in which the model learns too well from the training dataset and fails to generalize well to unseen real-world data.



Types of Feature Selection for Dimensionality Reduction,

 Recursive Feature Elimination
 Genetic Feature Selection
 Sequential Forward Selection

Types of Feature Extraction for Dimensionality Reduction,

 AutoEncoders
 Principal Component Analysis (PCA)
 Linear Discriminant Analysis (LDA)

Let us now look at AutoEncoders for dimensionality reduction in detail.

AutoEncoders
AutoEncoder is an unsupervised Artificial Neural Network that attempts to encode the data
by compressing it into the lower dimensions (bottleneck layer or code) and then decoding the
data to reconstruct the original input. The bottleneck layer (or code) holds the compressed
representation of the input data.

In an AutoEncoder the number of output units must be equal to the number of input units, since we’re attempting to reconstruct the input data. AutoEncoders usually consist of an encoder and a decoder. The encoder encodes the provided data into a lower dimension, which is the size of the bottleneck layer, and the decoder decodes the compressed data back into its original form.

The number of neurons in the layers of the encoder will be decreasing as we move on with
further layers, whereas the number of neurons in the layers of the decoder will be increasing
as we move on with further layers. There are three layers used in the encoder and decoder in
the following example. The encoder contains 32, 16, and 7 units in each layer respectively
and the decoder contains 7, 16, and 32 units in each layer respectively. The code size/ the
number of neurons in bottle-neck must be less than the number of features in the data.

Before feeding the data into the AutoEncoder, the data must be scaled between 0 and 1 using MinMaxScaler, since we are going to use the sigmoid activation function in the output layer, which outputs values between 0 and 1.

When we use AutoEncoders for dimensionality reduction, we extract the bottleneck layer and use it to reduce the dimensions. This process can be viewed as feature extraction.
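A minimal sketch of this pipeline with the 32/16/7 layer sizes mentioned above; X is a placeholder NumPy feature matrix, and the training loop is omitted:

```python
import torch
import torch.nn as nn
from sklearn.preprocessing import MinMaxScaler

# Scale features to [0, 1] to match the sigmoid output layer.
scaler = MinMaxScaler()
X_scaled = torch.tensor(scaler.fit_transform(X), dtype=torch.float32)

n_features = X_scaled.shape[1]
encoder = nn.Sequential(          # 32 -> 16 -> 7 units, as described above
    nn.Linear(n_features, 32), nn.ReLU(),
    nn.Linear(32, 16), nn.ReLU(),
    nn.Linear(16, 7), nn.ReLU(),
)
decoder = nn.Sequential(          # 7 -> 16 -> 32 -> n_features
    nn.Linear(7, 16), nn.ReLU(),
    nn.Linear(16, 32), nn.ReLU(),
    nn.Linear(32, n_features), nn.Sigmoid(),
)

# ... train encoder + decoder to reconstruct X_scaled ...

# For dimensionality reduction, keep only the encoder (feature extraction):
with torch.no_grad():
    X_reduced = encoder(X_scaled)  # 7-dimensional embeddings
```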



The type of AutoEncoder that we’re using is a Deep AutoEncoder, where the encoder and the decoder are symmetrical. Autoencoders don’t necessarily need this symmetry; a non-symmetrical encoder and decoder work as well.

Types of AutoEncoders available are,

 Deep Autoencoder
 Sparse Autoencoder
 Undercomplete Autoencoder
 Variational Autoencoder
 LSTM Autoencoder

Hyperparameters of an AutoEncoder

 Code size or the number of units in the bottleneck layer
 Input and output size, which is the number of features in the data
 Number of neurons or nodes per layer
 Number of layers in encoder and decoder
 Activation function
 Optimization function

 Optimization for Deep Learning

Deep learning is the subfield of machine learning which is used to perform complex
tasks such as speech recognition, text classification, etc. A deep learning model consists of
activation function, input, output, hidden layers, loss function, etc. Any deep learning model
tries to generalize the data using an algorithm and to make predictions on unseen data. We need an algorithm that maps examples of inputs to outputs, and an optimization algorithm. An optimization algorithm finds the values of the parameters (weights)



that minimize the error when mapping inputs to outputs. These optimization algorithms or optimizers widely affect the accuracy of a deep learning model. They also affect the training speed of the model. But first of all, the question arises: what is an optimizer, really?

While training the deep learning model, we need to modify the weights at each epoch and minimize the loss function. An optimizer is a function or an algorithm that modifies the attributes of the neural network, such as weights and learning rate. Thus, it helps in reducing the overall loss and improving the accuracy. The problem of choosing the right weights for the
model is a daunting task, as a deep learning model generally consists of millions of
parameters. It raises the need to choose a suitable optimization algorithm for your
application. Hence understanding these algorithms is necessary before having a deep dive
into the field.

You can use different optimizers to make changes in your weights and learning rate.
However, choosing the best optimizer depends upon the application. As a beginner, one tempting thought is to try all the possibilities and choose the one that shows the best results. This might not be a problem initially, but when dealing with hundreds of gigabytes of data, even a single epoch can take a considerable amount of time. So randomly choosing an algorithm is no less than gambling with your precious time, as you will realize sooner or later in your journey.

In this guide, we will learn about different optimizers used in building a deep learning model
and the factors that could make you choose an optimizer instead of others for your
application.

We are going to learn the following optimizers in this guide and look at the pros and cons of
each algorithm. So that, at the end of the article, you will be able to compare various
optimizers and the procedure they are based upon.

1. Gradient Descent
2. Stochastic Gradient Descent
3. Stochastic Gradient descent with momentum
4. Mini-Batch Gradient Descent
5. Adagrad
6. RMSProp
7. AdaDelta
8. Adam

Before proceeding, there are a few terms that you should be familiar with.

Epoch – The number of times the algorithm runs on the whole training dataset.

Sample – A single row of a dataset.

Batch – It denotes the number of samples taken for updating the model parameters.



Learning rate – It is a parameter that provides the model a scale of how much model
weights should be updated.

Cost Function/Loss Function – A cost function is used to calculate the cost, which is the difference between the predicted value and the actual value.

Weights/Bias – The learnable parameters in a model that control the signal between two neurons.

 Gradient Descent Deep Learning Optimizer

Gradient Descent can be considered the popular kid among the class of optimizers. This optimization algorithm uses calculus to modify the values consistently and to achieve the local minimum. Before moving ahead, you might ask: what is a gradient?

In simple terms, imagine you are holding a ball resting at the top of a bowl. When you release the ball, it rolls along the steepest direction and eventually settles at the bottom of the bowl. The gradient points along the steepest direction, guiding the ball to the local minimum, that is, the bottom of the bowl.

The update rule can be written as w = w − α · ∇J(w), where ∇J(w) is the gradient of the cost function with respect to the weights. Here alpha (α) is the step size, which represents how far to move against each gradient with each iteration.

Gradient descent works as follows:

1. It starts with some coefficients, checks their cost, and searches for a cost value lower than the current one.
2. It moves towards the lower cost and updates the values of the coefficients.
3. The process repeats until the local minimum is reached. A local minimum is a point beyond which it cannot proceed.
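A minimal NumPy sketch of these steps, fitting a linear model by full-batch gradient descent on mean squared error. The synthetic data, step size, and iteration count are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))              # 100 samples, 3 features
y = X @ np.array([2.0, -1.0, 0.5])         # synthetic targets
w = np.zeros(3)                            # initial coefficients
alpha = 0.1                                # step size

for _ in range(200):
    grad = 2 * X.T @ (X @ w - y) / len(X)  # gradient of the MSE cost
    w -= alpha * grad                      # move against the gradient

print(w)  # approaches the true coefficients [2.0, -1.0, 0.5]
```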

Gradient descent works best for most purposes. However, it has some downsides too. It is
expensive to calculate the gradients if the size of the data is huge. Gradient descent works
well for convex functions but it doesn’t know how far to travel along the gradient for
nonconvex functions.

 Mini Batch Gradient Descent Deep Learning Optimizer

In this variant of gradient descent, instead of taking all the training data, only a subset of the dataset is used for calculating the loss function. Since we are using a batch of data instead of the whole dataset, fewer iterations are needed. That is why the mini-batch gradient descent algorithm is faster than both stochastic gradient descent and batch gradient descent. This algorithm is more efficient and robust than the earlier variants of gradient descent. As the algorithm uses batching, all the training data need not be loaded in memory, making the process more efficient to implement. Moreover, the cost function in mini-batch gradient descent is noisier than in batch gradient descent but smoother than in stochastic gradient descent. Because of this, mini-batch gradient descent is ideal and provides a good balance between speed and accuracy.

Despite all that, the mini-batch gradient descent algorithm has some downsides too. It needs an extra hyperparameter, the mini-batch size, which must be tuned to achieve the required accuracy, although a batch size of 32 is considered appropriate for almost every case. Also, in some cases it results in poor final accuracy. This motivates the search for other alternatives.
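A minimal sketch of the batching loop, reusing the linear-model setup above. The batch size of 32 follows the rule of thumb just mentioned; the epoch count is an illustrative assumption:

```python
import numpy as np

def minibatch_gd(X, y, w, alpha=0.1, batch_size=32, epochs=10):
    n = len(X)
    for _ in range(epochs):
        idx = np.random.permutation(n)     # shuffle once per epoch
        for start in range(0, n, batch_size):
            batch = idx[start:start + batch_size]
            Xb, yb = X[batch], y[batch]
            # Gradient estimated on the mini-batch only.
            grad = 2 * Xb.T @ (Xb @ w - yb) / len(Xb)
            w -= alpha * grad
    return w
```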

 Adagrad (Adaptive Gradient Descent) Deep Learning Optimizer

The adaptive gradient descent algorithm is slightly different from other gradient descent algorithms, because it uses a different learning rate for each iteration. The change in learning rate depends upon how much the parameters change during training: the more a parameter gets updated, the smaller its subsequent learning rate becomes. This modification is highly beneficial because real-world datasets contain sparse as well as dense features, so it is unfair to use the same learning rate for all the features. The Adagrad update can be written as w_t = w_{t−1} − α_t · g_t, with the per-iteration learning rate α_t = η / √(Σ g_i² + ε), where the sum runs over all past squared gradients. Here α_t denotes the different learning rate at each iteration, η is a constant, and ε is a small positive value that avoids division by 0.



The benefit of using Adagrad is that it abolishes the need to modify the learning rate
manually. It is more reliable than gradient descent algorithms and their variants, and it
reaches convergence at a higher speed.

One downside of AdaGrad optimizer is that it decreases the learning rate aggressively and
monotonically. There might be a point when the learning rate becomes extremely small. This
is because the squared gradients in the denominator keep accumulating, and thus the
denominator part keeps on increasing. Due to small learning rates, the model eventually
becomes unable to acquire more knowledge, and hence the accuracy of the model is
compromised.
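A minimal per-parameter sketch of the update; the accumulator G is the ever-growing denominator term described above (the eta and eps values are common defaults, used here as illustrative assumptions):

```python
import numpy as np

def adagrad_update(w, grad, G, eta=0.01, eps=1e-8):
    G += grad ** 2                       # accumulated squared gradients only grow
    w -= eta / np.sqrt(G + eps) * grad   # so each parameter's learning rate shrinks
    return w, G

w = np.zeros(3)
G = np.zeros(3)  # one accumulator per parameter
```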

 Stochastic Gradient Descent Deep Learning Optimizer

At the end of the previous section, you learned why using gradient descent on massive data
might not be the best option. To tackle this problem, we have stochastic gradient descent. The term stochastic refers to the randomness the algorithm is based on. In stochastic gradient descent, instead of taking the whole dataset for each iteration, we randomly select batches of data; that is, we take only a few samples from the dataset.

The procedure is first to select the initial parameters w and the learning rate η. Then we randomly shuffle the data at each iteration to reach an approximate minimum.

Since we are not using the whole dataset but batches of it for each iteration, the path taken by the algorithm is full of noise compared to the gradient descent algorithm. Thus, SGD uses a higher number of iterations to reach the local minimum. Due to the increase in the number of iterations, the overall computation time increases. But even after increasing the number of iterations, the computation cost is still less than that of the gradient descent optimizer. So the conclusion is that if the data is enormous and computational time is an essential factor, stochastic gradient descent should be preferred over the batch gradient descent algorithm.

 Stochastic Gradient Descent with Momentum Deep Learning Optimizer
As discussed in the earlier section, stochastic gradient descent takes a much noisier path than the gradient descent algorithm. For this reason, it requires a significantly larger number of iterations to reach the optimal minimum, and hence computation time is high. To overcome this problem, we use stochastic gradient descent with momentum.



Momentum helps the loss function converge faster. Stochastic gradient descent oscillates between either direction of the gradient and updates the weights accordingly. Adding a fraction of the previous update to the current update makes the process a bit faster. One thing that should be remembered while using this algorithm is that the learning rate should be decreased when using a high momentum term.

In the above image, the left part shows the convergence graph of the stochastic gradient descent algorithm, while the right side shows SGD with momentum. From the image, you can compare the paths chosen by the two algorithms and see that using momentum helps reach convergence in less time. You might be thinking of using a large momentum and learning rate to make the process even faster. But remember that while increasing the momentum, the possibility of overshooting the optimal minimum also increases. This might result in poor accuracy and even more oscillations.
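A minimal sketch of the momentum update, in which a fraction beta of the previous update is folded into the current one (the alpha and beta values are illustrative assumptions; note the advice above about lowering the learning rate when momentum is high):

```python
import numpy as np

def momentum_update(w, grad, v, alpha=0.01, beta=0.9):
    v = beta * v + grad   # carry over a fraction of the previous update
    w -= alpha * v
    return w, v

w, v = np.zeros(3), np.zeros(3)  # v is the "velocity" kept between iterations
```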

 RMSProp (Root Mean Square) Deep Learning Optimizer

RMSProp is one of the popular optimizers among deep learning enthusiasts. This is perhaps because it was never formally published but is still very well known in the community. RMSProp is essentially an extension of the earlier Rprop algorithm. Rprop resolves the problem of varying gradients: some gradients are small while others may be huge, so defining a single learning rate might not be the best idea. Rprop uses the sign of the gradient, adapting the step size individually for each weight. In this algorithm, the two most recent gradients are first compared for signs. If they have the same sign, we’re going in the right direction and hence increase the step size by a small fraction. If they have opposite signs, we have to decrease the step size. Then we limit the step size, and now we can go for the weight update.

The problem with Rprop is that it doesn’t work well with large datasets or when we want to perform mini-batch updates. So, achieving the robustness of Rprop and the efficiency of mini-batches at the same time was the main motivation behind RMSProp. RMSProp can also be considered an advancement over the AdaGrad optimizer, as it remedies AdaGrad’s monotonically decreasing learning rate.



The algorithm mainly focuses on accelerating the optimization process by decreasing the number of function evaluations needed to reach the local minimum. The algorithm keeps a moving average of the squared gradients for every weight and divides the gradient by the square root of this mean square: E[g²]_t = γ · E[g²]_{t−1} + (1 − γ) · g_t², where gamma (γ) is the forgetting factor. Weights are then updated by w_t = w_{t−1} − (η / √(E[g²]_t + ε)) · g_t.

In simpler terms, if there exists a parameter due to which the cost function oscillates a lot, we want to penalize the update of this parameter. Suppose you built a model to classify a variety of fishes, and the model relies mainly on the feature ‘color’ to differentiate between the fishes, because of which it makes a lot of errors. What RMSProp does is penalize the parameter ‘color’ so that the model can rely on other features too. This prevents the algorithm from adapting too quickly to changes in the parameter ‘color’ compared to other parameters. This algorithm has several benefits compared to earlier versions of gradient descent algorithms: it converges quickly and requires less tuning than gradient descent algorithms and their variants.

The problem with RMSProp is that the learning rate has to be defined manually, and the suggested value doesn’t work for every application.
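A minimal sketch of the two formulas above (gamma = 0.9 and eta = 0.001 are common defaults, used here as illustrative assumptions):

```python
import numpy as np

def rmsprop_update(w, grad, avg_sq, eta=0.001, gamma=0.9, eps=1e-8):
    # Moving average of squared gradients with forgetting factor gamma.
    avg_sq = gamma * avg_sq + (1 - gamma) * grad ** 2
    # Divide the gradient by the root of the mean square.
    w -= eta / np.sqrt(avg_sq + eps) * grad
    return w, avg_sq
```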

 Application case study – Image dimensionality

The dimensionality of a dataset refers to its features, attributes, or variables. When we are talking about tabular data residing in spreadsheets, the dimensions are defined as columns. For an RGB image, the dimensions are the color intensity values of each pixel.

To understand the usefulness of dimensionality reduction, consider a dataset that contains images of the letter A (Figure 1), which has been scaled and rotated with varying intensity. Each image has 32×32 pixels, i.e., 32×32 = 1024 dimensions.

However, the actual dataset's dimensions are only two because only two variables
(rotation and scale) have been used to produce the data. Successful Dimensionality
Reduction will discard the correlated information (letter A) and recover only the variable
information (rotation and scale).



It is not uncommon for modern datasets to have thousands, if not millions, of
dimensions. Usually, data with this many dimensions are very sparse, highly correlated,
and therefore quite redundant. Dimensionality reduction aims to keep the essence of the
data in a few representative variables. This helps make the data more intuitive both for
us data scientists and for the machines.

Dimensionality reduction reduces the number of dimensions (also called features and
attributes) of a dataset. It is used to remove redundancy and help both data scientists and
machines extract useful patterns.

 The Curse of Dimensionality

Figure 2: "...as the dimensionality of the data increases you need a much higher number of
samples to sufficiently represent the feature space"

The "curse of dimensionality" manifests itself in a multitude of ways in data science:

1. Distance: distance is not a good measure of distinction in high dimensions. With high dimensionality, most data points end up far from each other. Therefore, many machine learning algorithms that rely on distance functions to measure data similarity cannot deal with high-dimensional data. This includes algorithms like K-Means and K-Nearest Neighbors (KNN).
2. Need for more data : Each feature increases the amount of data needed for estimation.
A good rule of thumb is to have at least 10 examples for each feature in your dataset. For
example, if your data has 100 features, you should have at least 1000 examples before
searching for patterns.
3. Overfitting : A learning algorithm may find spurious correlations between the features
and the target when you use too many features. Overfit models spuriously match the
training data and do not generalize well to test data.
4. Combinatorial explosion : The search space becomes exponentially larger when you
add more dimensions. Imagine the simple case where you only have binary features,
meaning they only take the values of 0 and 1.
1 feature has 2 possible values

2 features have 2^2 = 4 possible values

...

1000 features have 2^1000 ≈ 10^300 possible values.

By comparison, there are only about 10^82 atoms in the entire universe (according to Universe Today). In other words, our dataset contains far more possible values than there are atoms in the universe. Now consider that 1000 features are not much by today's standards and that binary features are an extreme oversimplification.
This combinatorial explosion poses two problems for data science:

1. Quality of models. Traditional data science methods do not work well in such vast search
spaces.
2. Computability. It requires huge amounts of computation to find patterns in high dimensions.

High dimensional data points are all far away from each other.

When you have hundreds or thousands of dimensions, points end up near the edges. It is
counterintuitive in 2 or 3 dimensions where our brains live most of the time, but in a high-
dimensional space, the idea of closeness as a measure of similarity breaks down.

Let's gain this intuition! Think about the probability of a point ending up at the edge of a d-
dimensional unit hypercube. That is the cube with d dimensions with each edge having unit
length (length = 1). We’ll use the letter e to represent the degree of error (i.e., the distance from the edge) we are willing to tolerate. When any of the point's dimensions is "close" to 0 or 1, the point lies at an edge. According to Figure 2, the probability of a single dimension not ending up close to 0 or 1 is 1 − 2e.
The probability of all d dimensions not ending up close to 0 or 1 (within epsilon distance of 0 or 1) is (1 − 2e)^d. Let's see what this means for a small e = 0.001 and a varying number of dimensions.

Dimensions (d)    Probability (1 − 2·0.001)^d

1                 0.998

2                 0.998^2 = 0.996

100               0.998^100 ≈ 0.819

1000              0.998^1000 ≈ 0.135

2000              0.998^2000 ≈ 0.018

Figure 3: Probability of a dimension (not) ending up "close" (epsilon difference) to 0 or 1.
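The table can be reproduced with a few lines of Python:

```python
e = 0.001  # tolerated distance from the edge
for d in (1, 2, 100, 1000, 2000):
    # Probability that none of the d coordinates falls within e of 0 or 1.
    print(d, round((1 - 2 * e) ** d, 3))
```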

 When to use Dimensionality Reduction

Guided by the problems resulting from high dimensionality, we find the following
reasons to use Dimensionality Reduction:

1. Interpretability. It is not easy for a human to interpret results when dealing with many dimensions. Imagine looking at a spreadsheet with a thousand columns to find differences between different rows. In medical or governmental applications, it can be a requirement for the results to be interpretable.
2. Visualization. Somewhat related to the previous point, you might want to project your data to 2D or 3D to produce a visual representation. After all, an image is worth a thousand words.
3. Reduce overfitting. A machine learning model can use each one of the dimensions as a feature to distinguish between examples. Reducing the dimensionality is a remedy against the Hughes phenomenon, which tells us that more features lead to worse model quality beyond a certain point.
4. Scalability. Many algorithms do not scale well when the number of dimensions increases.
Dimensionality Reduction can be used to ease the computations and find solutions in
reasonable amounts of time.
