
UNIT – IV

Syllabus: Autoencoders: undercomplete autoencoders, regularized autoencoders, stochastic
encoders and decoders. Deep generative models: Boltzmann machines, restricted Boltzmann
machines, deep belief networks, deep Boltzmann machines for real-world data

1. Autoencoders:
An autoencoder is a neural network that is trained to attempt to copy its input to its output.
Internally, it has a hidden layer h that describes a code used to represent the input. The
network may be viewed as consisting of two parts: an encoder function h = f (x) and a
decoder that produces a reconstruction r = g(h). Autoencoders are designed to be unable to
learn to copy perfectly. Usually, they are restricted in ways that allow them to copy only
approximately, and to copy only input that resembles the training data.
Autoencoders may be thought of as being a special case of feedforward networks, and may be
trained with all of the same techniques, typically minibatch gradient descent following
gradients computed by back-propagation. Autoencoders may also be trained using
recirculation, a learning algorithm based on comparing the activations of the network on the
original input to the activations on the reconstructed input.
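A minimal sketch of this encoder/decoder structure in PyTorch is shown below; the layer sizes (a hypothetical 784-dimensional input, such as a flattened 28x28 image, and a 32-dimensional code) are illustrative choices, not prescribed by the theory.

import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    """Encoder f maps input x to code h; decoder g maps h back to a reconstruction r."""
    def __init__(self, input_dim=784, code_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(input_dim, 128), nn.ReLU(),
                                     nn.Linear(128, code_dim))
        self.decoder = nn.Sequential(nn.Linear(code_dim, 128), nn.ReLU(),
                                     nn.Linear(128, input_dim), nn.Sigmoid())

    def forward(self, x):
        h = self.encoder(x)   # h = f(x)
        r = self.decoder(h)   # r = g(h)
        return r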

2. Undercomplete Autoencoders:


One way to obtain useful features from the autoencoder is to constrain h to have a smaller
dimension than x. An autoencoder whose code dimension is less than the input dimension is called
undercomplete. Learning an undercomplete representation forces the autoencoder to capture
the most salient features of the training data. The learning process is described simply as
minimizing a loss function L(x, g(f(x))), where L is a loss function penalizing g(f (x)) for
being dissimilar from x, such as the mean squared error.

When the decoder is linear and L is the mean squared error, an undercomplete autoencoder
learns to span the same subspace as PCA. Autoencoders with nonlinear encoder functions f
and nonlinear decoder functions g can thus learn a more powerful nonlinear generalization of
PCA.
If the encoder and decoder are allowed too much capacity, the autoencoder can learn to
perform the copying task without extracting useful information about the distribution of the
data. Theoretically, one could imagine that an autoencoder with a one-dimensional code but a
very powerful nonlinear encoder could learn to represent each training example x(i) with the
code i. The decoder could learn to map these integer indices back to the values of specific
training examples.
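A minimal training loop for such an undercomplete autoencoder might look like the following sketch; it assumes the Autoencoder class from the previous section and a hypothetical DataLoader named train_loader that yields batches of (image, label) pairs.

import torch
import torch.nn as nn

model = Autoencoder(input_dim=784, code_dim=32)   # code_dim < input_dim => undercomplete
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
mse = nn.MSELoss()

for epoch in range(10):
    for x, _ in train_loader:        # labels are ignored; x is the target as well as the input
        x = x.view(x.size(0), -1)    # flatten each image to a 784-dimensional vector
        r = model(x)                 # r = g(f(x))
        loss = mse(r, x)             # L(x, g(f(x)))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()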

3. Regularized Autoencoders:

A similar problem occurs if the hidden code is allowed to have dimension equal to the input, and
in the overcomplete case, in which the hidden code has dimension greater than the input. In these
cases, even a linear encoder and a linear decoder can learn to copy the input to the output
without learning anything useful about the data distribution.

Ideally, one could train any architecture of autoencoder successfully, choosing the code dimension
and the capacity of the encoder and decoder based on the complexity of the distribution to be
modelled. Regularized autoencoders provide the ability to do so. Regularized autoencoders
use a loss function that encourages the model to have other properties besides the ability to
copy its input to its output. A regularized autoencoder can be nonlinear and overcomplete but
still learn something useful about the data distribution even if the model capacity is great
enough to learn a trivial identity function.

3.1. Sparse Autoencoders


A sparse autoencoder is simply an autoencoder whose training criterion involves a sparsity
penalty Ω(h) on the code layer h, in addition to the reconstruction error:

L(x, g(f(x))) + Ω(h)

where g(h) is the decoder output and typically we have h = f (x), the encoder output.

Sparse autoencoders are typically used to learn features for another task, such as classification.
An autoencoder that has been regularized to be sparse must respond to unique statistical
features of the dataset it has been trained on, rather than simply acting as an identity function.
We can think of the penalty Ω(h) simply as a regularizer term added to a feedforward network
whose primary task is to copy the input to the output and possibly also perform some
supervised task that depends on these sparse features.
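As a concrete (and common, though not the only) choice, Ω(h) can be the L1 norm of the code. The sketch below replaces the loss computation in the earlier training loop; lam is an illustrative sparsity weight.

h = model.encoder(x)                                 # code h = f(x)
r = model.decoder(h)                                 # reconstruction g(f(x))
lam = 1e-3                                           # sparsity weight (illustrative value)
loss = mse(r, x) + lam * h.abs().sum(dim=1).mean()   # L(x, g(f(x))) + Omega(h)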

Rather than thinking of the sparsity penalty as a regularizer for the copying task, we can think
of the entire sparse autoencoder framework as approximating maximum likelihood training of
a generative model that has latent variables. Suppose we have a model with visible variables
x and latent variables h, with an explicit joint distribution

p-model(x, h) = p-model(h) p-model(x | h).

We refer to p-model(h) as the model’s prior distribution over the latent variables, representing
the model’s beliefs prior to seeing x. The likelihood can be decomposed as

log p-model(x) = log Σ_h p-model(h, x).

The autoencoder can be thought of as approximating this sum with a point estimate for just one
highly likely value of h. From this point of view, with this chosen h, we are maximizing

log p-model(h, x) = log p-model(h) + log p-model(x | h).

The log p-model(h) term can be sparsity-inducing. For example, the Laplace prior

p-model(h_i) = (λ/2) exp(−λ|h_i|)

corresponds to an absolute value sparsity penalty. Expressing the log-prior as an absolute
value penalty, we obtain

Ω(h) = λ Σ_i |h_i|,
−log p-model(h) = Σ_i (λ|h_i| − log(λ/2)) = Ω(h) + const,

where the constant term depends only on λ and not on h.

3.2. Denoising Autoencoders:
Rather than adding a penalty Ω to the cost function, we can obtain an autoencoder that learns
something useful by changing the reconstruction error term of the cost function.
Autoencoders traditionally minimize some function L(x, g(f(x))). This encourages g ◦ f to be
merely an identity function if it has the capacity to be one. A denoising autoencoder
(DAE) instead minimizes L(x, g(f(x̃))), where x̃ is a copy of x that has been corrupted by some
form of noise, so the model must undo the corruption rather than simply copy its input.
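One common corruption choice (an assumption here, since the text leaves the noise unspecified) is additive Gaussian noise; the sketch below corrupts x before encoding but still measures reconstruction error against the clean x, again reusing the earlier training loop.

noise_std = 0.3                                  # corruption level (illustrative)
x_tilde = x + noise_std * torch.randn_like(x)    # corrupted copy x~ of the input
r = model(x_tilde)                               # g(f(x~))
loss = mse(r, x)                                 # L(x, g(f(x~))): the target is the clean x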

3.3. Contractive Autoencoders:


Another strategy for regularizing an autoencoder is to use a penalty Ω, as in sparse
autoencoders,

L(x, g(f(x))) + Ω(h, x),

but with a different form of Ω:

Ω(h, x) = λ Σ_i ||∇_x h_i||²,

i.e., the squared Frobenius norm of the Jacobian of the encoder at x.
This forces the model to learn a function that does not change much when x changes slightly.
Because this penalty is applied only at training examples, it forces the autoencoder to learn
features that capture information about the training distribution. An autoencoder regularized
in this way is called a contractive autoencoder or CAE.
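One straightforward (if inefficient) way to compute this penalty with automatic differentiation is to accumulate the squared input-gradients of each code unit, one backward pass per unit; the sketch below assumes the encoder from the earlier Autoencoder class and an illustrative weight lam.

import torch

def contractive_penalty(encoder, x, lam=1e-3):
    """Squared Frobenius norm of the encoder Jacobian dh/dx, scaled by lam."""
    x = x.clone().requires_grad_(True)
    h = encoder(x)                         # h = f(x), shape (batch, code_dim)
    penalty = 0.0
    for j in range(h.size(1)):             # one backward pass per code unit (O(code_dim) cost)
        grad_j, = torch.autograd.grad(h[:, j].sum(), x,
                                      create_graph=True, retain_graph=True)
        penalty = penalty + (grad_j ** 2).sum(dim=1)
    return lam * penalty.mean()

The training objective is then the reconstruction error plus this penalty, e.g. mse(model(x), x) + contractive_penalty(model.encoder, x).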

4. Stochastic Encoders and Decoders

In the case of an autoencoder, x is now the target as well as the input. Given a hidden code h,
we may think of the decoder as providing a conditional distribution p-decoder(x | h). We can then
train the autoencoder by minimizing −log p-decoder(x | h). The output variables are usually treated
as being conditionally independent given h, so that this probability distribution is inexpensive
to evaluate, but some techniques, such as mixture density outputs, allow tractable modelling of
outputs with correlations.
Any latent variable model p-model(h, x) defines a stochastic encoder and a stochastic decoder

p-encoder(h | x) = p-model(h | x)
p-decoder(x | h) = p-model(x | h).

The encoder and decoder distributions are not necessarily conditional distributions
compatible with a unique joint distribution p-model(x, h).
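For binary-valued data, a concrete instance of a stochastic decoder treats each output as an independent Bernoulli variable whose parameters are produced by the decoder, so minimizing −log p-decoder(x | h) reduces to a binary cross-entropy. The sketch below makes that assumption explicit; the 784/32 dimensions are illustrative.

import torch
import torch.nn as nn
import torch.nn.functional as F

encoder = nn.Linear(784, 32)    # deterministic encoder f(x), kept simple for the sketch
decoder = nn.Linear(32, 784)    # outputs the logits of the Bernoulli means of p_decoder(x | h)

def reconstruction_nll(x):
    """Negative log-likelihood -log p_decoder(x | h) under independent Bernoulli outputs."""
    h = encoder(x)
    logits = decoder(h)
    # conditional independence of the outputs given h makes this a sum of per-unit terms
    return F.binary_cross_entropy_with_logits(logits, x, reduction='sum') / x.size(0)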

5. Boltzmann Machines

We define the Boltzmann machine over a d-dimensional binary random vector x ∈ {0, 1}^d. The
Boltzmann machine is an energy-based model, meaning it defines the joint probability
distribution using an energy function:

P(x) = exp(−E(x)) / Z,

where E(x) is the energy function and Z is the partition function that ensures the
probabilities sum to 1. The energy function of the Boltzmann machine is given by

E(x) = −x^T U x − b^T x,

where U is the “weight” matrix of model parameters and b is the vector of bias parameters.
The Boltzmann machine becomes more powerful when not all the variables are observed. In this
case, the non-observed variables, or latent variables, can act similarly to hidden units in a
multi-layer perceptron and model higher-order interactions among the visible units. With latent
variables, the Boltzmann machine becomes a universal approximator of probability mass functions
over discrete variables.

Formally, we decompose the units x into two subsets: the visible units v and the latent (or
hidden) units h. The energy function becomes

E(v, h) = −v^T R v − v^T W h − h^T S h − b^T v − c^T h.
6. Restricted Boltzmann Machines

(a) The restricted Boltzmann machine itself is an undirected graphical model based on a
bipartite graph, with visible units in one part of the graph and hidden units in the other
part. There are no connections among the visible units, nor any connections among
the hidden units. Typically every visible unit is connected to every hidden unit but it is
possible to construct sparsely connected RBMs such as convolutional RBMs.
(b) A deep belief network is a hybrid graphical model involving both directed and undirected
connections. Like an RBM, it has no intra-layer connections. However, a DBN has
multiple hidden layers, and thus there are connections between hidden units that are in
separate layers. All of the local conditional probability distributions needed by the
deep belief network are copied directly from the local conditional probability
distributions of its constituent RBMs. Alternatively, we could also represent the deep
belief network with a completely undirected graph, but it would need intra-layer
connections to capture the dependencies between parents.
(c) A deep Boltzmann machine is an undirected graphical model with several layers of
latent variables. Like RBMs and DBNs, DBMs lack intra-layer connections. DBMs
are less closely tied to RBMs than DBNs are. When initializing a DBM from a stack
of RBMs, it is necessary to modify the RBM parameters slightly. Some kinds of
DBMs may be trained without first training a set of RBMs.

The restricted Boltzmann machine is an energy-based model with the joint probability
distribution specified by its energy function:

P(v, h) = (1/Z) exp(−E(v, h)),

where the energy function E(v, h) is given by

E(v, h) = −b^T v − c^T h − v^T W h,

and Z is the normalization constant (partition function), Z = Σ_v Σ_h exp(−E(v, h)).
6.1. Conditional Distributions
Though P(v) is intractable, the bipartite graph structure of the RBM has the very special
property that its conditional distributions P(h | v) and P(v | h) are factorial and simple to
compute. Deriving them from the joint distribution is a simple matter of normalizing the
distributions over the individual binary h_j, which gives

P(h | v) = Π_j P(h_j | v), with P(h_j = 1 | v) = σ(c_j + v^T W_{:,j}).

Similarly,

P(v | h) = Π_i P(v_i | h), with P(v_i = 1 | h) = σ(b_i + W_{i,:} h).
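Because both conditionals factorize over units, a block Gibbs step for an RBM reduces to two matrix products and element-wise sigmoids. The sketch below assumes a weight matrix W of shape (n_visible, n_hidden) with visible biases b and hidden biases c; all sizes are illustrative.

import torch

def sample_h_given_v(v, W, c):
    """P(h_j = 1 | v) = sigmoid(c_j + v^T W[:, j]), sampled independently per hidden unit."""
    p_h = torch.sigmoid(c + v @ W)
    return torch.bernoulli(p_h), p_h

def sample_v_given_h(h, W, b):
    """P(v_i = 1 | h) = sigmoid(b_i + W[i, :] h), sampled independently per visible unit."""
    p_v = torch.sigmoid(b + h @ W.t())
    return torch.bernoulli(p_v), p_v

# one block Gibbs step: v -> h -> v'
n_visible, n_hidden = 784, 128
W = 0.01 * torch.randn(n_visible, n_hidden)
b, c = torch.zeros(n_visible), torch.zeros(n_hidden)
v = torch.bernoulli(0.5 * torch.ones(1, n_visible))
h, _ = sample_h_given_v(v, W, c)
v_new, _ = sample_v_given_h(h, W, b)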

7. Deep Belief networks

Deep belief networks (DBNs) were one of the first non-convolutional models to successfully
admit training of deep architectures. Deep belief networks are generative models with several
layers of latent variables. The latent variables are typically binary, while the visible units may
be binary or real. There are no intra-layer connections. Usually, every unit in each layer is
connected to every unit in each neighboring layer, though it is possible to construct more
sparsely connected DBNs. The connections between the top two layers are undirected. The
connections between all other layers are directed, with the arrows pointed toward the layer
that is closest to the data.

A DBN with l hidden layers contains l weight matrices W^(1), ..., W^(l). It also contains l + 1
bias vectors b^(0), ..., b^(l), with b^(0) providing the biases for the visible layer. The
probability distribution represented by the DBN is given by

P(h^(l), h^(l−1)) ∝ exp(b^(l)T h^(l) + b^(l−1)T h^(l−1) + h^(l−1)T W^(l) h^(l)),
P(h_i^(k) = 1 | h^(k+1)) = σ(b_i^(k) + W_{:,i}^(k+1)T h^(k+1)) for all i, for k = 1, ..., l − 2,
P(v_i = 1 | h^(1)) = σ(b_i^(0) + W_{:,i}^(1)T h^(1)) for all i.

In the case of real-valued visible units, substitute

v ~ N(v; b^(0) + W^(1)T h^(1), β^(−1)),

with β diagonal, for tractability. Generalizations to other exponential family visible units are
straightforward, at least in theory. A DBN with only one hidden layer is just an RBM.
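To generate a sample from a DBN, one runs a Gibbs chain in the top (undirected) RBM and then performs a single ancestral pass down through the directed layers. The sketch below illustrates this for a hypothetical two-hidden-layer DBN, with W1 of shape (n_visible, n_h1) and W2 of shape (n_h1, n_h2); the bias and weight names are assumptions for the sketch.

import torch

def sample_dbn(W1, W2, b0, b1, b2, n_gibbs=100):
    """Ancestral sampling from a 2-hidden-layer DBN.
    The top pair (h1, h2) forms an undirected RBM; h1 -> v is a directed sigmoid layer."""
    h1 = torch.bernoulli(0.5 * torch.ones(1, W2.size(0)))
    for _ in range(n_gibbs):                                    # Gibbs chain in the top RBM
        h2 = torch.bernoulli(torch.sigmoid(b2 + h1 @ W2))
        h1 = torch.bernoulli(torch.sigmoid(b1 + h2 @ W2.t()))
    # single directed pass down to the visible units
    v = torch.bernoulli(torch.sigmoid(b0 + h1 @ W1.t()))
    return v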

DBN Issues:
• Inference in a deep belief network is intractable due to
– the explaining away effect within each directed layer, and
– the interaction between the two hidden layers that have undirected
connections.
• As a consequence, maximizing the standard evidence lower bound on the log-likelihood is
also intractable.
• Exact inference is therefore impractical, leading to the need for approximate inference
techniques.
1. Explaining Away Effect within Each Directed Layer:
 Each layer in a DBN typically contains directed connections (it's a Bayesian
network).
 During inference in a layer, all possible configurations of hidden units within that
layer are considered.
 This leads to a combinatorial explosion in the number of configurations, making
exact inference intractable.
 The explaining away effect occurs when the states of hidden units in the layer
become correlated, making it challenging to compute the posterior probabilities
efficiently.
2. Interaction Between Two Hidden Layers with Undirected Connections:
 DBNs consist of multiple hidden layers with both directed and undirected
connections.
 The interaction between these layers can create complex dependencies that are
challenging to model and compute.
 The undirected connections often result in high-dimensional, continuous
distributions that are difficult to handle analytically.
3. Maximizing the Standard Evidence Lower Bound (ELBO) on the Log-
Likelihood Is Intractable:
 DBNs are trained using techniques like variational inference or expectation-maximization
(EM) to maximize the ELBO on the log-likelihood.
 However, computing the ELBO often involves integrals over high-dimensional
spaces, making it computationally intractable, especially for deep networks with
many layers.
8. Deep Boltzmann Machines (DBMs):

 It is a deep generative model and, unlike the deep belief network (DBN), it is an entirely
undirected model.
 Unlike the RBM, the DBM has several layers of latent variables (RBMs have just
one).
 Similar to the RBM, within each layer, the variables are mutually independent,
conditioned on the variables in the neighboring layers.

 Compared to the RBM energy function, the DBM energy function includes connections
between the hidden units (latent variables) in the form of the weight matrices W^(2)
and W^(3). For a DBM with one visible layer v and three hidden layers h^(1), h^(2), h^(3),
the energy is E(v, h^(1), h^(2), h^(3)) = −v^T W^(1) h^(1) − h^(1)T W^(2) h^(2) − h^(2)T W^(3) h^(3)
(bias terms omitted for simplicity).
 These connections have significant consequences for both the model behavior and
inference in the model.

 The DBM layers can be organized into a bipartite graph, with odd layers on one side
and even layers on the other.
 This immediately implies that when we condition on the variables in the even layers,
the variables in the odd layers become conditionally independent, and vice versa (see the
block Gibbs sketch below).
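A sketch of one such block Gibbs sweep for a hypothetical DBM with layers v, h1, h2 (weights W1 between v and h1, W2 between h1 and h2, biases b, c1, c2) is shown below; all names and shapes are illustrative assumptions.

import torch

def dbm_gibbs_sweep(v, h1, h2, W1, W2, b, c1, c2):
    """One block Gibbs sweep: update the 'odd' layer h1 given (v, h2),
    then update the 'even' layers v and h2 given h1."""
    # h1 receives input from both of its neighbouring layers, v and h2
    h1 = torch.bernoulli(torch.sigmoid(c1 + v @ W1 + h2 @ W2.t()))
    # conditioned on h1, v and h2 are independent and can be updated together
    v = torch.bernoulli(torch.sigmoid(b + h1 @ W1.t()))
    h2 = torch.bernoulli(torch.sigmoid(c2 + h1 @ W2))
    return v, h1, h2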
Properties of DBMs:
DBMs were developed after Deep Belief Networks (DBNs) and have the following interesting
properties.

 Simplicity of Posterior Distribution:

 The posterior distribution P(h | v) in DBMs is simpler compared to that of Deep Belief
Networks (DBNs).
 This simplicity is a notable property of DBMs.

 Rich Approximations of Posterior:

 The simplicity of the posterior distribution in DBMs allows for richer approximations
of the posterior.
 This means that DBMs can provide more accurate estimates of the posterior
distribution, which is essential for various tasks such as generative modeling and
inference.

 Conditional Independence of Hidden Units:

 In DBMs, all of the hidden units within a layer are conditionally independent given
the other layers.
 This lack of intralayer interaction is a distinctive feature of DBMs.

 Optimization of Variational Lower Bound:

 The conditional independence of hidden units in DBMs allows the use of fixed point
equations to optimize the variational lower bound.
 DBMs can find the true optimal mean field expectations within some numerical
tolerance.
 This property indicates that DBMs can achieve better optimization of model
parameters and more accurate inference.

 Relevance to Neuroscience:

 DBMs are interesting from the point of view of neuroscience.
 DBMs capture the influence of top-down feedback interactions, similar to the human
brain's use of top-down feedback connections.
 They have been used as computational models of real neuroscientific phenomena.

 Sampling Challenges in DBMs:


 Sampling from DBMs is relatively difficult.
 DBNs only need to use MCMC (Markov chain Monte Carlo) sampling in their top
pair of layers; the other layers are only involved in a single efficient ancestral sampling
pass at the end.
 DBMs, in contrast, require MCMC sampling across all layers, with every layer
participating in every Markov chain transition.
