
Unit 4

 Deep RNNs
The concept of depth in an RNN is not as clear as it is in feedforward neural networks. By
carefully analysing and understanding the architecture of an RNN, however, we find three
points of an RNN which may be made deeper: (1) the input-to-hidden function, (2) the hidden-to-hidden
transition and (3) the hidden-to-output function. Based on this observation, two novel deep
RNN architectures have been proposed that are orthogonal to the earlier approach of stacking
multiple recurrent layers to build a deep RNN.
The standard method for building this sort of deep RNN is strikingly simple: we stack the
RNNs on top of each other. Given a sequence of length T, the first RNN produces a sequence
of outputs, also of length T. These, in turn, constitute the inputs to the next RNN layer. In this
short section, we illustrate this design pattern and present a simple example of how to code
up such stacked RNNs. The figure below illustrates a deep RNN with L hidden layers. Each
hidden layer operates on a sequential input and produces a sequential output. Moreover, any
RNN cell (white box in the figure) at each time step depends on both the same layer's value at the
previous time step and the previous layer's value at the same time step.

Formally, suppose that we have a minibatch input X_t ∈ R^(n×d) (number of examples: n,
number of inputs in each example: d) at time step t. At the same time step, let the hidden state
of the l-th hidden layer (l = 1, …, L) be H_t^(l) ∈ R^(n×h) (number of hidden units: h) and the output
layer variable be O_t ∈ R^(n×q) (number of outputs: q). Setting H_t^(0) = X_t, the hidden state of
the l-th hidden layer, which uses the activation function ϕ_l, is calculated as follows:

H_t^(l) = ϕ_l( H_t^(l−1) W_xh^(l) + H_{t−1}^(l) W_hh^(l) + b_h^(l) )

where the weights W_xh^(l) and W_hh^(l), together with the bias b_h^(l), are the
model parameters of the l-th hidden layer.
In the end, the calculation of the output layer is only based on the hidden state of the
final L-th hidden layer:

O_t = H_t^(L) W_hq + b_q

where the weight W_hq ∈ R^(h×q) and the bias b_q ∈ R^(1×q) are the model parameters of the
output layer.
Just as with MLPs, the number of hidden layers L and the number of hidden units h are
hyperparameters that we can tune. Common RNN layer widths (h) are in the range (64, 2056), and
common depths (L) are in the range (1,8). In addition, we can easily get a deep gated RNN by
replacing the hidden state computation with that from an LSTM or a GRU.
In deep RNNs, the hidden state information is passed to the next time step of the current layer
and the current time step of the next layer.
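As a concrete illustration (my sketch, not part of the original notes), such a stack can be built in PyTorch simply by setting num_layers > 1 on a recurrent layer; the layer sizes below are arbitrary example values.

```python
import torch
import torch.nn as nn

# Hypothetical sizes chosen only for illustration.
n_inputs, n_hiddens, n_layers, n_outputs = 8, 32, 2, 4

class DeepRNN(nn.Module):
    def __init__(self):
        super().__init__()
        # num_layers=2 stacks two recurrent layers: the output sequence of
        # layer 1 becomes the input sequence of layer 2.
        self.rnn = nn.LSTM(n_inputs, n_hiddens, num_layers=n_layers)
        # The output layer only reads the hidden state of the final layer.
        self.fc = nn.Linear(n_hiddens, n_outputs)

    def forward(self, x, state=None):
        # x: (seq_len, batch, n_inputs)
        h_seq, state = self.rnn(x, state)   # h_seq: (seq_len, batch, n_hiddens)
        return self.fc(h_seq), state        # per-time-step outputs O_t

x = torch.randn(10, 3, n_inputs)            # a sequence of length 10, batch of 3
outputs, state = DeepRNN()(x)
print(outputs.shape)                        # torch.Size([10, 3, 4])
```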
 RNN and Vanishing Gradient Problem
Let us have a look at the basic architecture of the recurrent neural network. The image below
is an RNN.

The neural network receives an input sequence [x(1), x(2), …, x(k)]; at time step t, we provide
the input x(t). Past information and learned knowledge are encoded in the network as state
vectors [c(1), c(2), …, c(k−1)]; at time step t, the network holds the state vector c(t−1). The
state vector c(t−1) and the input vector x(t) are concatenated to form the complete input vector
at time step t, i.e., [c(t−1), x(t)].
The two weight matrices Wrec and Win connect the two parts of the input vector, c(t−1) and
x(t), to the hidden layer. We ignore the bias vectors in our calculations and write
W = [Wrec, Win] instead.
In the hidden layer, the sigmoid function is utilized as the activation function.
At the last time step, the network produces a single vector (RNNs can output a vector at each
time step, but we'll use this simplified model).
 Back propagation through time (BPTT)
Backpropagation Through Time (BPTT) is the training algorithm used to update the weights in
recurrent neural networks such as LSTMs. It is a gradient-based technique for training these
kinds of recurrent networks.
Recurrent Neural Networks are networks that deal with sequential data. They can predict
outputs based not only on the current inputs but also on the inputs that came before them. The
output at the present time step depends on the current input and on the memory element
(which encodes the previous inputs). To train these networks, we use traditional
backpropagation with an added twist: we do not train the system only at the exact time "t";
we train it at time "t" together with everything that has occurred before it, i.e. t−1, t−2, t−3.
S1, S2, and S3 are the hidden states (memory units) at times t1, t2, and t3, respectively, and Ws
is the weight matrix associated with them. X1, X2, and X3 are the inputs at times t1, t2, and t3,
respectively, and Wx is the weight matrix associated with them. Y1, Y2, and Y3 are the outputs
at times t1, t2, and t3, respectively, and Wy is the weight matrix associated with them. For any
time t, we have the following two equations:
St = g1(Wx·Xt + Ws·St−1)
Yt = g2(Wy·St)
Where g1 and g2 are activation functions. We will now perform the back propagation at time
t = 3.
Let the error function be: Et = (dt − Yt)²
Here, we employ the squared error, in which d3 is the desired output at time t = 3.
In order to do backpropagation, it is necessary to adjust the weights that are associated with
the inputs, the memory units, and the outputs.
Adjusting Wy
To better understand, we can look at the following image:

Explanation:
E3 is a function of Y3. Hence, we differentiate E3 with respect to Y3.
Y3 is a function of Wy. Hence, we differentiate Y3 with respect to Wy:
∂E3/∂Wy = (∂E3/∂Y3)·(∂Y3/∂Wy)
Adjusting Ws
To better understand, we can look at the following image:

Explanation:
E3 is a function of Y3. Therefore, we differentiate E3 with respect to Y3. Y3 is a function
of S3. Therefore, we differentiate Y3 with respect to S3.
S3 is a function of Ws. Therefore, we differentiate S3 with respect to Ws.
But it is not enough to stop here, so we also have to consider the previous time steps.
We must also (partially) differentiate the error function with respect to the memory units S2 and
S1, taking the weight matrix Ws into account.
It is essential to be aware that a memory unit St is a function of its predecessor
memory unit St−1.
Therefore, we differentiate S3 with respect to S2 and S2 with respect to S1.
In general, we can describe this formula as:
∂E3/∂Ws = Σ_{i=1}^{3} (∂E3/∂Y3)·(∂Y3/∂S3)·(∂S3/∂Si)·(∂Si/∂Ws)
Adjusting WX:
To better understand, we can look at the following image:
Explanation:
E3 is a function of Y3. Therefore, we differentiate E3 with respect to Y3. Y3 is a function
of S3. Therefore, we differentiate Y3 with respect to S3.
S3 is a function of Wx. Thus, we differentiate S3 with respect to Wx.
We cannot just stop here, so we also need to consider the preceding time steps.
We therefore also (partially) differentiate the error function with respect to the memory units S2 and S1,
taking the weight matrix Wx into account.
In general, we can describe this formula as:
∂E3/∂Wx = Σ_{i=1}^{3} (∂E3/∂Y3)·(∂Y3/∂S3)·(∂S3/∂Si)·(∂Si/∂Wx)
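To make these sums concrete, here is a minimal NumPy sketch for the three-step example above, assuming g1 = tanh and g2 = identity (an illustrative choice, not stated in the notes, along with the arbitrary weights and inputs); it evaluates ∂E3/∂Ws and ∂E3/∂Wx exactly as written.

```python
import numpy as np

# Toy scalar RNN with three time steps, matching St = g1(Wx*Xt + Ws*St-1) and
# Yt = g2(Wy*St); here g1 = tanh and g2 = identity are illustrative assumptions.
Wx, Ws, Wy = 0.5, 0.8, 1.2          # arbitrary example weights
X = [1.0, 0.2, -0.4]                # inputs X1, X2, X3
d3 = 0.7                            # desired output at t = 3

# Forward pass
S = [0.0]                           # S0 = 0
for x in X:
    S.append(np.tanh(Wx * x + Ws * S[-1]))
Y3 = Wy * S[3]
E3 = (d3 - Y3) ** 2

# Backward pass (BPTT) using the chain-rule sums above
dE3_dY3 = -2 * (d3 - Y3)
dY3_dS3 = Wy
dS_dz = [1 - S[t] ** 2 for t in range(1, 4)]   # tanh'(z_t) for t = 1, 2, 3

def dS3_dSi(i):
    # Product of dS_j/dS_{j-1} = tanh'(z_j) * Ws for j = i+1, ..., 3
    return np.prod([dS_dz[j - 1] * Ws for j in range(i + 1, 4)])

dE3_dWs = sum(dE3_dY3 * dY3_dS3 * dS3_dSi(i) * dS_dz[i - 1] * S[i - 1] for i in range(1, 4))
dE3_dWx = sum(dE3_dY3 * dY3_dS3 * dS3_dSi(i) * dS_dz[i - 1] * X[i - 1] for i in range(1, 4))
print(round(E3, 4), round(dE3_dWs, 4), round(dE3_dWx, 4))
```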
Limitations: Backpropagation Through Time (BPTT) can, in practice, only be employed for a
limited number of time steps, such as 8 or 10. If we continue to backpropagate further, the
gradient becomes too small. This is known as the "vanishing gradient" problem, and it occurs
because the value of the information diminishes geometrically with time. Therefore, if the
number of time steps is greater than, say, 10, the earlier information is effectively discarded.
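A tiny numeric sketch of why this happens (illustrative values only, not from the notes): each backward step multiplies in a factor σ′(z)·Wrec, and since the sigmoid derivative is at most 0.25 the product decays geometrically with the number of steps.

```python
import numpy as np

def sigmoid_prime(z):                        # derivative of the sigmoid activation
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1.0 - s)

# Per-step backward factors sigma'(z_t) * Wrec for a scalar RNN (illustrative values).
Wrec = 0.9
z = np.random.uniform(-2, 2, size=50)        # 50 arbitrary pre-activations
factors = sigmoid_prime(z) * Wrec            # each factor is at most 0.25 * Wrec

for k in (5, 10, 20, 50):
    print(k, np.prod(factors[:k]))           # the product shrinks geometrically toward 0
```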

 Truncated BPTT:
Truncated Backpropagation Through Time is a modified version of this approach that is more
efficient on sequence prediction problems with very long sequences.
Choosing how many timesteps to utilize as input while training recurrent neural networks like
LSTMs using Truncated Backpropagation Through Time is an important setup choice. That
is, how to break down your extremely long input sequences into sub-sequences for optimal
efficiency.
In truncated backpropagation through time (TBPTT), the input is treated as fixed-length
sub-sequences. In the forward pass, the hidden state of the previous sub-sequence is passed on
as input to the following sub-sequence. In the backward pass, however, the computed gradient
values are dropped at the end of each sub-sequence. In normal backpropagation, the gradient
values at time t′ are used at every earlier time step t < t′; in truncated backpropagation, if
t′ − t exceeds the sub-sequence length, the gradients do not flow from t′ back to t.
The TBPTT algorithm requires the consideration of two parameters:
 k1: The number of forward-pass timesteps between updates. Generally, this influences how
slow or fast training will be, given how often weight updates are performed.
 k2: The number of timesteps to which to apply BPTT. Generally, it should be large enough to
capture the temporal structure in the problem for the network to learn. Too large a value
results in vanishing gradients.
Graphical representation of BPTT and TBPTT is shown below:

The full history of activations and inputs in the forward pass (blue arrow above representing
hidden/internal state flow) must be stored for use in the backpropagation step in standard
backpropagation through time (BPTT) (red arrow shows gradient flow). This can be both
computationally and memory intensive, especially for a character language model.
But TBPTT reduces the number of timesteps utilized on the backward pass, allowing it to
estimate rather than calculate the gradient used to update the weights.
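In practice (an illustrative PyTorch sketch, not from the notes), TBPTT behaviour is often obtained by carrying the hidden state across sub-sequences while detaching it from the computation graph, so gradients only flow back through the current chunk; the model, chunk length and data below are placeholder assumptions.

```python
import torch
import torch.nn as nn

# Placeholder model and data purely for illustration.
model = nn.LSTM(input_size=1, hidden_size=16)
head = nn.Linear(16, 1)
optim = torch.optim.SGD(list(model.parameters()) + list(head.parameters()), lr=0.01)
loss_fn = nn.MSELoss()

long_x = torch.randn(5000, 1, 1)     # one very long sequence: (time, batch, feature)
long_y = torch.randn(5000, 1, 1)
k = 50                               # sub-sequence length (k1 = k2 = 50 here)

state = None
for start in range(0, long_x.size(0), k):
    x = long_x[start:start + k]
    y = long_y[start:start + k]
    if state is not None:
        # Keep the values of the hidden state, but cut the gradient path:
        # this is what truncates backpropagation at the sub-sequence boundary.
        state = tuple(s.detach() for s in state)
    out, state = model(x, state)
    loss = loss_fn(head(out), y)
    optim.zero_grad()
    loss.backward()
    optim.step()
```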
Use of TBPTT
Truncated Backpropagation Through Time (TBPTT) offers the computational benefits of
BPTT while eliminating the need for a complete retrace through the whole data sequence at
each step. However, truncation favours short-term dependencies: the gradient estimate of
truncated BPTT is biased. Therefore, it does not benefit from the convergence guarantees of
stochastic gradient theory.
Preparing Sequence Data for TBPTT
The number of timesteps utilized in the forward and backward passes of BPTT is determined
by how you divide up your sequence data.
Use Data As-Is
A practical limit of around 200 to 400 timesteps has been suggested for TBPTT. If your
sequence data is within this range, you can use the sequence observations directly as the
timesteps of the input data.
For example, if you had a collection of 100 univariate sequences with 25 timesteps, you
could reshape it into 100 samples, 25 timesteps, and 1 feature, or [100, 25, 1].
Naive Data Split
If your input sequences are large, such as hundreds of timesteps, you may need to divide
them up into many contiguous subsequences.
If you had 100 input sequences with 50,000 timesteps each, for example, each one might be
broken into 100 subsequences of 500 timesteps. Each input sequence would then provide 100
samples, giving a total of 10,000 samples from the original 100. The Keras input would have a
dimensionality of 10,000 samples, 500 timesteps, and 1 feature, or [10000, 500, 1]. Care would
be needed to preserve the state across the 100 subsequences belonging to each original
sequence and to reset the internal state, explicitly or implicitly, after every 100 samples.
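The split itself is a plain reshape; a small NumPy sketch with the assumed shapes:

```python
import numpy as np

# 100 univariate sequences of 50,000 timesteps (assumed toy data).
data = np.random.rand(100, 50000)

# Break each sequence into 100 contiguous subsequences of 500 timesteps,
# then add a trailing feature dimension: [10000, 500, 1].
subseqs = data.reshape(100 * 100, 500, 1)
print(subseqs.shape)   # (10000, 500, 1)
```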
Domain-Specific Data Split
Knowing the proper number of timesteps to generate a useful estimate of the error gradient
can be difficult.
We can generate a model rapidly using the naïve technique, but the model may not be
optimal. Alternatively, while learning the issue, we may utilize domain-specific knowledge to
predict the number of timesteps that will be important to the model. If the sequence problem
is a regression time series, looking at the autocorrelation and partial autocorrelation plots
might help you decide on the number of timesteps to use.
Systematic Data Split
You can systematically examine a suite of possible subsequence lengths for your sequence
prediction challenge rather than guessing at a reasonable number of timesteps.
You might do a grid search over each sub-sequence length and pick the arrangement that
produces the best overall model.
Lean Heavily On Internal State With TBPTT
Each timestep of your sequence prediction problem may be reformulated as having one input
and one output.
If you had 100 sequences of 50 timesteps, for example, each timestep would become a new
sample. The original 100 samples would turn into 5,000. The three-dimensional input
would be [5000, 1, 1], or 5,000 samples, 1 timestep, and 1 feature.
Again, this would need preserving the internal state of the sequence at each timestep and
resetting it at the conclusion of each real sequence (50 samples).
Decouple Forward and Backward Sequence Length
For the forward and backward passes of Truncated Backpropagation Through Time, the
Keras deep learning library can be used with a variable number of timesteps.
In essence, the number of timesteps on input sequences may be used to specify the k1
parameter, while the "truncate gradient" argument on the LSTM layer could be used to
specify the k2 parameter.
 Gated Recurrent Unit (GRU)
GRU, or Gated Recurrent Unit, is an advancement of the standard RNN, i.e. recurrent neural
network. GRUs are very similar to Long Short Term Memory (LSTM). Just like LSTM, GRU
uses gates to control the flow of information. They are relatively newer than LSTM, offer
some improvements over it, and have a simpler architecture.

Another interesting thing about GRU is that, unlike LSTM, it does not have a separate cell
state (Ct). It only has a hidden state (Ht). Due to the simpler architecture, GRUs are faster to
train.
A GRU is a very useful mechanism for fixing the vanishing gradient problem in recurrent
neural networks. The vanishing gradient problem occurs in machine learning when the
gradient becomes vanishingly small, which prevents the weight from changing its value.
They also have better performance than LSTM when dealing with smaller datasets.
The architecture of Gated Recurrent Unit
Now let's understand how GRU works. Here we have a GRU cell, which is more or less similar
to an LSTM cell or an RNN cell.

At each timestamp t, it takes an input Xt and the hidden state Ht−1 from the previous
timestamp t−1. It then outputs a new hidden state Ht, which is again passed on to the next
timestamp.
Now there are primarily two gates in a GRU as opposed to three gates in an LSTM cell. The
first gate is the Reset gate and the other one is the update gate.
Reset Gate (Short term memory)
The Reset Gate is responsible for the short-term memory of the network, i.e. the hidden state
(Ht). Here is the equation of the Reset gate:

rt = σ(Xt·Ur + Ht−1·Wr)

If you remember the LSTM gate equations, this is very similar to them. The value of rt ranges
from 0 to 1 because of the sigmoid function. Here Ur and Wr are the weight matrices for
the reset gate.
Update Gate (Long Term memory)
Similarly, we have an Update gate for long-term memory, and the equation of the gate is
shown below:

ut = σ(Xt·Uu + Ht−1·Wu)

The only difference is in the weight matrices, i.e. Uu and Wu.


How GRU Works
Now let's see the functioning of these gates. To find the hidden state Ht in a GRU, a
two-step process is followed. The first step is to generate what is known as the candidate
hidden state, as shown below.
Candidate Hidden State

Ĥt = tanh(Xt·Ug + (rt ⊙ Ht−1)·Wg)

It takes in the input Xt and the hidden state from the previous timestamp Ht−1, which is
multiplied element-wise by the reset gate output rt. This entire information is then passed
through the tanh function, and the resulting value is the candidate hidden state.
The most important part of this equation is how we are using the value of the reset gate to
control how much influence the previous hidden state can have on the candidate state.
If the value of rt is equal to 1 then it means the entire information from the previous hidden
state Ht-1 is being considered. Likewise, if the value of rt is 0 then that means the
information from the previous hidden state is completely ignored.
Hidden state
Once we have the candidate state, it is used to generate the current hidden state Ht. This is
where the Update gate comes into the picture. This is a very interesting equation: instead of
using a separate gate as in LSTM, the GRU uses a single update gate to control both the
historical information, which is Ht−1, and the new information, which comes from the
candidate state:

Ht = ut ⊙ Ht−1 + (1 − ut) ⊙ Ĥt
Now assume the value of ut is around 0. Then the first term in the equation will vanish, which
means the new hidden state will not carry much information from the previous hidden state.
On the other hand, the coefficient of the second part becomes almost one, which essentially
means the hidden state at the current timestamp will consist of the information from the
candidate state only.

Similarly, if the value of ut is 1, the second term will become entirely 0 and the current
hidden state will depend entirely on the first term, i.e. the information from the hidden state at
the previous timestamp t−1.

Hence we can conclude that the value of ut is very critical in this equation and it can range
from 0 to 1.
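Putting the equations above together, a one-step GRU update can be sketched in NumPy as follows (the shapes, random values, and omission of bias terms are illustrative assumptions):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, params):
    """One GRU time step following the reset/update/candidate equations above."""
    Ur, Wr, Uu, Wu, Ug, Wg = params
    r_t = sigmoid(x_t @ Ur + h_prev @ Wr)            # reset gate
    u_t = sigmoid(x_t @ Uu + h_prev @ Wu)            # update gate
    h_cand = np.tanh(x_t @ Ug + (r_t * h_prev) @ Wg) # candidate hidden state
    return u_t * h_prev + (1.0 - u_t) * h_cand       # new hidden state H_t

d, h = 4, 3                                          # toy sizes
rng = np.random.default_rng(0)
params = [rng.normal(size=s) for s in [(d, h), (h, h)] * 3]
h_t = gru_step(rng.normal(size=(1, d)), np.zeros((1, h)), params)
print(h_t.shape)   # (1, 3)
```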

 Long Short-Term Memory

Long short-term memory (LSTM) is an artificial neural network used in the fields of artificial
intelligence and deep learning. Unlike standard feed-
forward neural networks, LSTM has feedback connections. Long Short Term Memory is a
kind of recurrent neural network: in an RNN, the output from the previous step is fed as input
to the current step. LSTM tackles the problem of long-term dependencies in RNNs, where an
RNN cannot recall information stored far back in the sequence and only gives accurate
predictions from recent information; as the gap length increases, an RNN's performance
degrades. LSTM can, by design, retain information for a long period of time. It is used for
processing, predicting, and classifying on the basis of time-series data.
Structure of LSTM:
LSTM has a chain structure that contains four neural networks and different memory blocks
called cells.
Information is retained by the cells and the memory manipulations are done by
the gates. There are three gates; a short code sketch of the complete cell update is given after the list.
1. Forget Gate: The information that is no longer useful in the cell state is removed with the
forget gate. Two inputs, x_t (the input at the particular time) and h_t-1 (the previous cell output),
are fed to the gate and multiplied with weight matrices, followed by the addition of a bias.
The result is passed through a sigmoid activation function, which gives an output between
0 and 1. If, for a particular cell-state element, the output is close to 0, that piece of information
is forgotten; for an output close to 1, the information is retained for future use.

2. Input gate: The addition of useful information to the cell state is done by the input gate.
First, the information is regulated using the sigmoid function, which filters the values to be
remembered, similar to the forget gate, using the inputs h_t-1 and x_t. Then, a vector is created
using the tanh function, which gives an output from -1 to +1 and contains all the possible
values from h_t-1 and x_t. At last, the values of the vector and the regulated values are
multiplied to obtain the useful information.

3. Output gate: The task of extracting useful information from the current cell state to be
presented as output is done by the output gate. First, a vector is generated by applying the tanh
function to the cell state. Then, the information is regulated using the sigmoid function, which
filters the values to be remembered using the inputs h_t-1 and x_t. At last, the values of the
vector and the regulated values are multiplied and sent as the output of the current cell and as
input to the next cell.
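The sketch below (a simplified NumPy illustration, not the notes' own code) puts the three gates together into a single LSTM step; the weight shapes and the concatenated input [h_t-1, x_t] are assumptions made for the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W_f, W_i, W_c, W_o, b_f, b_i, b_c, b_o):
    """One LSTM time step: forget gate, input gate, candidate, output gate."""
    z = np.concatenate([h_prev, x_t])            # combined input [h_{t-1}, x_t]
    f_t = sigmoid(W_f @ z + b_f)                 # forget gate: what to drop from c
    i_t = sigmoid(W_i @ z + b_i)                 # input gate: what new info to add
    c_hat = np.tanh(W_c @ z + b_c)               # candidate values in (-1, 1)
    c_t = f_t * c_prev + i_t * c_hat             # new cell state
    o_t = sigmoid(W_o @ z + b_o)                 # output gate: what to expose
    h_t = o_t * np.tanh(c_t)                     # new hidden state / output
    return h_t, c_t

d, h = 4, 3                                      # toy sizes for the example
rng = np.random.default_rng(0)
Ws = [rng.normal(size=(h, h + d)) for _ in range(4)]
bs = [np.zeros(h) for _ in range(4)]
h_t, c_t = lstm_step(rng.normal(size=d), np.zeros(h), np.zeros(h), *Ws, *bs)
print(h_t.shape, c_t.shape)   # (3,) (3,)
```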

Some of the famous applications of LSTM include:


1. Language Modelling
2. Machine Translation
3. Image Captioning
4. Handwriting generation
5. Question Answering Chatbots
The unit is called a long short-term memory block because the program is using a structure
founded on short-term memory processes to create longer-term memory. These systems are
often used, for example, in natural language processing. The recurrent neural network uses the
long short-term memory blocks to take a particular word or phoneme and evaluate it in the
context of others in a string, where memory can be useful in sorting and categorizing these
types of inputs.
 LSTM Solving Vanishing Gradient Problem
At time step t the LSTM has an input vector of [h(t-1), x(t)]. The cell state of the LSTM unit
is defined by c(t). The output vectors that are passed through the LSTM network from time
step t to t+1 are denoted by h(t).

The three gates of the LSTM unit cell that update and control the cell state of the neural
network are the forget gate, the input gate, and the output gate.
The forget gate determines which information in the cell state should be forgotten when fresh
data enters the network. The output of the forget gate is given by:

f_t = σ(W_f·[h_{t−1}, x_t] + b_f)
Given the new input information, the input gate determines what new information is encoded
into the cell state. The output of the input gate is given by:

i_t = σ(W_i·[h_{t−1}, x_t] + b_i),   with the candidate values   c̃_t = tanh(W_c·[h_{t−1}, x_t] + b_c)

The contribution of the input gate to the cell state is the element-wise product i_t ⊙ c̃_t of the
outputs of these two fully connected layers.
The output gate, which controls the output vector h, determines what information encoded in
the cell state is delivered to the network as input at the next time step.
The activation of the output gate is given by:

o_t = σ(W_o·[h_{t−1}, x_t] + b_o)
The output vector of the cell is given by:

h_t = o_t ⊙ tanh(c_t)

Therefore, the cell state takes the form:

c_t = f_t ⊙ c_{t−1} + i_t ⊙ c̃_t
 Backpropagation Through Time in LSTM


On the k-th time step, our LSTM network generates a prediction vector h(k), just like the RNN
model. Long-term dependencies and relationships in the sequential data are captured by the
knowledge contained in the state vectors c(t).
The data sequences might be hundreds or thousands of time steps long, making learning with
a standard RNN exceedingly challenging.
The gradient of the error is computed across T time steps and is used to update the network
parameters. As in a plain RNN, the total error gradient is equal to the sum of T per-time-step
gradients:

∂E/∂W = Σ_{t=1}^{T} ∂E_t/∂W        (3)
For the total error gradient to vanish, all T of these sub-gradients must vanish. Viewing the
sum in (3) as a series, the total gradient vanishes when the series of partial sums
Σ_{t=1}^{k} ∂E_t/∂W (for k = 1, …, T) tends to zero, and hence the series converges to zero.
If we want equation (3) not to vanish, we just need to increase the likelihood that at least one
of the sub-gradients does not vanish; in other words, we must keep the series of sub-gradients
in (3) from converging to zero.
The Error Gradients in the LSTM Network
The gradient of the error for some time step k has the form:

∂E_k/∂W = (∂E_k/∂h_k)·(∂h_k/∂c_k)·(∂c_k/∂c_{k−1})·(∂c_{k−1}/∂c_{k−2})⋯(∂c_2/∂c_1)·(∂c_1/∂W)        (4)

The following product term leads to vanishing gradient problems:

∏_{t=2}^{k} ∂c_t/∂c_{t−1}
The state vector c(t) in the LSTM has the following form:

c_t = σ(W_f·[h_{t−1}, x_t] + b_f) ⊙ c_{t−1} + σ(W_i·[h_{t−1}, x_t] + b_i) ⊙ tanh(W_c·[h_{t−1}, x_t] + b_c)

In short, we can write it in the following format:

c_t = c_{t−1} ⊙ f_t + c̃_t ⊙ i_t        (5)
It's important to remember that the state vector c(t) is a function of the following elements,
which should be considered while computing its derivative during backpropagation: the forget
gate f_t, the input gate i_t, the candidate state c̃_t (each of which depends on c_{t−1} through
h_{t−1}), and the previous cell state c_{t−1} itself.
After computing the derivative of equation (5) with respect to c_{t−1}, we get four terms:

∂c_t/∂c_{t−1} = c_{t−1} ⊙ ∂f_t/∂c_{t−1} + f_t + c̃_t ⊙ ∂i_t/∂c_{t−1} + i_t ⊙ ∂c̃_t/∂c_{t−1}

The first, third and fourth terms are expanded through h_{t−1} = o_{t−1} ⊙ tanh(c_{t−1}) using the
chain rule, while the second term is simply the forget gate's activation. Denoting the four
elements of this derivative of the cell state by A_t, B_t, C_t and D_t (with B_t = f_t), and summing
up all the gradients, we obtain:

∂c_t/∂c_{t−1} = A_t + B_t + C_t + D_t        (6)
Now, plugging equation (6) into equation (4), we obtain the LSTM error gradient:

∂E_k/∂W = (∂E_k/∂h_k)·(∂h_k/∂c_k)·[ ∏_{t=2}^{k} (A_t + B_t + C_t + D_t) ]·(∂c_1/∂W)
Preventing the Error Gradient from Vanishing


The gradient holds the activation vector of the forget gate, which helps the network to better
regulate the gradient values at each time step by updating the forget gate's parameters. The
activation of the forget gate allows the LSTM to select whether or not particular information
should be remembered at each time step and update the model's parameters accordingly.
Let us say that at some time step k < T, the gradient term starts to vanish:

∏_{t=2}^{k} ∂c_t/∂c_{t−1} → 0

Then, at time step k+1, we may find an appropriate parameter update of the forget gate (for
instance, one that drives its activation f_{k+1} towards 1) such that the gradient does not disappear.
The existence of the forget gate's activation vector in the gradient term, together with the
additive structure of (6), allows the LSTM to find such a parameter update at each time step,
yielding a product ∏_{t=2}^{k+1} ∂c_t/∂c_{t−1} that does not tend to zero.
Now the gradient doesn't vanish.
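As an illustrative (and admittedly rough) experiment, one can probe this empirically by comparing how much gradient reaches the first input of a long sequence in a plain RNN versus an LSTM; the exact values depend entirely on the random initialization and the sequence length, so the snippet below only shows how such a measurement would be set up.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
T, d, h = 200, 4, 16                      # a long toy sequence
x = torch.randn(T, 1, d, requires_grad=True)

for name, cell in [("RNN", nn.RNN(d, h)), ("LSTM", nn.LSTM(d, h))]:
    out, _ = cell(x)
    out[-1].sum().backward()              # gradient of the last output w.r.t. all inputs
    # Norm of the gradient that reaches the very first time step.
    print(name, x.grad[0].norm().item())
    x.grad = None                         # reset before trying the next model
```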

 Encoding and decoding in RNN network


Encoder-decoder model in Recurrent Neural Network (RNN):- Encoder-decoder models are a
type of recurrent neural network that are used to learn how to map one sequence of inputs
into another sequence of outputs. The encoder part of the model is responsible for encoding
the input sequence into a fixed-length vector, and the decoder is responsible for decoding the
output vector back into a sequence of outputs. This type of model can be used for tasks such
as machine translation, where the input is a sentence in one language and the output is a
translated version of that sentence in another language.
Encoders with Recurrent Neural Networks: - In general, a text encoder turns text into a
numeric representation. This task can be implemented in many different ways but, in this
tutorial, what we mean by encoders are RNN encoders. Let's see a diagram:

Depending on the textbook, we can also find it drawn in a rolled representation. Each block is
composed of the following elements at time t:


Block Input:
Input vector x_t (encoding the word)
Hidden state vector h_{t−1} (containing the sequence state before the current block)
Block Output:
Output vector o_t, which is not always produced by each block, as we'll see in a few
moments
Weights:
W_x: weights between x_t and h_t
W_h: weights between h_{t−1} and h_t
W_o: weights between h_t and o_t
Decoders with Recurrent Neural Networks: - Unlike encoders, decoders unfold a vector
representing the sequence state and return something meaningful for us like text, tags, or
labels. An essential distinction from encoders is that decoders require both the hidden state
and the output from the previous step. When the decoder starts processing, there is no
previous output, so we use a special token for those cases. Let's make it clearer with the
example below, which shows how machine translation works:
The encoder produces a state C representing the sentence in the source language (English): "I
love learning." Then, the decoder unfolds that state C into the target language (Spanish):
"Amo el aprendizaje." C could be considered a vectorized representation of the whole sequence;
in other words, we could use an encoder as a rough means to obtain embeddings from a
text of arbitrary length, but this is not the proper way to do it, as we'll see in another tutorial.
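A compact PyTorch sketch of this encoder-decoder idea (GRU cells, a shared hidden size, a start-token id of 0, and greedy decoding are all my illustrative assumptions, not the notes' own code):

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, vocab, emb=32, hid=64):
        super().__init__()
        self.embed = nn.Embedding(vocab, emb)
        self.rnn = nn.GRU(emb, hid)

    def forward(self, src):                      # src: (src_len, batch)
        _, state = self.rnn(self.embed(src))
        return state                             # the fixed-length "C" vector

class Decoder(nn.Module):
    def __init__(self, vocab, emb=32, hid=64):
        super().__init__()
        self.embed = nn.Embedding(vocab, emb)
        self.rnn = nn.GRU(emb, hid)
        self.out = nn.Linear(hid, vocab)

    def forward(self, prev_token, state):        # one decoding step
        output, state = self.rnn(self.embed(prev_token), state)
        return self.out(output), state

# Toy usage: encode a source sentence, then decode greedily from a <sos> token (id 0).
enc, dec = Encoder(vocab=100), Decoder(vocab=100)
src = torch.randint(0, 100, (7, 1))              # 7 source tokens, batch of 1
state = enc(src)
token = torch.zeros(1, 1, dtype=torch.long)      # assumed start-of-sequence id
for _ in range(5):
    logits, state = dec(token, state)
    token = logits.argmax(-1)                    # feed the prediction back in
```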
 Attention:
A neural network is considered to be an effort to mimic human brain actions in a simplified
manner. Attention Mechanism is also an attempt to implement the same action of selectively
concentrating on a few relevant things, while ignoring others in deep neural networks.
Let me explain what this means. Let's say you are seeing a group photo of your first school.
Typically, there will be a group of children sitting across several rows, and the teacher will sit
somewhere in between. Now, if anyone asks the question, "How many people are there?",
how will you answer it?
Simply by counting heads, right? You don't need to consider any other things in the photo.
Now, if anyone asks a different question, "Who is the teacher in the photo?", your brain
knows exactly what to do. It will simply start looking for the features of an adult in the photo.
The rest of the features will simply be ignored. This is the 'Attention' which our brain is very
adept at implementing.

Understanding the Attention Mechanism


This is the diagram of the Attention model shown in Bahdanau's paper. The Bidirectional
LSTM used here generates a sequence of annotations (h1, h2, …, hTx) for each input
sentence. All the vectors h1, h2, …, etc., used in their work are basically the concatenation of
forward and backward hidden states in the encoder.

To put it in simple terms, all the vectors h1,h2,h3…., hTx are representations of Tx number
of words in the input sentence. In the simple encoder and decoder model, only the last state of
the encoder LSTM was used (hTx in this case) as the context vector.
But Bahdanau et al put emphasis on embeddings of all the words in the input (represented by
hidden states) while creating the context vector. They did this by simply taking a weighted
sum of the hidden states.
Now, the question is: how should the weights be calculated? Well, the weights are also
learned by a feed-forward neural network, and I've mentioned their mathematical equations
below.
The context vector ci for the output word yi is generated using the weighted sum of the
annotations:

ci = Σ_{j=1}^{Tx} αij·hj

The weights αij are computed by a softmax function given by the following equation:

αij = exp(eij) / Σ_{k=1}^{Tx} exp(eik),   with   eij = a(si−1, hj)

eij is the output score of a feedforward neural network described by the function a that
attempts to capture the alignment between the input at j and the output at i.
Basically, if the encoder produces Tx number of “annotations” (the hidden state vectors) each
having dimension d, then the input dimension of the feedforward network is (Tx , 2d)
(assuming the previous state of the decoder also has d dimensions and these two vectors are
concatenated). This input is multiplied with a matrix Wa of (2d, 1) dimensions (of course
followed by addition of the bias term) to get scores eij (having a dimension (Tx , 1)).
On top of these eij scores, a hyperbolic tangent function is applied, followed by a softmax, to
get the normalized alignment scores for output i:
E = I [Tx × 2d] * Wa [2d × 1] + B [Tx × 1]
α = softmax(tanh(E))
C = Iᵀ * α
So, α is a (Tx, 1) dimensional vector and its elements are the weights corresponding to each
word in the input sentence.
Let α be [0.2, 0.3, 0.3, 0.2] and let the input sentence be "I am doing it". Here, the context
vector corresponding to it will be:
C = 0.2*I"I" + 0.3*I"am" + 0.3*I"doing" + 0.2*I"it"   [Ix is the hidden state corresponding
to the word x]
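The same computation as a small NumPy sketch (the dimensions and random values are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
Tx, d = 4, 8                                  # 4 input words, annotation size d

H = rng.normal(size=(Tx, d))                  # encoder annotations h_1..h_Tx
s_prev = rng.normal(size=d)                   # previous decoder state s_{i-1}
Wa = rng.normal(size=(2 * d, 1))              # alignment weight matrix
B = rng.normal(size=(Tx, 1))                  # bias term

# Each row of I concatenates an annotation with the previous decoder state -> (Tx, 2d).
I = np.hstack([H, np.tile(s_prev, (Tx, 1))])
E = I @ Wa + B                                # raw scores, shape (Tx, 1)
scores = np.tanh(E).ravel()
alpha = np.exp(scores) / np.exp(scores).sum() # softmax -> attention weights alpha_ij

context = alpha @ H                           # c_i = sum_j alpha_ij * h_j, shape (d,)
print(alpha.round(2), context.shape)
```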
 Attention Over Image
The attention mechanism is a complex cognitive ability that human beings possess. When
people receive information, they can consciously focus on a part of the key information while
ignoring other, secondary information.
This ability of self-selection is called attention. The attention mechanism gives a neural
network the ability to focus on a subset of its inputs and select specific features.
It was originally designed in the context of Neural Machine Translation using Seq2Seq
Models, but today we'll take a look at its implementation in Image Captioning.
Rather than compressing an entire image into a static representation, the Attention
mechanism allows for salient features to dynamically come to the forefront as and when
needed. This is especially important when there is a lot of clutter in an image.
Let's take an example to understand this better:

Our aim would be to generate a caption like "two white dogs are running on the snow". To
accomplish this, we will see how to implement a specific type of Attention mechanism called
Bahdanau's Attention or Local Attention.

In this way, we can see what parts of the image the model focuses on as it generates a
caption. This implementation will require a strong background in deep learning.
Approach to the problem statement
The encoder-decoder image captioning system would encode the image, using a pre-trained
Convolutional Neural Network that would produce a hidden state. Then, it would decode this
hidden state by using an LSTM and generate a caption.
For each sequence element, outputs from previous elements are used as inputs, in
combination with new sequence data. This gives the RNN networks a sort of memory which
might make captions more informative and context aware.
But RNNs tend to be computationally expensive to train and evaluate, so in practice, memory
is limited to just a few elements. Attention models can help address this problem by selecting
the most relevant elements from an input image.
With an Attention mechanism, the image is first divided into n parts, and we compute an
image representation of each part. When the RNN is generating a new word, the attention
mechanism focuses on the relevant part of the image, so the decoder only uses specific
parts of the image.

 Hierarchical Attention
Text classification is a task at which humans are astonishingly good. In more general terms, we
can say Artificial Intelligence is the field that tries to achieve human-like intelligent models to
simplify tasks for all of us. We have excellent proficiency in text classification, but even
many advanced NLP models have failed to achieve mastery anywhere close to it. So the question
arises: what do we humans do differently? How do we classify text?
We know that, at a lower level, characters form words, words form sentences, and sentences
make up a document. We can guess many unknown words just from the structure of a sentence.
We then interpret the message that a series of sentences imparts, and from that series of
sentences we understand the meaning of a paragraph. The Hierarchical Attention model does
something similar.
A Hierarchical Attention Network uses stacked recurrent neural networks at the word level,
followed by an attention network. The goal is to extract the words that are important to the
meaning of the entire sentence and aggregate these informative words to form a sentence
vector. The same technique is applied to the derived sentence vectors, which generates a
vector that captures the meaning of the given document; that vector is then passed on for text
classification.
The intention is to derive the meaning of a sentence from its informative words and the
meaning of the document from those informative sentences. Not all words are equally important;
some words characterize a sentence more than others. So, we use the attention
network so that the sentence vector can pay more attention to the informative words.
The attention model consists of two parts:
1) Bidirectional Recurrent Neural Network
2) Attention networks.
The bidirectional RNN learns the meaning behind those sequences of words and returns a vector
corresponding to each word.
The attention network computes a weight for each word vector using a small external
neural network. It then aggregates the representations of these words to form a sentence
vector, i.e. it computes the weighted sum of the word vectors. This weighted sum
represents the entire sentence. The same steps are applied to the sentence vectors, so that the
resulting vector captures the gist of the whole document. Because it applies attention at two
levels, the model is called a Hierarchical Attention Network.
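A compact PyTorch sketch of this two-level idea (word-level attention followed by sentence-level attention); the sizes, the GRU choice, and the simple linear scoring function are illustrative assumptions rather than the exact formulation of the original Hierarchical Attention Network paper:

```python
import torch
import torch.nn as nn

class AttentionPool(nn.Module):
    """Scores each time step and returns the attention-weighted sum."""
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, seq):                       # seq: (steps, dim)
        weights = torch.softmax(self.score(seq), dim=0)
        return (weights * seq).sum(dim=0)         # weighted sum -> (dim,)

class HierarchicalAttention(nn.Module):
    def __init__(self, vocab=1000, emb=32, hid=32, classes=3):
        super().__init__()
        self.embed = nn.Embedding(vocab, emb)
        self.word_rnn = nn.GRU(emb, hid, bidirectional=True)
        self.word_attn = AttentionPool(2 * hid)
        self.sent_rnn = nn.GRU(2 * hid, hid, bidirectional=True)
        self.sent_attn = AttentionPool(2 * hid)
        self.classify = nn.Linear(2 * hid, classes)

    def forward(self, doc):                       # doc: list of 1-D LongTensors (sentences)
        sent_vecs = []
        for sentence in doc:
            words, _ = self.word_rnn(self.embed(sentence).unsqueeze(1))
            sent_vecs.append(self.word_attn(words.squeeze(1)))   # sentence vector
        sents, _ = self.sent_rnn(torch.stack(sent_vecs).unsqueeze(1))
        doc_vec = self.sent_attn(sents.squeeze(1))               # document vector
        return self.classify(doc_vec)

# Toy document: 3 sentences of different lengths (word ids are arbitrary).
doc = [torch.randint(0, 1000, (n,)) for n in (5, 8, 4)]
logits = HierarchicalAttention()(doc)
print(logits.shape)   # torch.Size([3])
```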
