Gated Recurrent Units (GRUs) are a type of recurrent neural network (RNN) architecture similar to Long Short-Term Memory (LSTM) networks. GRUs were introduced as a simpler alternative to LSTMs while achieving comparable performance on sequential tasks. Here's an explanation of how GRUs work:

1. **Gating Mechanisms**: Like LSTMs, GRUs use gating mechanisms to control the flow of information within the network. However, GRUs incorporate only two gates: an update gate (\(z_t\)) and a reset gate (\(r_t\)).

2. **Reset Gate**: The reset gate (\(r_t\)) determines how much of the past information should be forgotten. It is computed from the current input (\(x_t\)) and the previous hidden state (\(h_{t-1}\)) using a sigmoid activation function. The reset gate decides which parts of the past hidden state should be considered when computing the candidate activation.

3. **Update Gate**: The update gate (\(z_t\)) decides how much of the new candidate activation should flow into the updated hidden state. It is computed in the same way as the reset gate (with its own weights) and controls the trade-off between the new candidate activation (\(\tilde{h}_t\)) and the previous hidden state (\(h_{t-1}\)).

4. **Candidate Activation**: The candidate activation (\(\tilde{h}_t\)) is a proposed update to the hidden state at the current timestep. It is computed using the current input (\(x_t\)) and a reset-gate-modulated version of the previous hidden state (\(h_{t-1}\)). This candidate activation is then combined with the previous hidden state to produce the updated hidden state (\(h_t\)).

5. **Mathematical Formulation**: The computations in a GRU cell can be summarized as follows (see the code sketch after this list):

- Reset Gate: \(r_t = \sigma(W_r \cdot [h_{t-1}, x_t] + b_r)\)
- Update Gate: \(z_t = \sigma(W_z \cdot [h_{t-1}, x_t] + b_z)\)
- Candidate Activation: \(\tilde{h}_t = \tanh(W_h \cdot [r_t \odot h_{t-1}, x_t] + b_h)\)
- Updated Hidden State: \(h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t\)

Here \(W_r\), \(W_z\), and \(W_h\) are weight matrices, \(b_r\), \(b_z\), and \(b_h\) are bias vectors, \(\sigma\) represents the sigmoid function, \([\cdot,\cdot]\) denotes concatenation, and \(\odot\) denotes element-wise multiplication.

6. **Training**: GRUs are trained using gradient-based optimization algorithms such as stochastic
gradient descent (SGD) or Adam. The parameters of the GRU cells, including the weights and biases,
are updated iteratively to minimize a loss function that measures the discrepancy between the
predicted output and the ground truth.
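
To make the formulation concrete, here is a minimal NumPy sketch of a single GRU step that follows the equations in point 5 directly. The function name `gru_cell`, the parameter shapes, and the random initialization are illustrative assumptions, not the API of any particular library.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_cell(x_t, h_prev, W_r, W_z, W_h, b_r, b_z, b_h):
    """One GRU step, following the equations in point 5.

    x_t    : current input, shape (input_size,)
    h_prev : previous hidden state h_{t-1}, shape (hidden_size,)
    W_*    : weight matrices, shape (hidden_size, hidden_size + input_size)
    b_*    : bias vectors, shape (hidden_size,)
    """
    concat = np.concatenate([h_prev, x_t])               # [h_{t-1}, x_t]
    r_t = sigmoid(W_r @ concat + b_r)                    # reset gate
    z_t = sigmoid(W_z @ concat + b_z)                    # update gate
    concat_reset = np.concatenate([r_t * h_prev, x_t])   # [r_t ⊙ h_{t-1}, x_t]
    h_tilde = np.tanh(W_h @ concat_reset + b_h)          # candidate activation
    h_t = (1.0 - z_t) * h_prev + z_t * h_tilde           # blend old state and candidate
    return h_t

# Illustrative usage with randomly initialized parameters.
input_size, hidden_size = 4, 3
rng = np.random.default_rng(0)
W_r = 0.1 * rng.standard_normal((hidden_size, hidden_size + input_size))
W_z = 0.1 * rng.standard_normal((hidden_size, hidden_size + input_size))
W_h = 0.1 * rng.standard_normal((hidden_size, hidden_size + input_size))
b_r = b_z = b_h = np.zeros(hidden_size)

h_t = gru_cell(rng.standard_normal(input_size), np.zeros(hidden_size),
               W_r, W_z, W_h, b_r, b_z, b_h)
print(h_t.shape)  # (3,)
```

Processing a full sequence simply means calling `gru_cell` once per timestep, feeding each returned \(h_t\) back in as `h_prev`.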

GRUs have fewer parameters compared to LSTMs due to their simpler architecture, which makes
them faster to train and more computationally efficient. They have been successfully applied in
various sequential tasks, including natural language processing, speech recognition, and time series
prediction. However, the choice between using GRUs or LSTMs often depends on the specific
requirements of the task at hand and empirical performance comparisons.
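
As a rough illustration of the parameter difference, the sketch below compares `torch.nn.GRU` and `torch.nn.LSTM` layers of the same size and runs a single Adam update on dummy data, as described in point 6. It assumes PyTorch is installed; the layer sizes, dummy tensors, and mean-squared-error loss are arbitrary choices made for the example.

```python
import torch
import torch.nn as nn

input_size, hidden_size = 64, 128
gru = nn.GRU(input_size, hidden_size, batch_first=True)
lstm = nn.LSTM(input_size, hidden_size, batch_first=True)

def n_params(module):
    return sum(p.numel() for p in module.parameters())

print("GRU parameters: ", n_params(gru))   # three gate/candidate weight blocks
print("LSTM parameters:", n_params(lstm))  # four weight blocks, so more parameters

# One gradient-based update with Adam on dummy data.
x = torch.randn(8, 20, input_size)          # (batch, time, features)
target = torch.randn(8, 20, hidden_size)    # dummy regression target
optimizer = torch.optim.Adam(gru.parameters(), lr=1e-3)

output, _ = gru(x)                          # output: (batch, time, hidden_size)
loss = nn.functional.mse_loss(output, target)
optimizer.zero_grad()
loss.backward()
optimizer.step()
print("loss:", loss.item())
```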
