
UNIT-II

UNIT II: Training Neural Network: Risk minimization, loss function, back propagation,
regularization, model selection, and optimization.
Conditional Random Fields: Linear chain, partition function, Markov network, Belief
propagation, Training CRFs, Hidden Markov Model, Entropy.

Training Neural Network:


Imagine you are a mountain climber on top of a mountain, and night has fallen. You need to
get to your base camp at the bottom of the mountain, but in the darkness with only your dinky
flashlight, you can’t see more than a few feet of the ground in front of you. So how do you get
down? One strategy is to look in every direction to see which way the ground slopes downward most steeply, and then step forward in that direction. Repeat this process many times, and you will
gradually go farther and farther downhill. You may sometimes get stuck in a small trough or valley,
in which case you can follow your momentum for a bit longer to get out of it. Caveats aside, this
strategy will eventually get you to the bottom of the mountain.
The primary technique for doing so, gradient descent, works much like the strategy we just described.

Given the right type of data, a fairly simple model can provide better and faster results than a complex DNN. So, whether you are working with computer vision, natural language processing, statistical modelling, etc., try to preprocess your raw data. A few measures one can take to get better training data:

 Get your hands on as large a dataset as possible (DNNs are quite data-hungry: more is better)
 Remove any training sample with corrupted data (short texts, highly distorted images,
spurious output labels, features with lots of null values, etc.)
 Data Augmentation - create new examples (in case of images - rescale, add noise, etc.)

Choose appropriate activation functions:


For years, sigmoid activation functions were the preferred choice. But a sigmoid function is inherently cursed by these two drawbacks:
1) Saturation of the sigmoid at its tails (which in turn causes the vanishing gradient problem).
2) Sigmoid outputs are not zero-centered.
A better alternative is the tanh function - mathematically, tanh is just a rescaled and shifted sigmoid, tanh(x) = 2*sigmoid(2x) - 1. Although tanh can still suffer from the vanishing gradient problem, the good news is that tanh is zero-centered. Hence, using tanh as the activation function results in faster convergence. I have found that using tanh as activations generally works better than sigmoid.
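A quick numerical check of this relationship (a minimal NumPy sketch; the input values are arbitrary):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.linspace(-5, 5, 101)
# tanh is a rescaled, shifted sigmoid: tanh(x) = 2*sigmoid(2x) - 1
assert np.allclose(np.tanh(x), 2 * sigmoid(2 * x) - 1)

# sigmoid saturates towards 0/1 at the tails; tanh is zero-centered
print(sigmoid(np.array([-5.0, 0.0, 5.0])))  # ~ [0.0067, 0.5, 0.9933]
print(np.tanh(np.array([-5.0, 0.0, 5.0])))  # ~ [-0.9999, 0.0, 0.9999]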
Vanishing and exploding gradient problems: In a network of n hidden layers, n derivatives will be multiplied together as the error propagates back through the model. If those derivatives are small, the gradient shrinks exponentially until it effectively vanishes - this is the vanishing gradient problem. If the derivatives are large, the gradient grows exponentially until it eventually explodes - this is what we call the problem of exploding gradients.
Number of Hidden Units and Layers:
Keeping a larger number of hidden units than the optimal number is generally a safe bet, since any regularization method will take care of superfluous units, at least to some extent. On the other hand, with fewer hidden units than the optimal number, there is a higher chance of underfitting the model.
Weight Initialization:
Always initialize the weights with small random numbers to break the symmetry between different units. If the weights are initialized to very large numbers, the sigmoid will saturate (tail regions), resulting in dead neurons.
If the weights are very small, then the gradients will also be small. Therefore, it's preferable to choose weights in an intermediate range, such that they are distributed evenly around a mean value.
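A minimal sketch of such an initialization in NumPy (the layer sizes and the 1/sqrt(fan_in) scale are illustrative assumptions, one common heuristic among several):

import numpy as np

rng = np.random.default_rng(0)
fan_in, fan_out = 784, 128

# Small random weights break symmetry without saturating sigmoid/tanh units.
# Scaling by 1/sqrt(fan_in) keeps pre-activations in the non-saturated region.
W = rng.normal(0.0, 1.0 / np.sqrt(fan_in), size=(fan_in, fan_out))
b = np.zeros(fan_out)  # biases can safely start at zero

print(W.mean(), W.std())  # roughly 0 and 1/sqrt(784) = 0.036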
Learning Rates:
Set the learning rate too small and your model might take ages to converge; make it too large and, within the first few training examples, your loss might shoot up to the sky. Generally, a learning rate of 0.01 is a safe bet, but this shouldn't be taken as a stringent rule, since the optimal learning rate depends on the specific task.

Risk minimization:
If we compute the loss using the data points in our dataset, it is called the empirical risk. It is "empirical" and not "true" because we are using a dataset that is only a subset of the whole population. The process of finding the function that minimizes this quantity is called empirical risk minimization. Ideally, we would like to minimize the true risk, which is also called the generalization error. The reason is that, in most problems, we do not have access to the whole domain X of inputs, but only to our training subset S. We want to generalize based on S, which is also called inductive learning.
The goal of a machine learning algorithm is to reduce the expected generalization error, given by

J*(θ) = E_{(x,y)~p_data} L(f(x; θ), y)

This quantity is known as the risk. We emphasize here that the expectation is taken over the true underlying distribution p_data. If we knew the true distribution p_data(x, y), risk minimization would be an optimization task solvable by an optimization algorithm. However, when we do not know p_data(x, y) but only have a training set of samples, we have a machine learning problem.

The simplest way to convert a machine learning problem back into an optimization problem is to minimize the expected loss on the training set. This means replacing the true distribution p(x, y) with the empirical distribution p̂(x, y) defined by the training set. We now minimize the empirical risk

E_{(x,y)~p̂(x,y)} L(f(x; θ), y) = (1/m) ∑_{i=1}^{m} L(f(x^(i); θ), y^(i))

where m is the number of training examples.


The training process based on minimizing this average training error is known as empirical
risk minimization. In this setting, machine learning is still very similar to straightforward
optimization. Rather than optimizing the risk directly, we optimize the empirical risk, and hope that
the risk decreases significantly as well. A variety of theoretical results establish conditions under
which the true risk can be expected to decrease by various amounts.
However, empirical risk minimization is prone to overfitting. Models with high capacity can
simply memorize the training set. In many cases, empirical risk minimization is not really feasible.
The most effective modern optimization algorithms are based on gradient descent, but many useful
loss functions, such as 0-1 loss, have no useful derivatives (the derivative is either zero or undefined
everywhere). These two problems mean that, in the context of deep learning, we rarely use
empirical risk minimization. Instead, we must use a slightly different approach, in which the
quantity that we actually optimize is even more different from the quantity that we truly want to
optimize.
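As a minimal illustration (a hypothetical toy model and made-up data, not from the source), the empirical risk is simply the average loss over the m training examples:

import numpy as np

def squared_loss(y_hat, y):
    return (y_hat - y) ** 2

def empirical_risk(model, X, Y):
    # Average loss over the m training examples
    m = len(X)
    return sum(squared_loss(model(x), y) for x, y in zip(X, Y)) / m

# Hypothetical linear model f(x; w) = w * x and made-up data
w = 0.5
model = lambda x: w * x
X = np.array([1.0, 2.0, 3.0])
Y = np.array([1.1, 1.9, 3.2])
print(empirical_risk(model, X, Y))  # the quantity ERM would minimize over w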
Overfitting:
Overfitting happens when a model learns the detail and noise in the training data to the extent that it
negatively impacts the performance of the model on new data. This means that the noise or random
fluctuations in the training data is picked up and learned as concepts by the model.

https://www.youtube.com/watch?v=BqzgUnrNhFM
Bias: error from overly simple assumptions; a high-bias model underfits and performs poorly even on the training data.
Variance: error from sensitivity to fluctuations in the training data; a high-variance model fits the training data well but performs poorly on test data.

Loss function:
The function we want to minimize or maximize is called the objective function or criterion.
When we are minimizing it, we may also call it the cost function, loss function, or error function.
The loss function is one of the important components of neural networks. The loss is nothing but the prediction error of the neural net, and the method used to calculate it is called the loss function. In simple words, the loss is the quantity we differentiate to calculate the gradients.

Neural networks use optimization strategies like stochastic gradient descent to minimize the error in the algorithm. The way we actually compute this error is by using a loss function, which quantifies how well or how poorly the model is performing. Loss functions are divided into two categories: regression losses and classification losses.
Regression Loss:
Regression Loss is used when we are predicting continuous values like the price of a house or sales
of a company.

 Mean Squared Error

Mean Squared Error is the mean of the squared differences between the actual and predicted values. Because the differences are squared, the model penalizes large errors heavily.
 Mean Squared Logarithmic Error Loss

Suppose we want to reduce the impact of large differences between the actual and predicted values: we can take the natural logarithm of the actual and predicted values and then compute the mean squared error on those. This softens the heavy penalty imposed by the Mean Squared Error method; the model is now penalized less for large differences.

 Mean Absolute Error Loss

Sometimes there may be data points that lie far away from the rest of the points, i.e., outliers. In such cases Mean Absolute Error Loss is appropriate to use, as it calculates the average of the absolute differences between the actual and predicted values.
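The three regression losses above can be computed in a few lines of NumPy (the example values are made up for illustration):

import numpy as np

y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.5, 5.0, 4.0, 8.0])

# Mean Squared Error: large differences are penalized quadratically
mse = np.mean((y_true - y_pred) ** 2)

# Mean Squared Logarithmic Error: log1p damps the penalty on large values
msle = np.mean((np.log1p(y_true) - np.log1p(y_pred)) ** 2)

# Mean Absolute Error: linear penalty, more robust to outliers
mae = np.mean(np.abs(y_true - y_pred))

print(mse, msle, mae)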
Binary Classification Loss Function
Suppose we are dealing with a Yes/No situation like “a person has diabetes or not”, in this kind of
scenario Binary Classification Loss Function is used.

 Binary Cross Entropy Loss

It is used when the model outputs a probability between 0 and 1 for a classification task. Cross-entropy measures the average dissimilarity between the predicted and the actual probability distributions.

 Hinge Loss

This type of loss is used when the target variable has 1 or -1 as class labels. It penalizes the model
when there is a difference in the sign between the actual and predicted class values. Used in SVM
Models.
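A corresponding sketch for the two classification losses (labels and scores are illustrative; the clipping constant is an implementation detail to avoid log(0)):

import numpy as np

# Binary cross-entropy over predicted probabilities p for true labels y in {0, 1}
def binary_cross_entropy(y, p, eps=1e-12):
    p = np.clip(p, eps, 1 - eps)  # avoid log(0)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

# Hinge loss for labels y in {-1, +1} and raw scores s
def hinge_loss(y, s):
    return np.mean(np.maximum(0.0, 1.0 - y * s))

y = np.array([1, 0, 1, 1])
p = np.array([0.9, 0.2, 0.7, 0.4])
print(binary_cross_entropy(y, p))

y_pm = np.array([1, -1, 1, 1])
s = np.array([0.8, -0.5, 0.3, -0.2])
print(hinge_loss(y_pm, s))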
Multi-Class Classification Loss Function
If we take a dataset like Iris, where we need to predict the three class labels Setosa, Versicolor and Virginica, i.e., where the target variable has more than two classes, a multi-class classification loss function is used.

Gradient descent
Gradient descent is an optimization algorithm that's used when training a machine learning model.
It's based on a convex function and tweaks its parameters iteratively to minimize a given function to
its local minimum

When the gradient of the loss with respect to the weights is computed over all the data points, it is (batch) gradient descent (GD). If it is computed for one point at a time, it is stochastic gradient descent (SGD). If it is computed over K points, where 1 < K < N and N is the total number of data points, it is mini-batch SGD.
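The following NumPy sketch makes the distinction concrete on a toy linear-regression problem; setting batch_size to N gives batch GD, to 1 gives SGD, and to 1 < K < N gives mini-batch SGD (all values here are illustrative):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                  # N = 100 data points
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=100)

def grad(w, Xb, yb):
    # Gradient of the mean squared error of a linear model on one batch
    return 2 * Xb.T @ (Xb @ w - yb) / len(yb)

w = np.zeros(3)
lr, batch_size = 0.01, 16                      # N -> GD, 1 -> SGD, K -> mini-batch
for epoch in range(200):
    idx = rng.permutation(len(X))
    for start in range(0, len(X), batch_size):
        b = idx[start:start + batch_size]
        w -= lr * grad(w, X[b], y[b])

print(w)  # should end up close to true_w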

Local Minimum Vs Global Minimum


A local minimum of a function (typically a cost function in machine learning, which is something we want to minimize based on empirical data) is a point in the domain of the function where the function's value is less than or equal to its value at every point in some neighborhood around it.
On the other hand, a global minimum of a function minimizes the function on its entire domain, and not just on a neighborhood of the minimum. In other words, the function evaluated at the global minimum is less than or equal to the function evaluated at any other point.

Back propagation:
Back-propagation is just a way of propagating the total loss back into the neural network to
know how much of the loss every node is responsible for, and subsequently updating the weights in
such a way that minimizes the loss by giving the nodes with higher error rates lower weights and
vice versa.
Backpropagation is the essence of neural network training. It is the method of fine-tuning
the weights of a neural network based on the error rate obtained in the previous epoch (i.e.,
iteration). Proper tuning of the weights allows you to reduce error rates and make the model reliable
by increasing its generalization.
The backpropagation algorithm in a neural network computes the gradient of the loss function with respect to a single weight by the chain rule. It efficiently computes one layer at a time, unlike a naive direct computation. It computes the gradient, but it does not define how the gradient is used. It generalizes the computation in the delta rule.

1) Inputs X arrive through the preconnected path.
2) The input is modeled using real weights W. The weights are usually randomly selected.
3) Calculate the output of every neuron from the input layer, through the hidden layers, to the output layer.
4) Calculate the error in the outputs:
Error = Actual Output – Desired Output
5) Travel back from the output layer to the hidden layers to adjust the weights such that the error is decreased.
6) Keep repeating the process until the desired output is achieved.
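A minimal NumPy sketch of these steps on the XOR problem (one hidden layer; the layer sizes, learning rate, and iteration count are illustrative assumptions, and convergence may vary with the random seed):

import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda x: 1 / (1 + np.exp(-x))

# Toy data: learn XOR with one hidden layer
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

W1 = rng.normal(0, 0.5, (2, 4)); b1 = np.zeros(4)   # step 2: random weights
W2 = rng.normal(0, 0.5, (4, 1)); b2 = np.zeros(1)
lr = 1.0

for step in range(5000):
    # Steps 1 and 3: forward pass, input -> hidden -> output
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)
    # Steps 4 and 5: error at the output, then chain rule backwards layer by layer
    d_out = (out - y) * out * (1 - out)   # d(loss)/d(output pre-activation)
    d_h = (d_out @ W2.T) * h * (1 - h)    # propagated back to the hidden layer
    # Update weights in the direction that reduces the error
    W2 -= lr * h.T @ d_out; b2 -= lr * d_out.sum(0)
    W1 -= lr * X.T @ d_h;   b1 -= lr * d_h.sum(0)

print(out.round(2).ravel())  # should approach [0, 1, 1, 0] for this seed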
A feedforward neural network is an artificial neural network where the nodes never form a cycle.
This kind of neural network has an input layer, hidden layers, and an output layer. It is the first and
simplest type of artificial neural network.
Most prominent advantages of Backpropagation are:

 Backpropagation is fast, simple and easy to program


 It has no parameters to tune apart from the number of inputs
 It is a flexible method as it does not require prior knowledge about the network
 It is a standard method that generally works well
 It does not need any special mention of the features of the function to be learned

Two Types of Backpropagation Networks are:

 Static Back-propagation

It is one kind of backpropagation network which produces a mapping of a static input for
static output. It is useful to solve static classification issues like optical character
recognition.

 Recurrent Backpropagation

In recurrent backpropagation, activations are fed forward until a fixed value is achieved. After that, the error is computed and propagated backward.
Backpropagation is especially useful for deep neural networks working on error-prone projects,
such as image or speech recognition.

Regularization:
One of the most common problems data science professionals face is to avoid overfitting.
Have you come across a situation where your model performed exceptionally well on the training data but was not able to predict the test data?
Avoiding overfitting can single-handedly improve our model's performance. Here we will understand the concept of overfitting and how regularization helps in overcoming it.
Regularization:
Regularization is a technique used in machine learning and deep learning to prevent
overfitting and improve the generalization performance of a model. It involves adding a penalty
term to the loss function during training.
This penalty discourages the model from becoming too complex or having large parameter
values, which helps in controlling the model’s ability to fit noise in the training data. Regularization
methods include L1 and L2 regularization, dropout, early stopping, and more. By applying
regularization, models become more robust and better at making accurate predictions on unseen
data.

Before we deep dive into the topic, take a look at this image:

Have you seen this image before? As we move towards the right in this image, our model tries to learn the details and the noise in the training data too well, which ultimately results in poor performance on unseen data.

In other words, while going towards the right, the complexity of the model increases such that the
training error reduces but the testing error doesn’t. This is shown in the image below.

If you've built a neural network before, you know how complex they are. This makes them more prone to overfitting.

Regularization is a technique which makes slight modifications to the learning algorithm such that the model generalizes better. This in turn improves the model's performance on unseen data as well.

How does Regularization help reduce Overfitting?

Let’s consider a neural network which is overfitting on the training data as shown in the image
below.

If you have studied the concept of regularization in machine learning, you will have a fair idea
that regularization penalizes the coefficients. In deep learning, it actually penalizes the weight
matrices of the nodes.

Assume that our regularization coefficient is so high that some of the weight matrices are nearly
equal to zero.

This will result in a much simpler linear network and slight underfitting of the training data.
Such a large value of the regularization coefficient is not that useful. We need to optimize the value
of regularization coefficient in order to obtain a well-fitted model as shown in the image below.

Different Regularization Techniques in Deep Learning

Now that we have an understanding of how regularization helps in reducing overfitting, we’ll learn
a few different techniques in order to apply regularization in deep learning.

L2 & L1 regularization
L1 and L2 are the most common types of regularization. These update the general cost function by
adding another term known as the regularization term.
 Cost function = Loss (say, binary cross entropy) + Regularization term

Due to the addition of this regularization term, the values of weight matrices decrease because it
assumes that a neural network with smaller weight matrices leads to simpler models. Therefore, it
will also reduce overfitting to quite an extent.

However, this regularization term differs in L1 and L2.

In L2, we have:

Cost function = Loss + (λ / 2m) × ∑ ||w||²

Here, lambda is the regularization parameter. It is the hyperparameter whose value is tuned for better results. L2 regularization is also known as weight decay, as it forces the weights to decay towards zero (but not exactly zero).

In L1, we have:

Cost function = Loss + (λ / 2m) × ∑ ||w||

In this, we penalize the absolute values of the weights. Unlike L2, the weights may be reduced to exactly zero here. Hence, it is very useful when we are trying to compress our model. Otherwise, we usually prefer L2 over it.
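A sketch of how these penalty terms are added to the data loss (NumPy; the weights, λ, and base loss value are made-up placeholders):

import numpy as np

def l2_penalty(weights, lam, m):
    # lambda/(2m) * sum of squared weights (weight decay)
    return lam / (2 * m) * sum(np.sum(W ** 2) for W in weights)

def l1_penalty(weights, lam, m):
    # lambda/(2m) * sum of absolute weights (can drive weights to exactly zero)
    return lam / (2 * m) * sum(np.sum(np.abs(W)) for W in weights)

weights = [np.array([[0.5, -1.2], [0.3, 0.0]])]
m, lam = 100, 0.1
base_loss = 0.42  # hypothetical data loss (e.g. binary cross-entropy)
print(base_loss + l2_penalty(weights, lam, m))
print(base_loss + l1_penalty(weights, lam, m))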

Dropout
This is one of the most interesting types of regularization techniques. It also produces very good results and is consequently the most frequently used regularization technique in the field of deep learning.

To understand dropout, let’s say our neural network structure is akin to the one shown below:

So what does dropout do? At every iteration, it randomly selects some nodes and removes them
along with all of their incoming and outgoing connections as shown below.

So each iteration has a different set of nodes and this results in a different set of outputs. It can also
be thought of as an ensemble technique in machine learning.
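A minimal sketch of inverted dropout, the variant commonly used in practice (the keep probability and activations are illustrative assumptions):

import numpy as np

rng = np.random.default_rng(0)

def dropout(h, keep_prob=0.8, training=True):
    # Inverted dropout: zero random units, rescale the rest by 1/keep_prob
    # so activations keep the same expected value at test time.
    if not training:
        return h
    mask = rng.random(h.shape) < keep_prob
    return h * mask / keep_prob

h = rng.normal(size=(2, 6))       # hypothetical hidden-layer activations
print(dropout(h, keep_prob=0.8))  # a different random subset is dropped each call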

Data Augmentation:
The simplest way to reduce overfitting is to increase the size of the training data. In machine learning, we are often unable to increase the size of the training data because labeled data is too costly.

But, now let’s consider we are dealing with images. In this case, there are a few ways of increasing
the size of the training data – rotating the image, flipping, scaling, shifting, etc. In the below image,
some transformation has been done on the handwritten digits dataset.

This technique is known as data augmentation. This usually provides a big leap in improving the
accuracy of the model. It can be considered as a mandatory trick in order to improve our
predictions.
Early stopping
Early stopping is a kind of cross-validation strategy where we keep one part of the training set as
the validation set. When we see that the performance on the validation set is getting worse, we
immediately stop the training on the model. This is known as early stopping.

In the above image, we will stop training at the dotted line since after that our model will start
overfitting on the training data.
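In pseudocode terms, early stopping is a loop that watches the validation loss; here is a hedged sketch where train_step and validate are assumed caller-supplied callbacks (the patience threshold is illustrative):

import numpy as np

def train_with_early_stopping(train_step, validate, max_epochs=100, patience=5):
    # Stop when the validation loss hasn't improved for `patience` epochs.
    best_loss, best_epoch, wait = np.inf, 0, 0
    for epoch in range(max_epochs):
        train_step()
        val_loss = validate()
        if val_loss < best_loss:
            best_loss, best_epoch, wait = val_loss, epoch, 0
            # in practice, also checkpoint the model weights here
        else:
            wait += 1
            if wait >= patience:
                break
    return best_epoch, best_loss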

Model selection:
Model selection is the process of selecting one final machine learning model from among a collection of candidate machine learning models for a training dataset, i.e., choosing the one model that best addresses the problem.
Model selection is a process that can be applied both across different types of models (e.g. logistic regression, SVM, KNN, etc.) and across models of the same type configured with different hyperparameters (e.g. different kernels in an SVM).
All models have some predictive error, given the statistical noise in the data, the
incompleteness of the data sample, and the limitations of each different model type. Therefore, the
notion of a perfect or best model is not useful. Instead, we must seek a model that is “good enough.”
Therefore, a “good enough” model may refer to many things and is specific to your project, such as:

 A model that meets the requirements and constraints of project stakeholders.


 A model that is sufficiently skillful given the time and resources available.
 A model that is skillful as compared to naive models.
 A model that is skillful relative to other tested models.
 A model that is skillful relative to the state-of-the-art.

Some algorithms require specialized data preparation in order to best expose the structure of the
problem to the learning algorithm. Therefore, we must go one step further and consider model
selection as the process of selecting among model development pipelines.
Each pipeline may take in the same raw training dataset and outputs a model that can be
evaluated in the same manner but may require different or overlapping computational steps, such
as:

 Data filtering.
 Data transformation.
 Feature selection.
 Feature engineering.
 And more…

The best approach to model selection requires “sufficient” data, which may be nearly infinite
depending on the complexity of the problem.
In this ideal situation, we would split the data into training, validation, and test sets, then fit
candidate models on the training set, evaluate and select them on the validation set, and report the
performance of the final model on the test set.
This is impractical on most predictive modeling problems given that we rarely have sufficient data,
or are able to even judge what would be sufficient.
There are two main classes of techniques to approximate the ideal case of model selection; they are:

 Probabilistic Measures: Choose a model via in-sample error and complexity.


Probabilistic measures involve analytically scoring a candidate model using both its
performance on the training dataset and the complexity of the model.
It is known that training error is optimistically biased, and therefore is not a good
basis for choosing a model. The performance can be penalized based on how optimistic the
training error is believed to be. This is typically achieved using algorithm-specific methods,
often linear, that penalize the score based on the complexity of the model.

 Resampling Methods: Choose a model via estimated out-of-sample error.

Resampling methods seek to estimate the performance of a model (or more precisely, the model development process) on out-of-sample data.
This is achieved by splitting the training dataset into sub train and test sets, fitting a
model on the sub train set, and evaluating it on the test set. This process may then be
repeated multiple times and the mean performance across each trial is reported.
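A sketch of one such resampling method, k-fold cross-validation (fit and evaluate are assumed caller-supplied callbacks; k and the seed are illustrative):

import numpy as np

def kfold_scores(X, y, fit, evaluate, k=5, seed=0):
    # Estimate out-of-sample error: repeatedly fit on k-1 folds, test on the
    # held-out fold, and report the mean score across folds.
    idx = np.random.default_rng(seed).permutation(len(X))
    folds = np.array_split(idx, k)
    scores = []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        model = fit(X[train], y[train])
        scores.append(evaluate(model, X[test], y[test]))
    return np.mean(scores)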

Optimization:
The process of minimizing (or maximizing) any mathematical expression is called
optimization. Optimizers are algorithms or methods used to change the attributes of the neural
network such as weights and learning rate to reduce the losses. Optimizers are used to solve
optimization problems by minimizing the function.
Similarly, just as the hiker cannot see the whole mountain, it's impossible to know what your model's weights should be right from the start. But with some trial and error based on the loss function (is the hiker still descending?), you can end up getting there eventually.
How you should change the weights or learning rate of your neural network to reduce the loss is defined by the optimizer you use. Optimization algorithms are responsible for reducing the loss and for providing the most accurate results possible.
We’ll learn about different types of optimizers and how they exactly work to minimize the loss
function.
1) Gradient Descent – Covered above
2) Stochastic Gradient Descent (SGD) - Covered above
3) Mini Batch Stochastic Gradient Descent (MB-SGD) - Covered above
4) SGD with momentum

SGD with momentum overcomes the noisy-update disadvantage of plain SGD by denoising the gradients. Weight updates depend on noisy derivatives, and if we can somehow denoise the derivatives, the convergence time decreases.

5) Nesterov Accelerated Gradient (NAG)


The idea of the NAG algorithm is very similar to SGD with momentum, with a slight variant. In SGD with momentum, the gradient is computed at the current weights; in NAG, the gradient is computed at the look-ahead position, i.e., after the momentum step has already been applied.
Momentum can be a good method, but if the momentum is too high the algorithm may overshoot the local minimum and continue moving past it. The NAG algorithm was developed to resolve this issue. Both NAG and SGD with momentum work comparably well and share similar advantages and disadvantages.

6) Adaptive Gradient (AdaGrad)


For all the previously discussed algorithms the learning rate remains constant. So the key
idea of AdaGrad is to have an adaptive learning rate for each of the weights.
It performs smaller updates for parameters associated with frequently occurring features,
and larger updates for parameters associated with infrequently occurring features.
7) AdaDelta
The problem with the previous algorithm, AdaGrad, is that the effective learning rate becomes very small after a large number of iterations, which leads to slow convergence. To avoid this, the AdaDelta algorithm takes an exponentially decaying average of past squared gradients instead of accumulating all of them.
8) RMSprop
RMSprop is in fact identical to the first update vector of AdaDelta.
9) Adam
Adam can be looked at as a combination of RMSprop and Stochastic Gradient Descent with
momentum.
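As a sketch, the update rules for SGD with momentum and for Adam can be written in a few lines of NumPy (the hyperparameter values are common defaults, and the gradient here is a made-up placeholder):

import numpy as np

def sgd_momentum_step(w, v, grad, lr=0.01, beta=0.9):
    # Momentum: an exponentially averaged gradient smooths out noisy updates
    v = beta * v + grad
    return w - lr * v, v

def adam_step(w, m, s, grad, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    # Adam: momentum-style first moment + RMSprop-style second moment
    m = b1 * m + (1 - b1) * grad
    s = b2 * s + (1 - b2) * grad ** 2
    m_hat = m / (1 - b1 ** t)  # bias correction for the zero-initialized moments
    s_hat = s / (1 - b2 ** t)
    return w - lr * m_hat / (np.sqrt(s_hat) + eps), m, s

w, v = np.array([1.0, -2.0]), np.zeros(2)
g = np.array([0.3, -0.1])                 # placeholder gradient
w, v = sgd_momentum_step(w, v, g)
m, s = np.zeros(2), np.zeros(2)
w, m, s = adam_step(w, m, s, g, t=1)
print(w)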

Conditional Random Fields: Linear chain, partition function, Markov network,
Belief propagation, Training CRFs, Hidden Markov Model, Entropy.
Conditional random fields (CRFs) are a class of statistical modeling methods often applied
in pattern recognition and machine learning and used for structured prediction. Whereas a classifier
predicts a label for a single sample without considering "neighboring" samples, a CRF can take
context into account.

Linear Chain Conditional Random Fields (CRF):


We’ll treat some of the problems as sequence classification problems. We’ll use a well-known
algorithm called Conditional Random Fields (CRFs) to solve these problems.
 CRFs are a class of statistical modeling methods often applied in pattern recognition and
machine learning and used for structured prediction. CRFs fall into the sequence modeling
family. Whereas a discrete classifier predicts a label for a single sample without considering
"neighboring" samples, a CRF can take context into account; e.g., the linear chain CRF
(which is popular in natural language processing) predicts sequences of labels for sequences
of input samples.

 CRFs are a type of discriminative undirected probabilistic graphical model. They are used to encode known relationships between observations and construct consistent interpretations. They are often used for labeling or parsing of sequential data, such as natural language processing
or biological sequences and in computer vision. Specifically, CRFs find applications in POS
Tagging, shallow parsing, named entity recognition, gene finding and peptide critical
functional region finding, among other tasks, being an alternative to the related hidden
Markov models (HMMs).
To get a sense of how CRFs work, consider the sentence "I’m at home.". Now consider the
sentence "I’m at kwaak.". Based on both sentences one intuitively understands that "kwaak" is some
sort of location because we know that "home" is also a location and the words appear in the same
context.
CRFs take into account the context in which a word appears and some other features like "is the
text made up out of numbers?". More precisely: an input sequence of observed
variables X represents a sequence of observations (the words with the associated features which
make up a sentence) and Y represents a hidden (or unknown) state variable that needs to be inferred
given the observations (the labels). The Yi are structured to form a chain, with an edge between
each Y(i-1) and Yi. As well as having a simple interpretation of the Yi as "labels" for each element in
the input sequence, this layout admits efficient algorithms for:

1. model training, learning the conditional distributions between the Yi and feature functions
from some corpus of training data.

2. decoding, determining the probability of a given label sequence Y given X.

3. inference, determining the most likely label sequence Y given X.

Linear chain conditional random fields (LC-CRFs) are a type of graphical model that is used in machine learning to model sequential data. They are a variant of conditional random fields (CRFs), which are a type of discriminative probabilistic model used for structured prediction. LC-CRFs are used to model sequential data where the output is structured as a sequence, as in natural language processing (NLP), speech recognition, and computer vision.
LC-CRFs are similar to hidden Markov models (HMMs), which are another type of graphical
model used for sequential data. However, LC-CRFs are more flexible than HMMs because they
allow for more complex features to be used in the model. In addition, LC-CRFs can be trained
using maximum likelihood estimation (MLE) or maximum a posteriori (MAP) estimation.
LC-CRFs are used in many applications, including part-of-speech tagging, named entity
recognition, chunking, and segmentation. They have also been used in deep learning models,
such as recurrent neural networks (RNNs) and convolutional neural networks (CNNs), to
model sequential data.

partition function:

In the context of deep learning and conditional random fields (CRFs), the partition function
plays a crucial role in modeling and training sequential or structured prediction tasks. CRFs are a
type of probabilistic graphical model used for tasks like named entity recognition, part-of-speech
tagging, and other sequence labeling tasks.

The partition function in a CRF is also known as the normalization constant or the
marginalization term. It is a sum (or integral in the continuous case) of all possible labelings or
assignments to the variables in the CRF, and it ensures that the probabilities assigned by the CRF
model to all possible labelings sum up to 1.

Mathematically, for a sequence of random variables (nodes) X = {X_1, X_2, ..., X_n} and a set of potential functions φ_i, the partition function Z is defined as:

Z = ∑_X exp(∑_i φ_i(X))

where the outer sum runs over every possible assignment (labeling) of X.


In the context of deep learning and CRFs, neural networks are often used to model the potential functions φ. The potential functions capture the compatibility between the observed input features and the labels assigned to each element of the sequence. For example, in part-of-speech tagging, these potential functions would capture the likelihood of a word having a specific part-of-speech tag given its context.

The partition function Z is used during both training and inference. During training, it is
used to compute the likelihood of the observed data given the model parameters. Inference, which is
the process of predicting the most likely label sequence for a given input, often involves computing
the conditional probabilities of labels given the input features, and the partition function is used to
normalize these probabilities.

In practice, calculating the partition function can be computationally expensive, especially when dealing with long sequences or complex models. There are various techniques to approximate the partition function, such as sampling methods; for linear-chain CRFs it can be computed exactly with specialized algorithms like the forward-backward algorithm.
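A sketch of the forward algorithm for computing log Z of a linear-chain CRF, checked against brute-force enumeration on a tiny chain (the emission and transition scores are random placeholders):

import numpy as np

def log_partition(emissions, transitions):
    # Forward algorithm: computes log Z in O(n * K^2) instead of summing over
    # all K^n label sequences.
    # emissions: (n, K) unary scores; transitions: (K, K) pairwise scores.
    n, K = emissions.shape
    alpha = emissions[0]                  # log-scores for the first position
    for t in range(1, n):
        # logsumexp over the previous label for every current label
        scores = alpha[:, None] + transitions + emissions[t][None, :]
        m = scores.max(axis=0)
        alpha = m + np.log(np.exp(scores - m).sum(axis=0))
    m = alpha.max()
    return m + np.log(np.exp(alpha - m).sum())

# Sanity check against brute-force enumeration on a 3-step, 2-label chain
rng = np.random.default_rng(0)
em, tr = rng.normal(size=(3, 2)), rng.normal(size=(2, 2))
brute = np.log(sum(np.exp(em[0, a] + tr[a, b] + em[1, b] + tr[b, c] + em[2, c])
                   for a in range(2) for b in range(2) for c in range(2)))
assert np.isclose(log_partition(em, tr), brute)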

Overall, the partition function is a fundamental component of CRFs in deep learning,
ensuring that the model assigns valid probabilities to different label sequences and allowing for
effective training and inference in structured prediction tasks.

Markov network:
A Markov network, also known as a Markov random field or undirected graphical model, is a type of
probabilistic graphical model used in machine learning and deep learning for modeling complex, high-
dimensional probability distributions. Markov networks are particularly useful for capturing dependencies
between variables in a probabilistic way.

In the context of deep learning and conditional random fields (CRFs), Markov networks are often
used as a component of models for tasks like structured prediction, sequence labeling, and other tasks that
involve modeling dependencies between variables. Here's how Markov networks relate to conditional
random fields in deep learning:

1) Markov Random Fields (MRFs): A Markov network is a graphical representation of a joint probability distribution over a set of variables. In the context of deep learning, these variables can represent various aspects of a problem, such as pixels in an image, words in a sentence, or labels in a sequence. Each node in the graph represents a variable, and edges represent dependencies between variables.
2) Conditional Random Fields (CRFs): Conditional Random Fields are a specific type of Markov
network that is often used in structured prediction tasks. CRFs are used when you want to model the
conditional probability distribution of a set of variables given another set of variables. In other
words, they are used for modeling the conditional dependencies between output variables (e.g.,
labels or predictions) given input variables (e.g., features or observations).
3) Inference and Learning: In deep learning, you can use Markov networks or CRFs for various tasks,
such as image segmentation, natural language processing (e.g., named entity recognition or part-of-
speech tagging), and many other structured prediction problems. Learning and inference in these
models involve finding the most likely configuration of the variables given the observed data, which
often requires optimization techniques like belief propagation, gradient-based methods, or other
specialized algorithms.
4) Features and Potentials: In CRFs, you define features or potential functions that capture the
compatibility between variables. These features are typically defined based on the task at hand and
can be learned from data. The CRF combines these features with the observed data to compute the
conditional probabilities.
5) Deep Learning Integration: In the context of deep learning, neural networks are often used to
compute or parameterize the features or potentials in CRFs. This integration allows deep learning
models to capture complex, non-linear dependencies in the data and use them within the structured
prediction framework of CRFs.

Markov networks and CRFs are just one aspect of deep learning for structured data. They are useful for
modeling tasks where the output variables exhibit structured dependencies, and they can be integrated with
deep neural networks to leverage their representation learning capabilities. These models are often used in
tasks like semantic segmentation, named entity recognition, and sequence labeling, among others.

Belief propagation:
Belief Propagation (BP) is a message-passing algorithm commonly used in graphical models and
probabilistic graphical models, such as Bayesian networks and Markov random fields, to perform inference
and make predictions. In the context of deep learning, BP can also be applied to Conditional Random Fields
(CRFs) or Conditional Network Fields (CNFs), which are used for tasks like structured prediction, sequence
labeling, and image segmentation. Here's a brief overview of belief propagation in the context of CNFs in
deep learning:

1) Conditional Network Fields (CNFs):

CNFs are a type of probabilistic graphical model that extends the concept of Conditional Random
Fields (CRFs). In CNFs, you model dependencies between random variables in a structured way, while
taking into account the conditional dependencies given the observed data. This makes CNFs well-suited for
tasks where the output is not just a collection of independent labels but has structured relationships.

2) Belief Propagation in CNFs:


 In CNFs, belief propagation can be used to perform probabilistic inference. The goal is to compute
the posterior distribution over the structured output variables given the input data.
 BP in CNFs is an iterative message-passing algorithm that updates the beliefs (probability
distributions) of each variable in the network based on the beliefs of their neighboring variables.
 There are two types of messages exchanged during BP:
 Forward messages: These messages are sent from neighboring variables to the target
variable and represent information about how the target variable depends on its neighbors
given the observed data.
 Backward messages: These messages are sent from the target variable back to its neighbors,
providing information about how the neighbors should adjust their beliefs based on the
target variable's beliefs.
 BP proceeds in a series of message-passing iterations until convergence. At convergence, you have
approximated the posterior distribution over the output variables.

3) Use Cases in Deep Learning:


 CNFs with BP are commonly used in deep learning for tasks such as semantic image segmentation,
part-of-speech tagging, and named entity recognition, among others.
 These tasks often require modeling dependencies between neighboring pixels or tokens in an input
data sequence, and BP in CNFs provides a principled way to do this.

4) Learning in CNFs:
 In the context of deep learning, CNFs can be used as part of a larger model, and they can be learned
from data along with other neural network components.
 Learning in CNFs can involve training the model's parameters to maximize the likelihood of the
observed data, often through methods like maximum likelihood estimation or structured loss
functions.

Belief propagation in Conditional Network Fields is a powerful technique for structured prediction tasks
that involve modeling dependencies between variables. It leverages probabilistic graphical modeling and
message-passing algorithms to make predictions while considering the conditional dependencies in the data.
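A minimal sketch of sum-product belief propagation on a chain-structured model (the toy potentials are made up; real systems work in log space and on more general graphs):

import numpy as np

def chain_marginals(unary, pairwise):
    # Sum-product belief propagation on a chain MRF (probability space, toy scale).
    # unary: (n, K) node potentials; pairwise: (K, K) edge potentials.
    n, K = unary.shape
    fwd = np.ones((n, K))   # messages passed left-to-right
    bwd = np.ones((n, K))   # messages passed right-to-left
    for t in range(1, n):
        fwd[t] = (fwd[t - 1] * unary[t - 1]) @ pairwise
    for t in range(n - 2, -1, -1):
        bwd[t] = pairwise @ (bwd[t + 1] * unary[t + 1])
    beliefs = fwd * unary * bwd
    return beliefs / beliefs.sum(axis=1, keepdims=True)  # normalized marginals

unary = np.array([[0.9, 0.1], [0.4, 0.6], [0.2, 0.8]])
pairwise = np.array([[0.8, 0.2], [0.2, 0.8]])  # neighbors prefer equal labels
print(chain_marginals(unary, pairwise))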

Training CRFs:
Training Conditional Random Fields (CRFs) in the context of deep learning involves combining the
advantages of deep neural networks with the structured prediction capabilities of CRFs. CRFs are often used
for tasks like sequence labeling, image segmentation, and other structured output prediction tasks. Here's an
overview of how to train CRFs in deep learning:

1) Data Preparation:
 First, you need to prepare your data for training. This typically involves creating labeled datasets where each input has a corresponding structured output. For example, in the case of part-of-speech tagging, you would have sentences with labeled part-of-speech tags for each word.

2) Feature Extraction:
 For each input, you need to extract features that can be used by your CRF model. These features
can be based on the input data and can be designed to capture relevant information for the
structured prediction task. In the context of deep learning, you might also consider using neural
network embeddings as features.
3) Design the CRF Model:
 The CRF model is designed to capture dependencies between the structured output variables.
You can define a CRF in terms of the following components:
 Unary Potentials: These represent the compatibility between each output label and the
input features. In deep learning, you can use neural networks to compute these unary
potentials.
 Pairwise Potentials: These represent the dependencies between adjacent output labels.
They can also be modeled using neural networks or other learned functions.
 The CRF model can be represented as a graphical model, where nodes represent the output
variables and edges represent pairwise dependencies.

4) Objective Function:
 Define an objective function that measures the difference between the predicted structured output
and the ground truth. Common objective functions include the negative log-likelihood or structured
loss functions like the structured perceptron loss or structured hinge loss.

5) Training:
 Train the CRF model using your labeled dataset and the defined objective function. This can be done
using techniques like stochastic gradient descent (SGD) or other optimization methods.
 Backpropagation is used to compute gradients for the neural network components of the CRF. The
gradients are then used to update the model's parameters.

6) Regularization:
 To prevent overfitting, you can apply regularization techniques such as L1 or L2 regularization to
the neural network components of the CRF.

7) Inference:
 During inference, you use the trained CRF model to make predictions on new, unlabeled data. This
typically involves finding the most likely structured output given the input features. This can be done
using algorithms like Viterbi decoding or beam search.

8) Evaluation:
 Evaluate the performance of your CRF model using appropriate metrics for your task, such as
accuracy, F1-score, or intersection-over-union (IoU) for segmentation tasks.

Training CRFs in deep learning allows you to take advantage of the expressive power of neural networks
for feature extraction while benefiting from the structured prediction capabilities of CRFs. This hybrid
approach is particularly useful for tasks where modeling dependencies between output variables is crucial for
accurate predictions.
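Putting the pieces together, a hedged sketch of the training objective for one sequence, the negative log-likelihood, reusing a compact forward algorithm (scores and labels are placeholders; in a real system the emission and transition scores would come from neural networks):

import numpy as np

def log_partition(emissions, transitions):
    # Forward algorithm, as in the partition-function sketch above
    alpha = emissions[0]
    for t in range(1, len(emissions)):
        scores = alpha[:, None] + transitions + emissions[t][None, :]
        m = scores.max(axis=0)
        alpha = m + np.log(np.exp(scores - m).sum(axis=0))
    m = alpha.max()
    return m + np.log(np.exp(alpha - m).sum())

def crf_nll(emissions, transitions, labels):
    # Training objective for one sequence: NLL = log Z - score(true labels).
    # Gradients of this loss flow back into the networks producing the scores.
    score = emissions[0, labels[0]]
    for t in range(1, len(labels)):
        score += transitions[labels[t - 1], labels[t]] + emissions[t, labels[t]]
    return log_partition(emissions, transitions) - score

rng = np.random.default_rng(0)
em, tr = rng.normal(size=(4, 3)), rng.normal(size=(3, 3))
print(crf_nll(em, tr, labels=[0, 2, 1, 1]))  # the value minimized during training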

Hidden Markov Model:
Hidden Markov Models (HMMs) and Conditional Random Fields (CRFs) belong to the
family of graphical models in machine learning.

The word “Hidden” symbolizes the fact that only the symbols released by the system are observable,
while the user cannot view the underlying random walk between states. Many in this field recognize HMM
as a finite state machine.

Hidden Markov Model(HMM)

Let’s imagine that we have an English sentence like Math is the language of nature. We
want to label each word with its part of speech. That way, we get a graph:

Let’s see what each color and arrow represent. The sentence is our observed data (shown in
gray circles) and the labels are the hidden information we want to extract (shown in white
circles). In addition, each word depends on its label and each label depends on the previous
label.
This graph represents a first-order hidden Markov model which belongs to the family
of Bayesian networks. The reason why we call it “first-order” is that each hidden variable
depends only on the previous one. We can extend it to a higher-order model by conditioning the
hidden variables on more than one predecessor.
The general first-order HMM graph for any sequence is similar to that of our example:

We can assign probabilities to each arrow and model the occurrence of such a sequence using the product of these probabilities:

P(X, Y) = ∏_{i=1}^{n} P(x_i | y_i) P(y_i | y_{i-1})

where X = (x_1, ..., x_n) is the sequence and Y = (y_1, ..., y_n) is the array of its labels, and P(y_1 | y_0) is taken to be the initial state probability P(y_1). The P(x_i | y_i) are emission probabilities and the P(y_i | y_{i-1}) are transition probabilities. The former is the set of probabilities for each observation given its corresponding state, while the latter is the probability of a state given its preceding state. Since an HMM models the joint probability of X and Y, it is a generative model.

Example:

Let’s imagine we want to tag the words in the sentence I love programming. From our
training data, we can learn the emission and transition probabilities by counting. For instance,
let’s imagine that the training data consists of three sentences:

From there, we calculate the emission probabilities:

and the transition probabilities:

To find the optimal labels, we can go over all the combinations of states and observations to
find the sequence that has the maximum probability. For our example, the best sequence is:

Given that our example was a small one with only a couple of words and parts of speech, it’s
easy to enumerate all the possible combinations. However, this can become extremely difficult
to do in more complex and longer sentences. For them, we use inference algorithms such
as Viterbi or Forward-backward.
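A sketch of the Viterbi algorithm for recovering the most likely state sequence (the two-state tagger and its probabilities are made up for illustration):

import numpy as np

def viterbi(obs, start_p, trans_p, emit_p):
    # Most likely state sequence for an HMM (log space to avoid underflow).
    # obs: observation indices; start_p: (K,); trans_p: (K, K); emit_p: (K, V).
    delta = np.log(start_p) + np.log(emit_p[:, obs[0]])
    back = []
    for o in obs[1:]:
        scores = delta[:, None] + np.log(trans_p)   # (previous state, next state)
        back.append(scores.argmax(axis=0))
        delta = scores.max(axis=0) + np.log(emit_p[:, o])
    # Trace the best path backwards through the stored pointers
    path = [int(delta.argmax())]
    for bp in reversed(back):
        path.append(int(bp[path[-1]]))
    return path[::-1]

# Hypothetical 2-state tagger with a 3-word vocabulary
start = np.array([0.6, 0.4])
trans = np.array([[0.7, 0.3], [0.4, 0.6]])
emit = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])
print(viterbi([0, 1, 2], start, trans, emit))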
Advantages of HMM

HMM has a strong statistical foundation with efficient learning algorithms where learning can take
place directly from raw sequence data. It allows consistent treatment of insertion and deletion penalties in the
form of locally learnable methods and can handle inputs of variable length. They are the most flexible
generalization of sequence profiles. It can also perform a wide variety of operations including multiple
alignment, data mining and classification, structural analysis, and pattern discovery. It is also easy to
combine into libraries.

Disadvantages of HMM

 HMM assumes each observation depends only on its current state:

Sequence labeling, in addition to having a relationship with individual words, also relates to aspects such as the observed sequence length and the word context, which an HMM cannot directly capture.

 The objective that is modeled does not match the prediction target:
HMM learns the joint distribution P(Y, X) of the states and the observed sequence, while at estimation (prediction) time we actually need the conditional probability P(Y|X).

Entropy:
Entropy in conditional random fields (CRFs) is often used as a measure of uncertainty or disorder in
the context of deep learning and natural language processing. CRFs are a type of probabilistic graphical
model commonly used for sequence labeling tasks such as part-of-speech tagging, named entity recognition,
and more.

Entropy in CRFs can be used in different ways:

1) Model Training: During training, the entropy of a CRF's predictions can be used as a regularization
term. By encouraging the model to have lower entropy in its predictions, it can be guided towards
making more confident and accurate predictions.
2) Inference: When making predictions on new data, the entropy of the model's output can be used to
assess the uncertainty of the predictions. High entropy implies uncertainty, while low entropy
implies confidence.
3) Evaluation: Entropy can be used to evaluate the quality of a CRF model. Lower entropy on the test
data typically indicates better model performance.

In the context of deep learning, CRFs are often used as a structured output layer in neural networks, and
the entropy is computed over the conditional distribution of labels given the input features. This can be
helpful in various applications, especially when dealing with tasks that require handling complex
dependencies and structured output.
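Concretely, the entropy of a predicted label distribution can be computed as follows (a NumPy sketch with made-up distributions):

import numpy as np

def prediction_entropy(probs, eps=1e-12):
    # Shannon entropy H(p) = -sum p log p of a predicted label distribution.
    # Low entropy = confident prediction; high entropy = uncertain prediction.
    p = np.clip(probs, eps, 1.0)
    return -np.sum(p * np.log(p), axis=-1)

confident = np.array([0.97, 0.02, 0.01])
uncertain = np.array([0.34, 0.33, 0.33])
print(prediction_entropy(confident))  # ~ 0.15 nats
print(prediction_entropy(uncertain))  # ~ 1.10 nats (near log 3)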

Overall, entropy in CRFs serves as a useful tool for managing uncertainty and enhancing the
performance of deep learning models, particularly in sequence labeling tasks.

