ADVANCED LEARNING
REINFORCEMENT LEARNING:
There are mainly three ways to implement reinforcement learning in ML, which are:
1. Value-based:
The value-based approach aims to find the optimal value function, which gives the maximum value achievable at a state under any policy. With it, the agent expects the long-term return at any state(s) under policy π.
2. Policy-based:
The policy-based approach is to find the optimal policy for the maximum future rewards without using the value function. In this approach, the agent tries to apply such a policy that the action performed at each step helps to maximize the future reward.
The policy-based approach has mainly two types of policy:
○ Deterministic: the same action is produced by the policy (π) at any state.
○ Stochastic: the action produced by the policy is determined by a probability distribution.
3. Model-based:
In the model-based approach, a virtual model of the environment is created, and the agent learns by exploring that model.
To understand the working process of RL, we need to consider two main things:
○ Environment: It can be anything such as a room, maze, football ground, etc.
○ Agent: An intelligent agent, such as an AI robot.
Let's take an example of a maze environment that the agent needs to explore. Consider the below image:
In the above image, the agent is at the very first block of the maze. The maze consists of an S6 block, which is a wall, an S8 block, which is a fire pit, and an S4 block, which contains a diamond. The agent cannot cross the S6 block, as it is a solid wall. If the agent reaches the S4 block, it gets a +1 reward; if it reaches the fire pit, it gets a -1 reward. It can take four actions: move up, move down, move left, and move right.
The agent can take any path to reach the final point, but it needs to do so in the fewest possible steps. Suppose the agent follows the path S9-S5-S1-S2-S3; it will then get the +1 reward point.
The agent will try to remember the preceding steps that it has taken to reach the final step. To memorize the steps, it assigns a value of 1 to each previous step. Consider the below step:
Now, the agent has successfully stored the previous steps by assigning the value 1 to each previous block. But what will the agent do if it starts moving from a block which has a 1-valued block on both sides? Consider the below diagram:
It will be a difficult condition for the agent to decide whether it should go up or down, as each block has the same value. So, the above approach is not suitable for the agent to reach the destination. Hence, to solve the problem, we will use the Bellman equation, which is the main concept behind reinforcement learning.
○ The reward/feedback obtained for each good and bad action is "R," the action performed by the agent is "a," the current state is "s," and the next state is "s'."
The Bellman equation can be written as:
V(s) = max [R(s,a) + γV(s')]
Where,
V(s) = value at the current state
R(s,a) = reward for taking action a at state s
V(s') = value at the next state
γ = Discount factor
In the above equation, we take the maximum over the possible values because the agent always tries to find the optimal solution.
So now, using the Bellman equation, we will find the value at each state of the given environment. We will start from the block which is next to the target block.
V(s3) = max [R(s,a) + γV(s')]; here V(s') = 0 because there is no further state to move to, and R(s,a) = 1 for reaching the diamond, so V(s3) = 1.
V(s2) = max [R(s,a) + γV(s')]; here γ = 0.9 (say), V(s') = 1, and R(s,a) = 0 because there is no reward at this state, so V(s2) = 0.9.
V(s1) = max [R(s,a) + γV(s')]; here γ = 0.9, V(s') = 0.9, and R(s,a) = 0 because there is no reward at this state either, so V(s1) = 0.81.
V(s5) = max [R(s,a) + γV(s')]; here γ = 0.9, V(s') = 0.81, and R(s,a) = 0, so V(s5) ≈ 0.73.
V(s9) = max [R(s,a) + γV(s')]; here γ = 0.9, V(s') = 0.73, and R(s,a) = 0, so V(s9) ≈ 0.66.
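The chain of calculations above can be reproduced in a few lines of Python. The sketch below is only an illustration: the state names, the +1 reward on the final move, and the discount factor of 0.9 come from the maze example, while the deterministic path and loop structure are assumptions made for brevity.

```python
# Value backup along the maze path S9 -> S5 -> S1 -> S2 -> S3 -> S4 (diamond).
# Deterministic moves; the only reward (+1) is for the final move into S4.
gamma = 0.9                         # discount factor, as in the text
path = ["s9", "s5", "s1", "s2", "s3"]
values = {}

# Work backwards from the state next to the target, applying
# V(s) = max_a [R(s, a) + gamma * V(s')] along the path.
next_value = 0.0                    # V(s') beyond s3 is 0: the episode ends at the diamond
for state in reversed(path):
    reward = 1.0 if state == "s3" else 0.0   # +1 only when moving into the diamond block
    values[state] = reward + gamma * next_value
    next_value = values[state]

print(values)
# {'s3': 1.0, 's2': 0.9, 's1': 0.81, 's5': 0.729, 's9': 0.6561}
```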
There are mainly two types of reinforcement:
○ Positive Reinforcement
○ Negative Reinforcement
Positive Reinforcement:
Positive reinforcement means adding something to increase the tendency that the expected behaviour will occur again. This type of reinforcement can sustain the changes for a long time, but too much
positive reinforcement may lead to an overload of states that can reduce the
consequences.
Negative Reinforcement:
Negative reinforcement is the opposite of positive reinforcement, as it increases the tendency that the specific behaviour will occur again by avoiding a negative condition.
○ Q-Learning:
○ Q-learning is an off-policy RL algorithm. It learns the value function Q(s, a), which tells how good it is to take action "a" at a particular state "s."
○ SARSA (State Action Reward State Action):
○ SARSA is an on-policy algorithm: the new action and reward are selected using the same policy which determined the original action.
○ SARSA is named because it uses the quintuple Q(s, a, r, s', a'). Where,
s: original state
a: Original action
r: reward observed while following the states
s' and a': New state, action pair.
Q-Learning Explanation:
○ The main objective of Q-learning is to learn the policy which can inform
the agent what actions should be taken for maximizing the reward under
what circumstances.
○ The value of Q-learning can be derived from the Bellman equation. Consider the Bellman equation given below:
V(s) = max [R(s,a) + γ Σs' P(s,a,s') V(s')]
In the equation, we have various components, including the reward, the discount factor (γ), the transition probability, and the end state s'. But no Q-value is given yet, so first consider the below image:
In the above image, we can see there is an agent who has three value options, V(s1), V(s2), and V(s3). As this is an MDP, the agent only cares about the current state and the future state. The agent can go in any direction (up, left, or right), so it needs to decide where to go for the optimal path. Here the agent will move on a probability basis and change the state. But if we want some exact moves, we need to make some changes in terms of the Q-value. Consider the below image:
Q represents the quality of the actions at each state. So instead of using a value at each state, we will use a pair of state and action, i.e., Q(s, a). The Q-value specifies which action is more lucrative than the others, and according to the best Q-value, the agent takes its next move. The Bellman equation can be used for deriving the Q-value.
On performing an action, the agent will get a reward R(s, a), and it will end up in a certain state, so the Q-value equation will be:
Q(s,a) = R(s,a) + γ Σs' P(s,a,s') V(s')
Q-table:
A Q-table or matrix is created while performing the Q-learning. The table follows the
state and action pair, i.e., [s, a], and initializes the values to zero. After each action,
the table is updated, and the q-values are stored within the table.
The RL agent uses this Q-table as a reference table to select the best action based on
the q-values.
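As a rough illustration of how such a Q-table is filled in, here is a minimal Q-learning loop in Python. The environment is a made-up one-dimensional corridor (not the maze from the text), and the learning rate, discount factor, exploration rate, and episode count are all arbitrary choices for the sketch.

```python
import random

# Toy corridor: states 0..4, reaching state 4 gives a reward of +1.
# All hyper-parameters below are illustrative assumptions.
n_states, actions = 5, [-1, +1]                 # actions: move left / move right
alpha, gamma, epsilon = 0.5, 0.9, 0.1
Q = {(s, a): 0.0 for s in range(n_states) for a in actions}   # Q-table initialised to zero

for _ in range(500):                            # training episodes
    s = 0
    while s != n_states - 1:
        # epsilon-greedy action selection from the Q-table
        if random.random() < epsilon:
            a = random.choice(actions)
        else:
            a = max(actions, key=lambda act: Q[(s, act)])
        s_next = min(max(s + a, 0), n_states - 1)
        r = 1.0 if s_next == n_states - 1 else 0.0
        # Q-learning update: Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
        best_next = max(Q[(s_next, act)] for act in actions)
        Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
        s = s_next

# The filled-in table now acts as the agent's reference for picking actions.
print(max(actions, key=lambda act: Q[(0, act)]))   # best first move from state 0
```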
Reinforcement learning is applied across many domains, including:
1. Robotics:
2. Control:
3. Game Playing:
4. Chemistry:
5. Business:
6. Manufacturing:
a. In various automobile manufacturing companies, the robots use deep
reinforcement learning to pick goods and put them in some containers.
7. Finance Sector:
REPRESENTATION LEARNING:
In representation learning, data is sent into the machine, and it learns the
representation on its own. It is a way of determining a data representation of the
features, the distance function, and the similarity function that determines how the
predictive model will perform. Representation learning works by reducing
high-dimensional data to low-dimensional data, making it easier to discover patterns
and anomalies while also providing a better understanding of the data’s overall
behaviour.
Basically, machine learning tasks such as classification frequently demand input that is mathematically and computationally convenient to process, which motivates representation learning. Real-world data such as photos, video, and sensor data has resisted attempts to define its relevant qualities algorithmically. An alternative is to examine the data for such traits or representations rather than depending on explicit techniques.
Supervised Learning
This is referred to as supervised learning when the ML or DL model maps the input
X to the output Y. The computer tries to correct itself by comparing model output to
ground truth, and the learning process optimizes the mapping from input to output.
This process is repeated until the optimization function reaches global minima.
Even when the optimization function reaches the global minimum, the model does not always perform well on new data, resulting in overfitting. While supervised learning does not
necessitate a significant amount of data to learn the mapping from input to output, it
does necessitate the learned features. The prediction accuracy can improve by up to
17 percent when the learned attributes are incorporated into the supervised learning
algorithm.
Using labelled input data, features are learned in supervised feature learning.
Supervised neural networks, multilayer perceptrons, and (supervised) dictionary
learning are some examples.
Unsupervised Learning
Unsupervised learning is a sort of machine learning in which the labels are ignored in
favour of the observation itself. Unsupervised learning isn’t used for classification or
regression; instead, it’s used to uncover underlying patterns, cluster data, denoise it,
detect outliers, and decompose data, among other things.
When working with data x, we must be very careful about whatever features z we use
to ensure that the patterns produced are accurate. It has been observed that having
more data does not always imply having better representations. We must be careful to
develop a model that is both flexible and expressive so that the extracted features can
convey critical information.
Unsupervised feature learning learns features from unlabelled input data; dictionary learning, independent component analysis, autoencoders, matrix factorization, and various forms of clustering are among the examples of such methods.
In the next section, we will see more about these methods and workflow, how they
learn the representation in detail.
Supervised Methods
Multi-Layer Perceptron
The perceptron is the most basic neural unit, consisting of a series of inputs and weights whose output is compared to the ground truth. A multi-layer perceptron, or MLP, is a feed-forward neural network made up of layers of perceptron units. An MLP is made up of three node layers: an input layer, a hidden layer, and an output layer. The MLP is commonly referred to as the vanilla neural network because it is a very basic artificial neural network.
This notion serves as a foundation for hidden variables and representation learning. Our goal here is to determine the variables or required weights that can represent the underlying distribution of the entire data, so that when we apply those weights to unknown data, we receive results that are almost identical to the original data. In a word, artificial neural networks (ANNs) assist us in extracting meaningful patterns from a dataset.
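As a concrete illustration of the three-layer structure described above, the short sketch below fits a small MLP with scikit-learn. The dataset, hidden-layer size, and other settings are arbitrary choices for demonstration, not anything prescribed by the text.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Toy non-linear dataset; all sizes and hyper-parameters are illustrative only.
X, y = make_moons(n_samples=500, noise=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Input layer -> one hidden layer of 16 units -> output layer: a vanilla MLP.
mlp = MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000, random_state=0)
mlp.fit(X_train, y_train)
print("test accuracy:", mlp.score(X_test, y_test))
```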
Neural Networks
Neural networks are a class of learning algorithms that employ a “network” of
interconnected nodes in various layers. It’s based on the animal nervous system, with
nodes resembling neurons and edges resembling synapses. The network establishes
computational rules for passing input data from the network’s input layer to the
network’s output layer, and each edge has an associated weight.
The relationship between the input and output layers, which is parameterized by the
weights, is described by a network function associated with a neural network.
Various learning tasks can be achieved by minimizing a cost function over the
network function (w) with correctly defined network functions.
Unsupervised Methods
Learning Representation from unlabeled data is referred to as unsupervised feature
learning. Unsupervised Representation learning frequently seeks to uncover
low-dimensional features that encapsulate some structure beneath the
high-dimensional input data.
K-Means Clustering
K-means clustering is a vector quantization approach. An n-vector set is divided into
k clusters (i.e. subsets) via K-means clustering, with each vector belonging to the
cluster with the closest mean. The problem is computationally NP-hard, although fast greedy heuristics are commonly used in practice.
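A minimal scikit-learn sketch of k-means clustering; the synthetic blobs and the choice of k are assumptions made purely for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with three natural groups; k is chosen to match for illustration.
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

print(kmeans.cluster_centers_)   # the learned cluster means
print(kmeans.labels_[:10])       # cluster assignment of the first ten vectors
```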
Locally Linear Embedding (LLE)
There are two major steps in LLE. The first step is "neighbour-preserving," in which each input data point Xi is reconstructed as a weighted sum of its K nearest neighbour data points, with the optimal weights determined by minimizing the average squared reconstruction error (i.e., the difference between an input point and its reconstruction) while constraining the weights associated with each point to sum to one.
The second stage involves “dimension reduction,” which entails searching for vectors
in a lower-dimensional space that reduce the representation error while still using the
optimal weights from the previous step.
The weights are optimized given fixed data in the first stage, which can be solved as
a least-squares problem. Lower-dimensional points are optimized with fixed weights
in the second phase, which can be solved using sparse eigenvalue decomposition.
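Both LLE stages are wrapped up in scikit-learn's LocallyLinearEmbedding estimator. The sketch below flattens a 3-D swiss roll to two dimensions; the dataset and the number of neighbours (playing the role of K above) are illustrative assumptions.

```python
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import LocallyLinearEmbedding

# 3-D swiss roll reduced to 2-D; n_neighbors corresponds to K in the description above.
X, _ = make_swiss_roll(n_samples=1000, random_state=0)
lle = LocallyLinearEmbedding(n_neighbors=12, n_components=2, random_state=0)
X_low = lle.fit_transform(X)

print(X_low.shape)                # (1000, 2): the lower-dimensional representation
print(lle.reconstruction_error_)  # residual error of the neighbour-preserving weights
```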
Sparse Dictionary Learning
When the number of vocabulary items exceeds the dimension of the input data,
sparse coding can be used to learn overcomplete dictionaries. K-SVD is an algorithm
for learning a dictionary of elements that allows for sparse representation.
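scikit-learn does not ship K-SVD itself, but its DictionaryLearning estimator solves a closely related sparse-coding problem and gives a feel for learning an overcomplete dictionary. The data and all sizes below are made up for the sketch.

```python
import numpy as np
from sklearn.decomposition import DictionaryLearning

# 200 samples of 10-dimensional data and a 30-atom dictionary (overcomplete: 30 > 10).
rng = np.random.RandomState(0)
X = rng.randn(200, 10)

dico = DictionaryLearning(n_components=30, transform_algorithm="omp",
                          transform_n_nonzero_coefs=3, random_state=0)
codes = dico.fit_transform(X)            # sparse codes: each sample uses at most 3 atoms

print(dico.components_.shape)            # (30, 10): the learned dictionary atoms
print((codes != 0).sum(axis=1).max())    # sparsity check: no more than 3 non-zeros per row
```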
Autoencoders
An autoencoder is a neural network trained to reconstruct its own input through a lower-dimensional bottleneck; the bottleneck activations then serve as a learned representation of the data. Deep network representations of this kind have been found to be insensitive to complex noise or data conflicts. This can be linked to the architecture to some extent: the use of convolutional layers and max-pooling, for example, can be shown to produce transformation insensitivity.
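Since the paragraph above only touches on autoencoders, here is a minimal PyTorch sketch of one: an encoder compresses the input to a low-dimensional code and a decoder reconstructs it, with the network trained to minimise reconstruction error. The random data, layer sizes, and training settings are assumptions chosen for brevity.

```python
import torch
from torch import nn

# Toy data: 256 samples with 20 features (a stand-in for real high-dimensional data).
X = torch.randn(256, 20)

# Encoder squeezes 20 -> 3 dimensions; decoder reconstructs 3 -> 20.
model = nn.Sequential(
    nn.Linear(20, 8), nn.ReLU(), nn.Linear(8, 3),    # encoder
    nn.Linear(3, 8), nn.ReLU(), nn.Linear(8, 20),    # decoder
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)
loss_fn = nn.MSELoss()

for epoch in range(200):                  # minimise the reconstruction error
    optimizer.zero_grad()
    loss = loss_fn(model(X), X)
    loss.backward()
    optimizer.step()

print("final reconstruction loss:", loss.item())
```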
NEURAL NETWORKS:
An artificial neural network is typically organized in tiers (layers) of interconnected processing nodes. Each processing node has its own small sphere of knowledge, including what it has seen and any rules it was originally programmed with or developed for itself. The tiers are highly interconnected, which means each node in tier n will be connected to many nodes in tier n-1 -- its inputs -- and in tier n+1, which provides input data for those nodes. There may be one or multiple nodes in the output layer, from which the answer it produces can be read.
Artificial neural networks are notable for being adaptive, which means they modify
themselves as they learn from initial training and subsequent runs provide more
information about the world. The most basic learning model is centered on weighting
the input streams, which is how each node weights the importance of input data from
each of its predecessors. Inputs that contribute to getting right answers are weighted
higher.
Input Layer
Information from the outside world enters the artificial neural network from the input
layer. Input nodes process the data, analyze or categorize it, and pass it on to the next
layer.
Hidden Layer
Hidden layers take their input from the input layer or other hidden layers. Artificial
neural networks can have a large number of hidden layers. Each hidden layer
analyzes the output from the previous layer, processes it further, and passes it on to
the next layer.
Output Layer
The output layer gives the final result of all the data processing by the artificial
neural network. It can have single or multiple nodes. For instance, if we have a
binary (yes/no) classification problem, the output layer will have one output node,
which will give the result as 1 or 0. However, if we have a multi-class classification
problem, the output layer might consist of more than one output node.
Deep neural networks, or deep learning networks, have several hidden layers with
millions of artificial neurons linked together. A number, called weight, represents the
connections between one node and another. The weight is a positive number if one
node excites another, or negative if one node suppresses the other. Nodes with higher
weight values have more influence on the other nodes.
Theoretically, deep neural networks can map any input type to any output type.
However, they also need much more training as compared to other machine learning
methods. They need millions of examples of training data rather than perhaps the
hundreds or thousands that a simpler network might need.
Artificial neural networks can be categorized by how the data flows from the input
node to the output node. Below are some examples:
Feedforward neural networks process data in one direction, from the input node to
the output node. Every node in one layer is connected to every node in the next layer.
During training, a feedforward network uses a feedback process (backpropagation) to improve its predictions over time.
Backpropagation algorithm
1. Each node makes a guess about the next node in the path.
2. It checks if the guess was correct. Nodes assign higher weight values to paths
that lead to more correct guesses and lower weight values to node paths that
lead to incorrect guesses.
3. For the next data point, the nodes make a new prediction using the higher
weight paths and then repeat Step 1.
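The numbered steps above are a loose, node-level description of backpropagation. Below is a compact NumPy sketch of the actual procedure for a one-hidden-layer network on a toy XOR problem; the layer sizes, learning rate, and iteration count are illustrative assumptions.

```python
import numpy as np

# Toy dataset: XOR, a classic problem a network without a hidden layer cannot solve.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(2, 4)), np.zeros(4)   # input -> hidden weights and biases
W2, b2 = rng.normal(size=(4, 1)), np.zeros(1)   # hidden -> output weights and biases
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
lr = 1.0                                         # learning rate (illustrative)

for step in range(10000):
    # Forward pass
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)
    # Backward pass: propagate the output error back through the layers
    d_out = (out - y) * out * (1 - out)          # error signal at the output layer
    d_h = (d_out @ W2.T) * h * (1 - h)           # error signal at the hidden layer
    # Weight updates: connections that contributed to the error are corrected
    W2 -= lr * h.T @ d_out;  b2 -= lr * d_out.sum(axis=0)
    W1 -= lr * X.T @ d_h;    b1 -= lr * d_h.sum(axis=0)

print(out.round(2).ravel())                      # typically approaches [0, 1, 1, 0]
```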
Advantages of artificial neural networks include:
● Parallel processing abilities mean the network can perform more than one job at a time.
● Fault tolerance means the corruption of one or more cells of the ANN will
not stop the generation of output.
● Gradual corruption means the network will slowly degrade over time,
instead of a problem destroying the network instantly.
● The ability to produce output with incomplete knowledge with the loss of
performance being based on how important the missing information is.
● No restrictions are placed on the input variables, such as how they should
be distributed.
● Machine learning means the ANN can learn from events and make
decisions based on the observations.
Disadvantages of artificial neural networks include:
● The lack of rules for determining the proper network structure means the appropriate artificial neural network architecture can only be found through trial and error and experience.
Image recognition was one of the first areas to which neural networks were successfully applied, but the technology's uses have expanded to many more areas, including:
● Chatbots
These are just a few specific areas to which neural networks are being applied today.
Prime uses involve any process that operates according to strict rules or patterns and
has large amounts of data. If the data involved is too large for a human to make sense
of in a reasonable amount of time, the process is likely a prime candidate for
automation through artificial neural networks.
ACTIVE LEARNING:
Active learning is the name used for the process of prioritising the data which needs to be labelled in order to have the highest impact on training a supervised model.
Active learning can be used in situations where the amount of data is too large to be labelled and some priority needs to be made to label the data in a smart way.
But, why don’t we just choose a random subset of data to manually label them?
Let's look at a very simple example to motivate the discussion. Assume we have millions of data points which need to be classified based on two features. The actual distribution of the two classes is shown in the plot below.
As one can see, both classes (red and purple) can quite nicely be separated by a vertical blue line crossing at 0. The problem is that none of the data points are labelled. Unfortunately, we don't have enough time to label all of the data, so we randomly choose a subset of the data to label and train a binary classification model on it. The result is not great, as the model prediction deviates quite a lot from the optimal boundary.
This is where active learning can be used to optimise the data points chosen for labelling and to train a model based on them. The following plot shows an example of a binary classification model trained on a subset of data points chosen through active learning.
Making a smart choice of which data points to prioritise when labelling can save data scientists a considerable amount of labelling time.
There are multiple approaches studied in the literature on how to prioritise data points when labelling and how to iterate over the approach. We will nevertheless only present the most common and straightforward workflow here:
1. The first step is to manually label a small subsample of the data.
2. Once there is a small amount of labelled data, a model can be trained on it. The model is of course not going to be great but will help us get some insight on which areas of the parameter space need to be labelled first.
3. After the model is trained, the model is used to predict the class of each remaining unlabelled data point.
4. A priority score is assigned to each unlabelled data point based on the predictions of the model. In the next subsection we will present some of the possible ways to define this score.
5. Once the best approach has been chosen to prioritise the labelling, this process can be iteratively repeated: a new model can be trained on a new labelled data set, which has been labelled based on the priority score. Once the new model has been trained on the subset of data, the unlabelled data points can be run through the model to update the prioritisation scores to continue labelling. In this way, one can keep optimising the labelling strategy.
Prioritisation scores
There are several approaches to assign a priority score to each data point. Below we describe some of the most commonly used ones.
Least confidence:
This is probably the most simple method. It takes the highest predicted probability for each data point, max_c P(y = c | x), and sorts the data points from smaller to larger values of this maximum.
Let's use an example to see how this would work. Assume we have the predicted class probabilities of Table 1 for four data points. The first step is to take the maximum probability for each data point, hence:
● X1: 0.9
● X2: 0.87
● X3: 0.5
● X4: 0.99
The second step is to sort the data based on this maximum probability (from smaller to larger); hence the data points would be labelled in the order X3, X2, X1 and X4.
Margin sampling:
This method takes into account the difference between the highest probability and the second highest probability. Formally, the score to prioritise by would be the margin P(c1 | x) - P(c2 | x), where c1 and c2 are the most likely and second most likely classes.
The data points with the lower margin sampling score would be the ones labelled first; these are the data points the model is least certain about between the two most likely classes.
Following the example of Table 1, the corresponding scores can be computed for each data point.
Hence the data points would be shown to label as follows: X3, X1, X2 and X4. As
one can see the priority in this case is slightly different to the least confident one.
Entropy:
Finally, the last scoring function that we are going to present here is the entropy score. Entropy is a concept borrowed from thermodynamics, where it measures the amount of disorder in a system, for example a gas enclosed in a box. The higher the entropy, the more disorder there is, whereas if the entropy is low, it means that the gas might be mainly in one particular area, such as a corner of the box (maybe when the experiment started, before expanding across the box).
This concept can be reused to measure the certainty of a model. If a model is highly certain about a given data point, it will assign a high probability to one particular class, whereas all the other classes will have low probability. Isn't this very similar to having the gas in the corner of a box? In this case most of the probability is assigned to a particular class. High entropy, by contrast, would mean that the model distributes the probability equally across all classes, as it is not certain at all which class that data point belongs to, similarly to having the gas distributed equally in all parts of the box. It is therefore straightforward to prioritise data points by labelling first those with the highest entropy of the predicted class probabilities, H = -Σc P(c | x) log P(c | x).
Note that for X4, the probability of 0 should be changed to a small epsilon (e.g. 0.00001) for numerical stability.
In this case the data points should be shown in the following order: X3, X2, X1 and
X4, which coincides with the order of the least confident scoring method.
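The three scoring rules can be written in a few lines of NumPy. Because the full class probabilities of Table 1 are not reproduced above, the matrix below uses made-up distributions whose row maxima match the quoted values (0.9, 0.87, 0.5, 0.99); it is only meant to show how the three orderings are computed.

```python
import numpy as np

# Hypothetical class probabilities for X1..X4 over three classes. Only the row maxima
# come from the text; every other number is an assumption for the sketch.
probs = np.array([
    [0.90, 0.098, 0.002],   # X1
    [0.87, 0.066, 0.064],   # X2
    [0.50, 0.300, 0.200],   # X3
    [0.99, 0.010, 0.000],   # X4
])
names = np.array(["X1", "X2", "X3", "X4"])
eps = 1e-5                                   # numerical stability for the zero in X4

least_confident = probs.max(axis=1)                     # label the smallest maximum first
sorted_probs = np.sort(probs, axis=1)
margin = sorted_probs[:, -1] - sorted_probs[:, -2]      # label the smallest margin first
entropy = -(probs * np.log(probs + eps)).sum(axis=1)    # label the largest entropy first

print("least confident:", names[np.argsort(least_confident)])   # X3, X2, X1, X4
print("margin sampling:", names[np.argsort(margin)])            # X3, X1, X2, X4
print("entropy:        ", names[np.argsort(entropy)[::-1]])     # X3, X2, X1, X4
```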
ENSEMBLE LEARNING:
Common ensemble learning techniques include:
● Majority Vote
● Bagging and Random Forest
● Randomness Injection
● Feature-Selection Ensembles
● Error-Correcting Output Coding
● Boosting
● Stacking
BOOTSTRAP AGGREGATION:
Bagging, also known as bootstrap aggregation, is the ensemble learning method that
is commonly used to reduce variance within a noisy dataset. In bagging, a random
sample of data in a training set is selected with replacement—meaning that the
individual data points can be chosen more than once. After several data samples are generated, a weak model is trained independently on each sample, and depending on the type of task—regression or classification, for example—the average or the majority vote of those predictions yields a more accurate estimate.
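A quick scikit-learn sketch of bootstrap aggregation: BaggingClassifier draws bootstrap samples, trains one base learner (a decision tree by default) on each, and combines their votes. The dataset and parameter values are placeholders for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split

# Synthetic classification data; all sizes and settings are illustrative only.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 50 base learners, each fitted on a bootstrap sample drawn with replacement;
# the final prediction is the majority vote of the individual learners.
bag = BaggingClassifier(n_estimators=50, bootstrap=True, random_state=0)
bag.fit(X_train, y_train)
print("test accuracy:", bag.score(X_test, y_test))
```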
Ensemble learning
Ensemble learning gives credence to the idea of the “wisdom of crowds,” which
suggests that the decision-making of a larger group of people is typically better than
that of an individual expert. Similarly, ensemble learning refers to a group (or
ensemble) of base learners, or models, which work collectively to achieve a better
final prediction. A single model, also known as a base or weak learner, may not
perform well individually due to high variance or high bias. However, when weak
learners are aggregated, they can form a strong learner, as their combination reduces
bias or variance, yielding better model performance.
Ensemble methods are frequently illustrated using decision trees as this algorithm
can be prone to overfitting (high variance and low bias) when it hasn’t been pruned
and it can also lend itself to underfitting (low variance and high bias) when it’s very
small, like a decision stump, which is a decision tree with one level. Remember,
when an algorithm overfits or underfits to its training set, it cannot generalize well to
new datasets, so ensemble methods are used to counteract this behavior to allow for
generalization of the model to new datasets. While decision trees can exhibit high
variance or high bias, it’s worth noting that it is not the only modeling technique that
leverages ensemble learning to find the “sweet spot” within the bias-variance
tradeoff.
Bagging and boosting are two main types of ensemble learning methods. The main
difference between these learning methods is the way in which they are trained. In
bagging, weak learners are trained in parallel, but in boosting, they learn sequentially.
This means that a series of models are constructed and with each new model
iteration, the weights of the misclassified data in the previous model are increased.
This redistribution of weights helps the algorithm identify the parameters that it
needs to focus on to improve its performance. AdaBoost, which stands for “adaptive
boosting algorithm,” is one of the most popular boosting algorithms as it was one of
the first of its kind. Other types of boosting algorithms include XGBoost,
GradientBoost, and BrownBoost.
Bagging and boosting also differ in the scenarios in which they are used. For example, bagging methods are typically used on weak learners
which exhibit high variance and low bias, whereas boosting methods are leveraged
when low variance and high bias is observed.
How bagging works
In 1996, Leo Breiman introduced the bagging algorithm, which has three basic steps:
1. Bootstrapping: random samples of the training set are drawn with replacement, producing several overlapping subsets of the data.
2. Parallel training: a weak learner is trained independently on each bootstrap sample.
3. Aggregation: the individual predictions are combined (by averaging for regression or by majority vote for classification) to produce the final prediction.
There are a number of key advantages and challenges that the bagging method presents when used for classification or regression problems. The key challenges of bagging include:
Loss of interpretability: It’s difficult to draw very precise business insights through
bagging because of the averaging involved across predictions. While the output is
more precise than any individual data point, a more accurate or complete dataset
could also yield more precision within a single classification or regression model.
Computationally expensive: Bagging slows down and grows more intensive as the
number of iterations increases. Thus, it’s not well-suited for real-time applications.
Clustered systems or a large number of processing cores are ideal for quickly
creating bagged ensembles on large test sets.
Less flexible: As a technique, bagging works particularly well with algorithms that are less stable. Algorithms that are more stable or subject to high amounts of bias do not provide as much benefit, as there is less variation within the dataset of the model.
Applications of Bagging
The bagging technique is used across a large number of industries, providing insights
for both real-world value and interesting perspectives, such as in the GRAMMY
Debates with Watson. Key use cases include:
● Healthcare: Bagging has been used to form medical data predictions. For
example, ensemble methods have been used for an array of bioinformatics problems,
such as gene and/or protein selection to identify a specific trait of interest.
● IT: Bagging can also improve the precision and accuracy of IT systems, such as network intrusion detection systems.
● Environment: Ensemble methods, such as bagging, have been applied within
the field of remote sensing.
● Finance: Bagging has also been leveraged with deep learning models in the
finance industry, automating critical tasks, including fraud detection, credit risk
evaluations, and option pricing problems.
BOOSTING:
Boosting is an ensemble method that converts a sequence of weak learners into a single strong learner, with each new model trying to correct the errors of the models before it.
Weak learners
Weak learners have low prediction accuracy, similar to random guessing. They are
prone to overfitting—that is, they can't classify data that varies too much from their
original dataset. For example, if you train the model to identify cats as animals with
pointed ears, it might fail to recognize a cat whose ears are curled.
Strong learners
Strong learners have higher prediction accuracy. Boosting converts a system of weak
learners into a single strong learning system. For example, to identify the cat image,
it combines a weak learner that guesses for pointy ears and another learner that
guesses for cat-shaped eyes. After analyzing the animal image for pointy ears, the
system analyzes it once again for cat-shaped eyes. This improves the system's overall
accuracy.
To understand how boosting works, let's describe how machine learning models
make decisions. Although there are many variations in implementation, data
scientists often use boosting with decision-tree algorithms:
Decision trees
Decision trees are data structures in machine learning that work by dividing the
dataset into smaller and smaller subsets based on their features. The idea is that
decision trees split up the data repeatedly until there is only one class left. For
example, the tree may ask a series of yes or no questions and divide the data into
categories at every step.
Boosting and bagging are the two common ensemble methods that improve
prediction accuracy. The main difference between these learning methods is the
method of training. In bagging, data scientists improve the accuracy of weak learners
by training several of them at once on multiple datasets. In contrast, boosting trains
weak learners one after another.
The training method varies depending on the type of boosting process, called the
boosting algorithm. However, an algorithm takes the following general steps to train
the boosting model:
Step 1
The boosting algorithm assigns equal weight to each data sample. It feeds the data to
the first machine model, called the base algorithm. The base algorithm makes
predictions for each data sample.
Step 2
The boosting algorithm assesses model predictions and increases the weight of
samples with a more significant error. It also assigns a weight based on model
performance. A model that outputs excellent predictions will have a high amount of
influence over the final decision.
Step 3
The algorithm passes the weighted data to the next decision tree.
Step 4
The algorithm repeats steps 2 and 3 until instances of training errors are below a
certain threshold.
Adaptive boosting
Adaptive Boosting (AdaBoost) was one of the earliest boosting models developed. It
adapts and tries to self-correct in every iteration of the boosting process.
AdaBoost initially gives the same weight to each dataset. Then, it automatically
adjusts the weights of the data points after every decision tree. It gives more weight
to incorrectly classified items to correct them for the next round. It repeats the
process until the residual error, or the difference between actual and predicted values,
falls below an acceptable threshold.
You can use AdaBoost with many predictors, and it is typically not as sensitive as
other boosting algorithms. This approach does not work well when there is a
correlation among features or high data dimensionality. Overall, AdaBoost is a
suitable type of boosting for classification problems.
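A minimal scikit-learn sketch of AdaBoost on a toy classification problem; the library handles the re-weighting of misclassified samples described above. The dataset, number of estimators, and learning rate are illustrative choices.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

# Toy data; the hyper-parameters below are arbitrary for the sketch.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Each new weak learner (a shallow tree by default) focuses on the samples that
# earlier learners misclassified, via the adaptive re-weighting scheme.
ada = AdaBoostClassifier(n_estimators=100, learning_rate=0.5, random_state=0)
ada.fit(X_train, y_train)
print("test accuracy:", ada.score(X_test, y_test))
```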
Gradient boosting
Gradient Boosting (GB) is similar to AdaBoost in that it, too, is a sequential training
technique. The difference between AdaBoost and GB is that GB does not give
incorrectly classified items more weight. Instead, GB software optimizes the loss
function by generating base learners sequentially so that the present base learner is
always more effective than the previous one. This method attempts to generate
accurate results initially instead of correcting errors throughout the process, like
AdaBoost. For this reason, GB software can lead to more accurate results. Gradient
Boosting can help with both classification and regression-based problems.
Ease of implementation and reduction of bias are key benefits of boosting. The method also presents some challenges:
Vulnerability to outliers
Boosting models are vulnerable to outliers, or data values that are different from the rest of the dataset. Because each model attempts to correct the faults of its predecessor, outliers can skew results significantly.
Real-time implementation
You might also find it challenging to use boosting for real-time implementation
because the algorithm is more complex than other processes. Boosting methods have
high adaptability, so you can use a wide variety of model parameters that
immediately affect the model's performance.
The ensemble consists of N trees. Tree1 is trained using the feature matrix X and the
labels y. The predictions labelled y1(hat) are used to determine the training set
residual errors r1. Tree2 is then trained using the feature matrix X and the residual
errors of Tree1 as labels. The predicted results r1(hat) are then used to determine the
residual r2. The process is repeated until all the N trees forming the ensemble are
trained.
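The residual-fitting procedure just described can be written out explicitly. The sketch below builds a tiny gradient-boosted regression ensemble by hand from scikit-learn decision trees; the data, the number of trees N, the tree depth, and the learning rate are illustrative assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy regression data (every value below is an illustrative assumption).
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=200)

n_trees, learning_rate = 5, 0.5
trees, prediction = [], np.zeros_like(y)
residual = y.copy()                     # r1: the target Tree1 is trained on

for _ in range(n_trees):
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residual)   # fit the current residuals
    prediction += learning_rate * tree.predict(X)                # update the ensemble output
    residual = y - prediction                                    # r_{i+1} for the next tree
    trees.append(tree)

print("mean squared error:", np.mean((y - prediction) ** 2))
```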
Deep Learning
Deep learning is a subset of machine learning, which is essentially a neural network
with three or more layers. These neural networks attempt to simulate the behavior of
the human brain—albeit far from matching its ability—allowing it to “learn” from
large amounts of data.
Architectures:
1. Deep Neural Network – It is a neural network with a certain level of
complexity (having multiple hidden layers in between input and output
layers). They are capable of modeling and processing non-linear
relationships.
2. Deep Belief Network (DBN) – It is a class of Deep Neural Network. It is a multi-layer belief network. Steps for performing DBN:
a. Learn a layer of features from visible units using the Contrastive Divergence algorithm.
b. Treat activations of previously trained features as visible units and then learn features of features.
c. Finally, the whole DBN is trained when the learning for the final hidden layer is achieved.
3. Recurrent Neural Network – Performs the same task for every element of a sequence and allows for parallel and sequential computation. Similar to the human brain (a large feedback network of connected neurons), recurrent networks are able to remember important things about the input they received, which enables them to be more precise.