
UNIT V

ADVANCED LEARNING

Reinforcement Learning – Representation Learning – Neural Networks – Active
Learning – Ensemble Learning – Bootstrap Aggregation – Boosting – Gradient
Boosting Machines – Deep Learning

REINFORCEMENT LEARNING:

Reinforcement Learning is a feedback-based machine learning technique in which an
agent learns to behave in an environment by performing actions and observing the
results of those actions. For each good action, the agent gets positive feedback, and
for each bad action, the agent gets negative feedback or a penalty.

● In Reinforcement Learning, the agent learns automatically from this feedback
without any labeled data, unlike supervised learning.
● Since there is no labeled data, the agent is bound to learn from its experience only.
● RL solves a specific type of problem where decision making is sequential and the
goal is long-term, such as game playing, robotics, etc.
● The agent interacts with the environment and explores it by itself. The primary goal
of an agent in reinforcement learning is to improve its performance by collecting the
maximum positive reward.
● The agent learns through trial and error, and based on this experience, it learns to
perform the task in a better way. Hence, we can say that "Reinforcement learning
is a type of machine learning method where an intelligent agent
(computer program) interacts with the environment and learns to act within
it." How a robotic dog learns the movement of its arms is an example of
reinforcement learning.
● It is a core part of Artificial Intelligence, and all AI agents work on the concept of
reinforcement learning. Here we do not need to pre-program the agent, as it learns
from its own experience without any human intervention.
● Example: Suppose there is an AI agent present within a maze environment, and its
goal is to find the diamond. The agent interacts with the environment by performing
some actions, and based on those actions, the state of the agent gets changed, and it
also receives a reward or penalty as feedback.
● The agent continues doing these three things (take an action, change state or remain
in the same state, and get feedback), and by repeating these steps, it learns and
explores the environment.
● The agent learns which actions lead to positive feedback or rewards and which
actions lead to negative feedback or penalty. As a positive reward, the agent gets a
positive point, and as a penalty, it gets a negative point.

Approaches to implement Reinforcement Learning:

There are mainly three ways to implement reinforcement learning in ML, which are:

1. Value-based:
The value-based approach aims to find the optimal value function, which is
the maximum value attainable at a state under any policy. The agent therefore
expects the long-term return at any state s under policy π.

2. Policy-based:
The policy-based approach aims to find the optimal policy for the maximum future
rewards without using the value function. In this approach, the agent tries to
apply a policy such that the action performed at each step helps to maximize
the future reward.
The policy-based approach has mainly two types of policy:

○ Deterministic: The same action is produced by the policy (π) at any state.

○ Stochastic: In this policy, probability determines the produced action.

3. Model-based: In the model-based approach, a virtual model is created for the
environment, and the agent explores that environment to learn it. There is no
particular solution or algorithm for this approach because the model
representation is different for each environment.

Reinforcement Learning Working:

To understand the working process of RL, we need to consider two main things:
○ Environment: It can be anything such as a room, maze, football ground, etc.

○ Agent: An intelligent agent such as AI robot.

Let's take an example of a maze environment that the agent needs to explore.
Consider the below image:

In the above image, the agent is at the very first block of the maze. The maze
consists of an S6 block, which is a wall, an S8 block, which is a fire pit, and an S4
block, which contains the diamond.

The agent cannot cross the S6 block, as it is a solid wall. If the agent reaches the S4
block, it gets a +1 reward; if it reaches the fire pit, it gets a -1 reward point. It
can take four actions: move up, move down, move left, and move right.

The agent can take any path to reach the final point, but it needs to do so in the
fewest possible steps. Suppose the agent considers the path S9-S5-S1-S2-S3; then it
will get the +1 reward point.
The agent will try to remember the preceding steps that it has taken to reach the final
step. To memorize the steps, it assigns the value 1 to each previous step. Consider the
below image:

Now, the agent has successfully stored the previous steps by assigning the value 1 to
each previous block. But what will the agent do if it starts moving from a block
which has blocks with value 1 on both sides? Consider the below diagram:
It will be difficult for the agent to decide whether it should go up or down, as each
block has the same value. So, the above approach is not suitable for the agent to
reach the destination. Hence, to solve the problem, we will use the Bellman
equation, which is the main concept behind reinforcement learning.

The Bellman Equation:

The Bellman equation was introduced by the mathematician Richard Ernest
Bellman in the year 1953, and hence it is called the Bellman equation. It is
associated with dynamic programming and is used to calculate the value of a decision
problem at a certain point by including the values of previous states.

It is a way of calculating the value functions in dynamic programming, and it is the
idea that leads to modern reinforcement learning.

The key elements used in the Bellman equation are:

○ The action performed by the agent, referred to as "a".

○ The state obtained by performing the action, "s".

○ The reward/feedback obtained for each good and bad action, "R".

○ The discount factor, Gamma "γ".

The Bellman equation can be written as:

V(s) = max [R(s,a) + γV(s')]

Where,

V(s) = the value calculated at a particular state.

R(s,a) = the reward obtained at state s by performing action a.

γ = the discount factor.

V(s') = the value of the state s' reached after taking the action.

In the above equation, we take the maximum over all possible actions a because the
agent always tries to find the optimal solution.

So now, using the Bellman equation, we will find the value at each state of the given
environment. We will start from the block which is next to the target block.

For 1st block:

V(s3) = max [R(s,a) + γV(s')], here V(s') = 0 because there is no further state to
move to.

V(s3) = max[R(s,a)] => V(s3) = max[1] => V(s3) = 1.

For 2nd block:


V(s2) = max [R(s,a) + γV(s')], here γ = 0.9 (say), V(s') = 1, and R(s, a) = 0, because
there is no reward at this state.

V(s2) = max[0.9(1)] => V(s2) = max[0.9] => V(s2) = 0.9

For 3rd block:

V(s1) = max [R(s,a) + γV(s')], here γ = 0.9 (say), V(s') = 0.9, and R(s, a) = 0, because
there is no reward at this state either.

V(s1) = max[0.9(0.9)] => V(s1) = max[0.81] => V(s1) = 0.81

For 4th block:

V(s5) = max [R(s,a) + γV(s')], here γ = 0.9 (say), V(s') = 0.81, and R(s, a) = 0, because
there is no reward at this state either.

V(s5) = max[0.9(0.81)] => V(s5) = max[0.729] => V(s5) = 0.73

For 5th block:

V(s9) = max [R(s,a) + γV(s')], here γ = 0.9 (say), V(s') = 0.73, and R(s, a) = 0, because
there is no reward at this state either.

V(s9) = max[0.9(0.73)] => V(s9) = max[0.657] => V(s9) = 0.66

Consider the below image:


Now, we will move further to the 6th block, and here the agent may change the route
because it always tries to find the optimal path. So now, let's consider the block
next to the fire pit.
Now, the agent has three options to move: if it moves to the blue box, it will feel a
bump; if it moves to the fire pit, it will get the -1 reward. But here we are
taking only positive rewards, so it will move upwards only. The complete
block values will be calculated using this formula. Consider the below image:
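The backward calculation above can be reproduced with a few lines of Python. The
following is only an illustrative sketch (the state names and the single rewarding path
S9-S5-S1-S2-S3 are taken from the example above; the code itself is not part of the
original material):

# A minimal sketch of backing up values with the Bellman equation
# V(s) = max_a [R(s, a) + gamma * V(s')], assuming the single rewarding
# path S9 -> S5 -> S1 -> S2 -> S3 described above.
gamma = 0.9
path = ["s9", "s5", "s1", "s2", "s3"]   # s3 is adjacent to the diamond
values = {}

# The block next to the target receives the +1 reward directly.
values["s3"] = 1.0

# Every earlier block only inherits the discounted value of its successor.
for state, next_state in zip(reversed(path[:-1]), reversed(path[1:])):
    values[state] = round(gamma * values[next_state], 2)

print(values)   # {'s3': 1.0, 's2': 0.9, 's1': 0.81, 's5': 0.73, 's9': 0.66}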
Types of Reinforcement learning:

There are mainly two types of reinforcement learning, which are:

○ Positive Reinforcement

○ Negative Reinforcement

Positive Reinforcement:

Positive reinforcement means adding something to increase the tendency that the
expected behavior will occur again. It has a positive impact on the behavior of the
agent and increases the strength of the behavior.

This type of reinforcement can sustain the changes for a long time, but too much
positive reinforcement may lead to an overload of states that can reduce the
consequences.

Negative Reinforcement:
Negative reinforcement is the opposite of positive reinforcement, as it
increases the tendency that the specific behavior will occur again by avoiding the
negative condition.

It can be more effective than positive reinforcement, depending on the situation and
behavior, but it provides reinforcement only to meet the minimum required behavior.

Reinforcement Learning Algorithms:

Reinforcement learning algorithms are mainly used in AI applications and gaming
applications. The most used algorithms are:

○ Q-Learning:

○ Q-learning is an off-policy RL algorithm which is used for temporal
difference learning. Temporal difference learning methods are a
way of comparing temporally successive predictions.

○ It learns the value function Q(s, a), which indicates how good it is to take
action "a" in a particular state "s".

○ The below flowchart explains the working of Q-learning:


○ State Action Reward State action (SARSA):

○ SARSA stands for State Action Reward State action, which is an


on-policy temporal difference learning method. The on-policy control
method selects the action for each state while learning using a specific
policy.

○ The goal of SARSA is to calculate the Q π (s, a) for the selected


current policy π and all pairs of (s-a).

○ The main difference between Q-learning and SARSA algorithms is that


unlike Q-learning, the maximum reward for the next state is not
required for updating the Q-value in the table.

○ In SARSA, new action and reward are selected using the same policy,
which has determined the original action.

○ SARSA is so named because it uses the quintuple (s, a, r, s', a'). Where,
s: original state
a: Original action
r: reward observed while following the states
s' and a': New state, action pair.

○ Deep Q Neural Network (DQN):

○ As the name suggests, DQN is Q-learning using Neural networks.

○ For a big state space environment, it will be a challenging and complex


task to define and update a Q-table.

○ To solve such an issue, we can use a DQN algorithm, where, instead of
defining a Q-table, a neural network approximates the Q-values for each
action and state.

Now, let us look at Q-learning in more detail.

Q-Learning Explanation:

○ Q-learning is a popular model-free reinforcement learning algorithm based on


the Bellman equation.

○ The main objective of Q-learning is to learn the policy which can inform
the agent what actions should be taken for maximizing the reward under
what circumstances.

○ It is an off-policy RL algorithm that attempts to find the best action to take in
the current state.

○ The goal of the agent in Q-learning is to maximize the value of Q.

○ The value of Q-learning can be derived from the Bellman equation. Consider
the Bellman equation given below:
In the equation, we have various components, including the reward, the discount
factor (γ), the transition probability, and the next state s'. But no Q-value is given
yet, so first consider the below image:

In the above image, we can see there is an agent that has three value options, V(s1),
V(s2), and V(s3). As this is an MDP, the agent only cares about the current state and
the future state. The agent can go in any direction (up, left, or right), so it needs to
decide where to go for the optimal path. Here the agent will move on a probability
basis and change the state. But if we want some exact moves, we need to
make some changes in terms of the Q-value. Consider the below image:
Q represents the quality of the actions at each state. So instead of using a value at
each state, we will use a pair of state and action, i.e., Q(s, a). The Q-value specifies
which action is more lucrative than others, and according to the best Q-value, the
agent takes its next move. The Bellman equation can be used for deriving the Q-value.

To perform any action, the agent will get a reward R(s, a), and it will also end up in
a certain state s', so, following the Bellman equation above, the Q-value equation will be:

Q(s, a) = R(s, a) + γV(s')

Hence, we can say that V(s) = max [Q(s, a)], where the maximum is taken over the actions a.

The above formula is used to estimate the Q-values in Q-Learning.

What is 'Q' in Q-learning?


The Q stands for quality in Q-learning, which means it specifies the quality of an
action taken by the agent.

Q-table:

A Q-table or matrix is created while performing the Q-learning. The table follows the
state and action pair, i.e., [s, a], and initializes the values to zero. After each action,
the table is updated, and the q-values are stored within the table.

The RL agent uses this Q-table as a reference table to select the best action based on
the q-values.
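The update behind such a Q-table can be sketched in a few lines of Python. This is
only an assumed, minimal illustration (the toy sizes, the learning rate alpha and the
helper names are chosen here for illustration and are not taken from the source):

import numpy as np

# A minimal sketch of a tabular Q-learning update for a toy environment
# with n_states states and 4 actions (up, down, left, right).
n_states, n_actions = 9, 4
alpha, gamma, epsilon = 0.1, 0.9, 0.2   # learning rate, discount, exploration

Q = np.zeros((n_states, n_actions))     # the Q-table, initialised to zero

def update(state, action, reward, next_state):
    """One Q-learning step: move Q(s, a) toward r + gamma * max_a' Q(s', a')."""
    target = reward + gamma * np.max(Q[next_state])
    Q[state, action] += alpha * (target - Q[state, action])

def choose_action(state):
    """Epsilon-greedy action selection from the Q-table."""
    if np.random.rand() < epsilon:
        return np.random.randint(n_actions)   # explore
    return int(np.argmax(Q[state]))           # exploit the best known action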

Reinforcement Learning Applications:

1. Robotics:

a. RL is used in Robot navigation, Robo-soccer, walking, juggling, etc.

2. Control:

a. RL can be used for adaptive control, such as factory processes,
admission control in telecommunication, and helicopter piloting.

3. Game Playing:

a. RL can be used in Game playing such as tic-tac-toe, chess, etc.

4. Chemistry:

a. RL can be used for optimizing the chemical reactions.

5. Business:

a. RL is now used for business strategy planning.

6. Manufacturing:
a. In various automobile manufacturing companies, the robots use deep
reinforcement learning to pick goods and put them in some containers.

7. Finance Sector:

a. RL is currently used in the finance sector for evaluating trading
strategies.

REPRESENTATION LEARNING:

Representation learning is a very important aspect of machine learning which


automatically discovers the feature patterns in the data. When the machine is
provided with the data, it learns the representation itself without any human
intervention. The goal of representation learning is to train machine learning
algorithms to learn useful representations, such as those that are interpretable,
incorporate latent features, or can be used for transfer learning.

In representation learning, data is sent into the machine, and it learns the
representation on its own. It is a way of determining a data representation of the
features, the distance function, and the similarity function that determines how the
predictive model will perform. Representation learning works by reducing
high-dimensional data to low-dimensional data, making it easier to discover patterns
and anomalies while also providing a better understanding of the data’s overall
behaviour.

Basically, Machine learning tasks such as classification frequently demand input that
is mathematically and computationally convenient to process, which motivates
representation learning. Real-world data, such as photos, video, and sensor data, has
resisted attempts to define certain qualities algorithmically. An approach is to
examine the data for such traits or representations rather than depending on explicit
techniques.

Methods of Representation Learning


We must employ representation learning to ensure that the model provides invariant
and disentangled outcomes in order to increase its accuracy and performance. In this
section, we’ll look at how representation learning can improve the model’s
performance in different learning frameworks: supervised learning, unsupervised
learning, and deep architectures.

Supervised Learning
This is referred to as supervised learning when the ML or DL model maps the input
X to the output Y. The computer tries to correct itself by comparing model output to
ground truth, and the learning process optimizes the mapping from input to output.
This process is repeated until the optimization function reaches the global minimum.

Even when the optimization function reaches the global minimum, the model does
not always perform well on new data, resulting in overfitting. While supervised
learning does not necessitate a significant amount of data to learn the mapping from
input to output, it does necessitate the learned features. The prediction accuracy can
improve by up to 17 percent when the learned attributes are incorporated into the
supervised learning algorithm.

Using labelled input data, features are learned in supervised feature learning.
Supervised neural networks, multilayer perceptrons, and (supervised) dictionary
learning are some examples.

Unsupervised Learning
Unsupervised learning is a sort of machine learning in which the labels are ignored in
favour of the observation itself. Unsupervised learning isn’t used for classification or
regression; instead, it’s used to uncover underlying patterns, cluster data, denoise it,
detect outliers, and decompose data, among other things.

When working with data x, we must be very careful about whatever features z we use
to ensure that the patterns produced are accurate. It has been observed that having
more data does not always imply having better representations. We must be careful to
develop a model that is both flexible and expressive so that the extracted features can
convey critical information.
Unsupervised feature learning learns features from unlabeled input data; dictionary
learning, independent component analysis, autoencoders, matrix factorization, and
various forms of clustering are among the examples.

In the next section, we will see more about these methods and workflow, how they
learn the representation in detail.

Supervised Methods

Supervised Dictionary Learning


Dictionary learning creates a set of representative elements (dictionary) from the
input data, allowing each data point to be represented as a weighted sum of the
representative elements. By minimizing the average representation error (across the
input data) and applying L1 regularization to the weights, the dictionary items and
weights may be obtained, i.e., the representation of each data point has only a few
nonzero weights.

For optimizing dictionary elements, supervised dictionary learning takes advantage


of both the structure underlying the input data and the labels. The supervised
dictionary learning technique uses dictionary learning to solve classification issues
by optimizing dictionary elements, data point weights, and classifier parameters
based on the input data.

A minimization problem is formulated, with the objective function consisting of the


classification error, the representation error, an L1 regularization on the representing
weights for each data point (to enable sparse data representation), and an L2
regularization on the parameters of the classification algorithm.

Multi-Layer Perceptron
The perceptron is the most basic neural unit, consisting of a succession of inputs and
weights whose output is compared to the ground truth. A multi-layer perceptron, or
MLP, is a feed-forward neural network made up of layers of perceptron units. An
MLP is made up of three node layers: an input layer, a hidden layer, and an output
layer. The MLP is commonly referred to as the vanilla neural network because it is a
very basic artificial neural network.

This notion serves as a foundation for hidden variables and representation learning.
Our goal here is to determine the variables or weights that can represent the
underlying distribution of the entire data, so that when we apply those variables or
weights to unknown data, we receive results that are almost identical to those on the
original data. In a word, artificial neural networks (ANNs) assist us in
extracting meaningful patterns from a dataset.
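As an illustration of how an MLP learns a hidden representation, the following
scikit-learn sketch (with an assumed synthetic dataset and layer size, not taken from
the source) fits a small network and inspects its hidden-layer weights:

from sklearn.neural_network import MLPClassifier
from sklearn.datasets import make_classification

# A minimal sketch of a multi-layer perceptron learning an internal
# representation of the data in its hidden layer.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

mlp = MLPClassifier(hidden_layer_sizes=(16,), max_iter=500, random_state=0)
mlp.fit(X, y)

# The weights of the trained hidden layer act as the learned representation.
hidden_weights = mlp.coefs_[0]          # shape: (20, 16)
print(mlp.score(X, y), hidden_weights.shape)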

Neural Networks
Neural networks are a class of learning algorithms that employ a “network” of
interconnected nodes in various layers. It’s based on the animal nervous system, with
nodes resembling neurons and edges resembling synapses. The network establishes
computational rules for passing input data from the network’s input layer to the
network’s output layer, and each edge has an associated weight.

The relationship between the input and output layers, which is parameterized by the
weights, is described by a network function associated with the neural network.
Various learning tasks can be achieved by minimizing a cost function over the
network function (w), provided the network function is correctly defined.

Unsupervised Methods
Learning Representation from unlabeled data is referred to as unsupervised feature
learning. Unsupervised Representation learning frequently seeks to uncover
low-dimensional features that encapsulate some structure beneath the
high-dimensional input data.

K-Means Clustering
K-means clustering is a vector quantization approach. An n-vector set is divided into
k clusters (i.e., subsets) via K-means clustering, with each vector belonging to the
cluster with the closest mean. The problem is computationally NP-hard, although
approximate greedy techniques are commonly used.

K-means clustering divides an unlabeled collection of inputs into k groups before
obtaining centroid-based features. These features can be obtained in a variety of
ways. The simplest method is to add k binary features to each sample, with feature j
having a value of one if and only if the jth centroid learned by k-means is the closest
to the sample under consideration. Cluster distances can also be used as features after
being processed with a radial basis function.
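The centroid-based binary features described above can be sketched as follows; the
dataset, the number of clusters and the one-hot encoding step are illustrative
assumptions:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# A minimal sketch of k-means-based feature learning: each sample is
# re-represented by a one-hot vector marking its closest learned centroid.
X, _ = make_blobs(n_samples=300, centers=5, n_features=10, random_state=0)

kmeans = KMeans(n_clusters=5, n_init=10, random_state=0).fit(X)
closest = kmeans.predict(X)                       # index of nearest centroid

binary_features = np.eye(5)[closest]              # k binary features per sample
print(binary_features.shape)                      # (300, 5)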

Local Linear Embedding


LLE is a nonlinear learning strategy for constructing low-dimensional
neighbour-preserving representations from high-dimensional (unlabeled) input.
LLE’s main goal is to reconstruct high-dimensional data using lower-dimensional
points while keeping some geometric elements of the original data set’s neighbours.

There are two major steps in LLE. The first step is “neighbour-preserving,” in which
each input data point Xi is reconstructed as a weighted sum of its K nearest neighbour
data points, with the optimal weights determined by minimizing the average squared
reconstruction error (i.e., the difference between an input point and its reconstruction)
while constraining the weights associated with each point to sum to one.
The second stage involves “dimension reduction,” which entails searching for vectors
in a lower-dimensional space that reduce the representation error while still using the
optimal weights from the previous step.

The weights are optimized given fixed data in the first stage, which can be solved as
a least-squares problem. Lower-dimensional points are optimized with fixed weights
in the second phase, which can be solved using sparse eigenvalue decomposition.
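A minimal scikit-learn sketch of LLE, reducing an assumed 3-D swiss-roll dataset to
two neighbour-preserving dimensions, is shown below (the dataset and parameter
values are illustrative only):

from sklearn.manifold import LocallyLinearEmbedding
from sklearn.datasets import make_swiss_roll

# A minimal sketch of LLE reducing a 3-D swiss roll to 2-D while
# preserving local neighbourhood structure.
X, _ = make_swiss_roll(n_samples=1000, random_state=0)

lle = LocallyLinearEmbedding(n_neighbors=12, n_components=2, random_state=0)
X_low = lle.fit_transform(X)
print(X_low.shape)   # (1000, 2)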

Unsupervised Dictionary Learning


For optimizing dictionary elements, unsupervised dictionary learning does not use
data labels and instead relies on the structure underlying the data. Sparse coding,
which seeks to learn basic functions (dictionary elements) for data representation
from unlabeled input data, is an example of unsupervised dictionary learning.

When the number of dictionary elements exceeds the dimension of the input data,
sparse coding can be used to learn overcomplete dictionaries. K-SVD is an algorithm
for learning a dictionary of elements that allows for sparse representation.

Deep Architectures Methods


Deep learning architectures for feature learning are inspired by the hierarchical
architecture of the biological brain system, which stacks numerous layers of learning
nodes. The premise of distributed representation is typically used to construct these
architectures: observable data is generated by the interactions of many diverse
components at several levels.

Restricted Boltzmann Machine (RBMs)


In multilayer learning frameworks, RBMs (restricted Boltzmann machines) are
widely used as building blocks. An RBM is a bipartite undirected network having a
set of binary hidden variables, visible variables, and edges connecting the hidden and
visible nodes. It is a variant of the more general Boltzmann machine, with the added
constraint of no connections within a layer (no hidden-hidden or visible-visible
edges). In an RBM, each edge has a weight assigned to it. The connections and
weights define an energy function that can be used to generate a joint distribution of
visible and hidden nodes.
For unsupervised representation learning, an RBM can be thought of as a single-layer
design. The visible variables, in particular, relate to the input data, whereas the
hidden variables correspond to the feature detectors. Hinton’s contrastive divergence
(CD) approach can be used to train the weights by maximizing the probability of
visible variables.

Autoencoders
Deep network representations have been found to be insensitive to complex noise or
data conflicts. This can be linked to the architecture to some extent. The employment
of convolutional layers and max-pooling, for example, can be proven to produce
transformation insensitivity.

Autoencoders are therefore neural networks that may be taught to do representation


learning. Autoencoders seek to duplicate their input to their output using an encoder
and a decoder. Autoencoders are typically trained via recirculation, a learning process
that compares the activation of the input network to the activation of the
reconstructed input.
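A minimal autoencoder sketch using Keras is shown below; the 784-dimensional
input, the 32-unit code size and the random stand-in data are assumptions made for
illustration only, and in practice the network would be trained on real data:

import numpy as np
from tensorflow.keras import layers, Model

# A minimal sketch of an autoencoder: an encoder compresses the input to a
# low-dimensional code, and a decoder tries to reconstruct the original
# input from that code.
inputs = layers.Input(shape=(784,))
code = layers.Dense(32, activation="relu")(inputs)        # learned representation
outputs = layers.Dense(784, activation="sigmoid")(code)   # reconstruction

autoencoder = Model(inputs, outputs)
autoencoder.compile(optimizer="adam", loss="mse")

# Train to reproduce the input at the output (here on random stand-in data).
X = np.random.rand(256, 784).astype("float32")
autoencoder.fit(X, X, epochs=3, batch_size=32, verbose=0)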

NEURAL NETWORKS:

A neural network is a method in artificial intelligence that teaches computers to


process data in a way that is inspired by the human brain. It is a type of machine
learning process, called deep learning, that uses interconnected nodes or neurons in a
layered structure that resembles the human brain. It creates an adaptive system that
computers use to learn from their mistakes and improve continuously. Thus, artificial
neural networks attempt to solve complicated problems, like summarizing documents
or recognizing faces, with greater accuracy.

How do artificial neural networks work?

An ANN usually involves a large number of processors operating in parallel and


arranged in tiers. The first tier receives the raw input information -- analogous to
optic nerves in human visual processing. Each successive tier receives the output
from the tier preceding it, rather than the raw input -- in the same way neurons
further from the optic nerve receive signals from those closer to it. The last tier
produces the output of the system.

Each processing node has its own small sphere of knowledge, including what it has
seen and any rules it was originally programmed with or developed for itself. The
tiers are highly interconnected, which means each node in tier n will be connected to
many nodes in tier n-1 -- its inputs -- and in tier n+1, which provides input data for
those nodes. There may be one or multiple nodes in the output layer, from which the
answer it produces can be read.

Artificial neural networks are notable for being adaptive, which means they modify
themselves as they learn from initial training, and subsequent runs provide more
information about the world. The most basic learning model is centered on weighting
the input streams, which is how each node weights the importance of input data from
each of its predecessors. Inputs that contribute to getting right answers are weighted
higher.

Simple neural network architecture


A basic neural network has interconnected artificial neurons in three layers:

Input Layer

Information from the outside world enters the artificial neural network from the input
layer. Input nodes process the data, analyze or categorize it, and pass it on to the next
layer.

Hidden Layer

Hidden layers take their input from the input layer or other hidden layers. Artificial
neural networks can have a large number of hidden layers. Each hidden layer
analyzes the output from the previous layer, processes it further, and passes it on to
the next layer.

Output Layer

The output layer gives the final result of all the data processing by the artificial
neural network. It can have single or multiple nodes. For instance, if we have a
binary (yes/no) classification problem, the output layer will have one output node,
which will give the result as 1 or 0. However, if we have a multi-class classification
problem, the output layer might consist of more than one output node.

Deep neural network architecture

Deep neural networks, or deep learning networks, have several hidden layers with
millions of artificial neurons linked together. A number, called weight, represents the
connections between one node and another. The weight is a positive number if one
node excites another, or negative if one node suppresses the other. Nodes with higher
weight values have more influence on the other nodes.
Theoretically, deep neural networks can map any input type to any output type.
However, they also need much more training as compared to other machine learning
methods. They need millions of examples of training data rather than perhaps the
hundreds or thousands that a simpler network might need.

Types of neural networks:

Artificial neural networks can be categorized by how the data flows from the input
node to the output node. Below are some examples:

Feedforward neural networks

Feedforward neural networks process data in one direction, from the input node to
the output node. Every node in one layer is connected to every node in the next layer.
A feedforward network uses a feedback process, such as backpropagation, to improve its predictions over time.

Backpropagation algorithm

Artificial neural networks learn continuously by using corrective feedback loops to


improve their predictive analytics. In simple terms, you can think of the data flowing
from the input node to the output node through many different paths in the neural
network. Only one path is the correct one that maps the input node to the correct
output node. To find this path, the neural network uses a feedback loop, which works
as follows:

1. Each node makes a guess about the next node in the path.

2. It checks if the guess was correct. Nodes assign higher weight values to paths
that lead to more correct guesses and lower weight values to node paths that
lead to incorrect guesses.
3. For the next data point, the nodes make a new prediction using the higher
weight paths and then repeat Step 1.
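The weighting idea in this feedback loop can be illustrated with a tiny single-layer
network trained by gradient descent; the toy data, learning rate and sigmoid output
are assumptions made for this sketch and are not from the source:

import numpy as np

# A minimal sketch of the weighting idea behind backpropagation: a
# single-layer network nudges its weights in the direction that reduces
# the prediction error on each pass over the data.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                             # 100 samples, 3 features
y = (X @ np.array([1.0, -2.0, 0.5]) > 0).astype(float)    # toy target

w = np.zeros(3)                               # initial weights
lr = 0.1                                      # learning rate

for _ in range(200):
    pred = 1 / (1 + np.exp(-(X @ w)))         # forward pass (sigmoid output)
    grad = X.T @ (pred - y) / len(y)          # error fed back to the weights
    w -= lr * grad                            # weights move to reduce the error

print(np.round(w, 2))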

Convolutional neural networks

The hidden layers in convolutional neural networks perform specific mathematical


functions, like summarizing or filtering, called convolutions. They are very useful for
image classification because they can extract relevant features from images that are
useful for image recognition and classification. The convolved form of the image is
easier to process without losing features that are critical for making a good
prediction. Each hidden layer extracts and processes different image features, like
edges, color, and depth.

Advantages of artificial neural networks

Advantages of artificial neural networks include:

● Parallel processing abilities mean the network can perform more than one
job at a time.

● Information is stored on an entire network, not just a database.

● The ability to learn and model nonlinear, complex relationships helps
model the real-life relationships between input and output.

● Fault tolerance means the corruption of one or more cells of the ANN will
not stop the generation of output.

● Gradual corruption means the network will slowly degrade over time,
instead of a problem destroying the network instantly.
● The ability to produce output with incomplete knowledge with the loss of
performance being based on how important the missing information is.

● No restrictions are placed on the input variables, such as how they should
be distributed.

● Machine learning means the ANN can learn from events and make
decisions based on the observations.

● The ability to learn hidden relationships in the data without imposing any
fixed relationship means an ANN can better model highly volatile data
and non-constant variance.

● The ability to generalize and infer unseen relationships on unseen data
means ANNs can predict the output of unseen data.

Disadvantages of artificial neural networks

The disadvantages of ANNs include:

● The lack of rules for determining the proper network structure means the
appropriate artificial neural network architecture can only be found through
trial and error and experience.

● The requirement of processors with parallel processing abilities makes
neural networks hardware-dependent.

● The network works with numerical information, therefore all problems
must be translated into numerical values before they can be presented to the
ANN.
● The lack of explanation behind the solutions produced is one of the biggest
disadvantages of ANNs. The inability to explain the why or how behind the
solution generates a lack of trust in the network.

Applications of artificial neural networks

Image recognition was one of the first areas to which neural networks were
successfully applied, but the technology uses have expanded to many more areas,
including:

● Chatbots

● Natural language processing, translation and language generation

● Stock market prediction

● Delivery driver route planning and optimization

● Drug discovery and development

These are just a few specific areas to which neural networks are being applied today.
Prime uses involve any process that operates according to strict rules or patterns and
has large amounts of data. If the data involved is too large for a human to make sense
of in a reasonable amount of time, the process is likely a prime candidate for
automation through artificial neural networks.
ACTIVE LEARNING:

Active Learning is a special case of Supervised Machine Learning. This approach is


used to construct a high-performance classifier while keeping the size of the training
dataset to a minimum by actively selecting the valuable data points.

Active learning is the name used for the process of prioritising the data which needs

to be labelled in order to have the highest impact on training a supervised model.

Active learning can be used in situations where the amount of data is too large to be

labelled and some priority needs to be made to label the data in a smart way.

But, why don’t we just choose a random subset of data to manually label them?

Let’s look at a very simple example to motivate the discussion. Assume we have

millions of data points which need to be classified based on two features. The actual

solution is shown in the following plot:


Model prediction if all data points were labelled

As one can see, both classes (red and purple) can quite nicely be separated by a

vertical blue line crossing at 0. The problem is that none of the data points are

labelled, so the data is given to us as in the following plot:


Unlabelled data

Unfortunately, we don’t have enough time to label all of the data, so we randomly

choose a subset of the data to label and train a binary classification model on it. The

result is not great, as the model prediction deviates quite a lot from the optimal

boundary.

Model trained on a random subset of labelled data points

This is where active learning can be used to optimise the data points chosen for

labelling and training a model based on them. The following plot shows an example

of training a binary classification model after choosing the training of the model

based on data points labelled after implementing active learning.


Model trained on a subset of data points chosen to be labelled using active
learning

Making a smart choice of which data points to prioritise when labelling can save data

science teams large amounts of time and computation.

Active learning strategy

Steps for active learning

There are multiple approaches studied in the literature on how to prioritise data

points when labelling and how to iterate over the approach. We will nevertheless only

present the most common and straightforward methods.

The steps to use active learning on an unlabelled data set are:


1. The first thing which needs to happen is that a very small subsample of

this data needs to be manually labelled.

2. Once there is a small amount of labelled data, the model needs to be

trained on it. The model is of course not going to be great but will help us

get some insight on which areas of the parameter space need to be labelled

first to improve it.

3. After the model is trained, the model is used to predict the class of each

remaining unlabelled data point.

4. A score is chosen on each unlabelled data point based on the prediction of

the model. In the next subsection we will present some of the possible

scores most commonly used.

5. Once the best approach has been chosen to prioritise the labelling, this

process can be iteratively repeated: a new model can be trained on a new

labelled data set, which has been labelled based on the priority score. Once

the new model has been trained on the subset of data, the unlabelled data

points can be run through the model to update the prioritisation scores to

continue labelling. In this way, one can keep optimising the labelling

strategy as the models become better and better. A minimal sketch of one

round of this loop is given below.
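The sketch below is illustrative only; the logistic regression model, the
least-confidence score and the function name active_learning_round are assumptions
made for this example:

import numpy as np
from sklearn.linear_model import LogisticRegression

# A minimal sketch of one active learning round: train on a small labelled
# pool, score the unlabelled points by model uncertainty, and request
# labels for the least confident ones first.
def active_learning_round(X_labelled, y_labelled, X_unlabelled, n_queries=10):
    model = LogisticRegression(max_iter=1000).fit(X_labelled, y_labelled)

    proba = model.predict_proba(X_unlabelled)        # class probabilities
    confidence = proba.max(axis=1)                   # highest probability per point

    # Least-confidence prioritisation: lowest maximum probability first.
    query_idx = np.argsort(confidence)[:n_queries]
    return model, query_idx                          # send query_idx for labelling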

Prioritisation scores

There are several approaches to assign a priority score to each data point. Below we

describe the three basic ones.


Least confidence:

This is probably the simplest method. It takes the highest probability for each

data point’s prediction and sorts the data points from the smallest to the largest of

these values, so the points whose most probable class has the lowest probability are

labelled first.

Let’s use an example to see how this would work. Assume we have the following

data with three possible classes:

Table 1: Example of probability predictions of a model on three different classes for
four different data points.
In this case, the algorithm would first choose the maximum probability for each data

point, hence:

● X1: 0.9

● X2: 0.87

● X3: 0.5

● X4: 0.99

The second step is to sort the data based on this maximum probability (from smaller

to bigger), hence X3, X2, X1 and X4.

Margin sampling:

This method takes into account the difference between the highest probability and the

second highest probability; formally, the score for each data point is the difference

between its two highest predicted probabilities.

The data points with the lower margin sampling score would be the ones labelled

first; these are the data points the model is least certain about between the most

probable and the next-to-most probable class.

Following the example of Table 1, the corresponding scores for each data point are:

● X1: 0.9–0.07 = 0.83

● X2: 0.87–0.1 = 0.86

● X3: 0.5–0.3 = 0.2

● X4: 0.99–0.01 = 0.98

Hence the data points would be shown to label as follows: X3, X1, X2 and X4. As

one can see the priority in this case is slightly different to the least confident one.

Entropy:

Finally, the last scoring function that we are going to present here is the entropy score.

Entropy is a concept that comes from thermodynamics; in a simple way, it can be

understood as a measure of disorder in a system, for instance a gas in a closed box.

The higher the entropy the more disorder there is, whereas if the entropy is low, it
means that the gas might be mainly in one particular area such as a corner of the box

(maybe when the experiment started, before expanding across the box).

This concept can be reused to measure the certainty of a model. If a model is highly

certain about a class for a given data point, it will probably have a high certainty for a

particular class, whereas all the other classes will have low probability. Isn’t this very

similar to having a gas in the corner of a box? In this case we have most of the

probability assigned to a particular class. In the case of high entropy it would mean

that the model distributes equally the probability for all classes as it is not certain at

all which class that data point belongs to, similarly to having the gas distributed

equally in all parts of the box. It is therefore straightforward to prioritise data points

with higher entropy to the ones with lower entropy.

Formally, we can define the entropy score for a data point as H = -Σᵢ pᵢ log(pᵢ), where

pᵢ is the predicted probability of class i, and data points with higher entropy are

prioritised first.

If we apply the entropy score to the example in Table 1:


● X1: -0.9*log(0.9)-0.07*log(0.07)-0.03*log(0.03) = 0.386

● X2: -0.87*log(0.87)-0.03*log(0.03)-0.1*log(0.1) = 0.457

● X3: -0.2*log(0.2)-0.5*log(0.5)-0.3*log(0.3) =1.03

● X4: -0*log(0)-0.01*log(0.01)-0.99*log(0.99) = 0.056

Note that for X4, 0 should be changed for a small epsilon (e.g. 0.00001) for

numerical stability.

In this case the data points should be shown in the following order: X3, X2, X1 and

X4, which coincides with the order of the least confident scoring method.
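The least-confidence and entropy calculations above can be checked with a short
Python snippet; the per-class probabilities are the ones used in the calculations
(the ordering of classes within each row is assumed for illustration):

import numpy as np

# A small sketch reproducing the uncertainty scores for the four data
# points of Table 1.
probs = {
    "X1": [0.9, 0.07, 0.03],
    "X2": [0.87, 0.03, 0.1],
    "X3": [0.2, 0.5, 0.3],
    "X4": [0.0, 0.01, 0.99],
}

eps = 1e-5   # small epsilon for numerical stability, as noted above
for name, p in probs.items():
    p = np.clip(np.array(p), eps, 1.0)
    least_confidence = p.max()                 # label low values first
    entropy = -np.sum(p * np.log(p))           # label high values first
    print(name, round(least_confidence, 2), round(entropy, 3))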

ENSEMBLE LEARNING:

Ensemble learning helps improve machine learning results by combining several


models. This approach allows the production of better predictive performance
compared to a single model. The basic idea is to learn a set of classifiers (experts) and to
allow them to vote.
Advantage: Improvement in predictive accuracy.
Disadvantage: It is difficult to understand an ensemble of classifiers.
Why do ensembles work?
Dietterich (2002) showed that ensembles overcome three problems –
● Statistical Problem –
The Statistical Problem arises when the hypothesis space is too large for the
amount of available data. Hence, there are many hypotheses with the same
accuracy on the data and the learning algorithm chooses only one of them!
There is a risk that the accuracy of the chosen hypothesis is low on unseen
data!
● Computational Problem –
The Computational Problem arises when the learning algorithm cannot
guarantee finding the best hypothesis.
● Representational Problem –
The Representational Problem arises when the hypothesis space does not
contain any good approximation of the target class(es).

Main Challenge for Developing Ensemble Models?


The main challenge is not to obtain highly accurate base models, but rather to obtain
base models which make different kinds of errors. For example, if ensembles are
used for classification, high accuracies can be accomplished if different base models
misclassify different training examples, even if the base classifier accuracy is low.
Methods for Independently Constructing Ensembles –

● Majority Vote
● Bagging and Random Forest
● Randomness Injection
● Feature-Selection Ensembles
● Error-Correcting Output Coding

Methods for Coordinated Construction of Ensembles –

● Boosting
● Stacking

Reliable Classification: Meta-Classifier Approach


Co-Training and Self-Training

Types of Ensemble Classifier –


Bagging:

Bagging (Bootstrap Aggregation) is used to reduce the variance of a decision tree.


Suppose a set D of d tuples; at each iteration i, a training set Di of d tuples is sampled
with replacement from D (i.e., a bootstrap sample). Then a classifier model Mi is
learned for each training set Di. Each classifier Mi returns its class prediction. The
bagged classifier M* counts the votes and assigns the class with the most votes to X
(the unknown sample).

Implementation steps of Bagging –

1. Multiple subsets are created from the original data set with equal tuples,
selecting observations with replacement.
2. A base model is created on each of these subsets.
3. Each model is learned in parallel from each training set, independently of
the others.
4. The final predictions are determined by combining the predictions from all
the models, as shown in the sketch below.
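A minimal scikit-learn sketch of these bagging steps follows; the synthetic dataset
and parameter values are illustrative assumptions (and the estimator keyword is
named base_estimator in older scikit-learn versions):

from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import make_classification

# A minimal sketch of bagging: bootstrap subsets are drawn with
# replacement, a decision tree is fit on each, and the final class is
# decided by majority vote across the trees.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

bagging = BaggingClassifier(
    estimator=DecisionTreeClassifier(),   # base model trained on each subset
    n_estimators=10,                      # number of bootstrap subsets/models
    bootstrap=True,                       # sample with replacement
    random_state=0,
)
bagging.fit(X, y)
print(bagging.predict(X[:5]))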

Random Forest:

Random Forest is an extension over bagging. Each classifier
in the ensemble is a decision tree classifier and is generated using a random
selection of attributes at each node to determine the split. During classification,
each tree votes and the most popular class is returned.

Implementation steps of Random Forest


1. Multiple subsets are created from the original data set, selecting observations
with replacement.
2. A subset of features is selected randomly, and whichever feature gives the best
split is used to split the node iteratively.
3. The tree is grown to its largest extent.
4. Repeat the above steps, and the final prediction is given based on the
aggregation of predictions from a number of trees.

BOOTSTRAP AGGREGATION:

Bagging, also known as bootstrap aggregation, is the ensemble learning method that
is commonly used to reduce variance within a noisy dataset. In bagging, a random
sample of data in a training set is selected with replacement—meaning that the
individual data points can be chosen more than once. After several data samples are
generated, these weak models are then trained independently, and depending on the
type of task—regression or classification, for example—the average or majority of
those predictions yield a more accurate estimate.

As a note, the random forest algorithm is considered an extension of the bagging


method, using both bagging and feature randomness to create an uncorrelated forest
of decision trees.

Ensemble learning

Ensemble learning gives credence to the idea of the “wisdom of crowds,” which
suggests that the decision-making of a larger group of people is typically better than
that of an individual expert. Similarly, ensemble learning refers to a group (or
ensemble) of base learners, or models, which work collectively to achieve a better
final prediction. A single model, also known as a base or weak learner, may not
perform well individually due to high variance or high bias. However, when weak
learners are aggregated, they can form a strong learner, as their combination reduces
bias or variance, yielding better model performance.

Ensemble methods are frequently illustrated using decision trees as this algorithm
can be prone to overfitting (high variance and low bias) when it hasn’t been pruned
and it can also lend itself to underfitting (low variance and high bias) when it’s very
small, like a decision stump, which is a decision tree with one level. Remember,
when an algorithm overfits or underfits to its training set, it cannot generalize well to
new datasets, so ensemble methods are used to counteract this behavior to allow for
generalization of the model to new datasets. While decision trees can exhibit high
variance or high bias, it’s worth noting that it is not the only modeling technique that
leverages ensemble learning to find the “sweet spot” within the bias-variance
tradeoff.

Bagging vs. boosting

Bagging and boosting are two main types of ensemble learning methods. The main
difference between these learning methods is the way in which they are trained. In
bagging, weak learners are trained in parallel, but in boosting, they learn sequentially.
This means that a series of models are constructed and with each new model
iteration, the weights of the misclassified data in the previous model are increased.
This redistribution of weights helps the algorithm identify the parameters that it
needs to focus on to improve its performance. AdaBoost, which stands for “adaptive
boosting algorithm,” is one of the most popular boosting algorithms as it was one of
the first of its kind. Other types of boosting algorithms include XGBoost,
GradientBoost, and BrownBoost.

Bagging and boosting also differ in the scenarios in which
they are used. For example, bagging methods are typically used on weak learners
which exhibit high variance and low bias, whereas boosting methods are leveraged
when low variance and high bias are observed.
How bagging works
In 1996, Leo Breiman introduced the bagging algorithm, which has three basic steps:

● Bootstrapping: Bagging leverages a bootstrapping sampling technique to


create diverse samples. This resampling method generates different subsets of the
training dataset by selecting data points at random and with replacement. This means
that each time you select a data point from the training dataset, you are able to select
the same instance multiple times. As a result, a value/instance may be repeated twice
(or more) in a sample.
● Parallel training: These bootstrap samples are then trained independently and
in parallel with each other using weak or base learners.
● Aggregation: Finally, depending on the task (i.e. regression or classification),
an average or a majority of the predictions are taken to compute a more accurate
estimate. In the case of regression, an average is taken of all the outputs predicted by
the individual classifiers; this is known as soft voting. For classification problems,
the class with the highest majority of votes is accepted; this is known as hard voting
or majority voting.
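The bootstrapping step can be illustrated with a short NumPy snippet (the ten-row
stand-in dataset is an assumption made for illustration):

import numpy as np

# A tiny sketch of bootstrapping: sampling with replacement, so some rows
# appear more than once and others not at all.
rng = np.random.default_rng(0)
data = np.arange(10)                          # stand-in training set of 10 rows

bootstrap_sample = rng.choice(data, size=10, replace=True)
print(bootstrap_sample)                       # some indices repeated, some missing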

Benefits and challenges of bagging

There are a number of key advantages and challenges that the bagging method
presents when used for classification or regression problems. The key benefits of
bagging include:

● Ease of implementation: Python libraries such as scikit-learn (also known as sklearn)
make it easy to combine the predictions of base learners or estimators to improve
model performance. Their documentation lays out the available modules that you can
leverage in your model optimization.
● Reduction of variance: Bagging can reduce the variance within a learning algorithm.
This is particularly helpful with high-dimensional data, where missing values can
lead to higher variance, making it more prone to overfitting and preventing accurate
generalization to new datasets.

The key challenges of bagging include:

● Loss of interpretability: It’s difficult to draw very precise business insights through
bagging because of the averaging involved across predictions. While the output is
more precise than any individual data point, a more accurate or complete dataset
could also yield more precision within a single classification or regression model.
● Computationally expensive: Bagging slows down and grows more intensive as the
number of iterations increases. Thus, it’s not well-suited for real-time applications.
Clustered systems or a large number of processing cores are ideal for quickly
creating bagged ensembles on large test sets.
● Less flexible: As a technique, bagging works particularly well with algorithms that
are less stable. Algorithms that are more stable or subject to high amounts of bias do
not provide as much benefit, as there’s less variation within the dataset of the model.

Applications of Bagging

The bagging technique is used across a large number of industries, providing insights
for both real-world value and interesting perspectives, such as in the GRAMMY
Debates with Watson. Key use cases include:

● Healthcare: Bagging has been used to form medical data predictions. For
example, ensemble methods have been used for an array of bioinformatics problems,
such as gene and/or protein selection to identify a specific trait of interest.
● IT: Bagging can also improve the precision and accuracy in IT systems, such
as network intrusion detection systems.
● Environment: Ensemble methods, such as bagging, have been applied within
the field of remote sensing.
● Finance: Bagging has also been leveraged with deep learning models in the
finance industry, automating critical tasks, including fraud detection, credit risk
evaluations, and option pricing problems.

BOOSTING:

Boosting is a method used in machine learning to reduce errors in predictive data


analysis. Data scientists train machine learning software, called machine learning
models, on labeled data to make guesses about unlabeled data. A single machine
learning model might make prediction errors depending on the accuracy of the
training dataset. For example, if a cat-identifying model has been trained only on
images of white cats, it may occasionally misidentify a black cat. Boosting tries to
overcome this issue by training multiple models sequentially to improve the accuracy
of the overall system.

Why is boosting important?

Boosting improves machine models' predictive accuracy and performance by


converting multiple weak learners into a single strong learning model. Machine
learning models can be weak learners or strong learners:

Weak learners

Weak learners have low prediction accuracy, similar to random guessing. They are
prone to overfitting—that is, they can't classify data that varies too much from their
original dataset. For example, if you train the model to identify cats as animals with
pointed ears, it might fail to recognize a cat whose ears are curled.

Strong learners

Strong learners have higher prediction accuracy. Boosting converts a system of weak
learners into a single strong learning system. For example, to identify the cat image,
it combines a weak learner that guesses for pointy ears and another learner that
guesses for cat-shaped eyes. After analyzing the animal image for pointy ears, the
system analyzes it once again for cat-shaped eyes. This improves the system's overall
accuracy.

How does boosting work?

To understand how boosting works, let's describe how machine learning models
make decisions. Although there are many variations in implementation, data
scientists often use boosting with decision-tree algorithms:

Decision trees

Decision trees are data structures in machine learning that work by dividing the
dataset into smaller and smaller subsets based on their features. The idea is that
decision trees split up the data repeatedly until there is only one class left. For
example, the tree may ask a series of yes or no questions and divide the data into
categories at every step.

Boosting ensemble method

Boosting creates an ensemble model by combining several weak decision trees
sequentially. It assigns weights to the output of individual trees. Then it gives
incorrect classifications from the first decision tree a higher weight and passes them
as input to the next tree. After numerous cycles, the boosting method combines these
weak rules into a single powerful prediction rule.

Boosting compared to bagging

Boosting and bagging are the two common ensemble methods that improve
prediction accuracy. The main difference between these learning methods is the
method of training. In bagging, data scientists improve the accuracy of weak learners
by training several of them at once on multiple datasets. In contrast, boosting trains
weak learners one after another.

How is training in boosting done?

The training method varies depending on the type of boosting process called the
boosting algorithm. However, an algorithm takes the following general steps to train
the boosting model:

Step 1

The boosting algorithm assigns equal weight to each data sample. It feeds the data to
the first machine model, called the base algorithm. The base algorithm makes
predictions for each data sample.

Step 2
The boosting algorithm assesses model predictions and increases the weight of
samples with a more significant error. It also assigns a weight based on model
performance. A model that outputs excellent predictions will have a high amount of
influence over the final decision.

Step 3

The algorithm passes the weighted data to the next decision tree.

Step 4

The algorithm repeats steps 2 and 3 until instances of training errors are below a
certain threshold.
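These steps are what an AdaBoost-style implementation performs internally. A
minimal scikit-learn sketch is shown below; the synthetic dataset and parameter
values are illustrative assumptions (the estimator keyword is named base_estimator
in older scikit-learn versions):

from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import make_classification

# A minimal sketch of the sequential training loop above: shallow trees are
# trained one after another, and misclassified samples are up-weighted
# before the next tree is fit.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

booster = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=1),   # the weak base algorithm
    n_estimators=50,                                  # number of sequential rounds
    learning_rate=0.5,
    random_state=0,
)
booster.fit(X, y)
print(booster.score(X, y))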

What are the types of boosting?

The following are the three main types of boosting:

Adaptive boosting

Adaptive Boosting (AdaBoost) was one of the earliest boosting models developed. It
adapts and tries to self-correct in every iteration of the boosting process.

AdaBoost initially gives the same weight to each dataset. Then, it automatically
adjusts the weights of the data points after every decision tree. It gives more weight
to incorrectly classified items to correct them for the next round. It repeats the
process until the residual error, or the difference between actual and predicted values,
falls below an acceptable threshold.

You can use AdaBoost with many predictors, and it is typically not as sensitive as
other boosting algorithms. This approach does not work well when there is a
correlation among features or high data dimensionality. Overall, AdaBoost is a
suitable type of boosting for classification problems.
Gradient boosting

Gradient Boosting (GB) is similar to AdaBoost in that it, too, is a sequential training
technique. The difference between AdaBoost and GB is that GB does not give
incorrectly classified items more weight. Instead, GB software optimizes the loss
function by generating base learners sequentially so that the present base learner is
always more effective than the previous one. This method attempts to generate
accurate results initially instead of correcting errors throughout the process, like
AdaBoost. For this reason, GB software can lead to more accurate results. Gradient
Boosting can help with both classification and regression-based problems.

Extreme gradient boosting

Extreme Gradient Boosting (XGBoost) improves gradient boosting for computational


speed and scale in several ways. XGBoost uses multiple cores on the CPU so that
learning can occur in parallel during training. It is a boosting algorithm that can
handle extensive datasets, making it attractive for big data applications. The key
features of XGBoost are parallelization, distributed computing, cache optimization,
and out-of-core processing.

What are the benefits of boosting?

Boosting offers the following major benefits:

Ease of implementation

Boosting has easy-to-understand and easy-to-interpret algorithms that learn from


their mistakes. These algorithms don't require any data preprocessing, and they have
built-in routines to handle missing data. In addition, most languages have built-in
libraries to implement boosting algorithms with many parameters that can fine-tune
performance.

Reduction of bias

Bias is the presence of uncertainty or inaccuracy in machine learning results.


Boosting algorithms combine multiple weak learners in a sequential method, which
iteratively improves observations. This approach helps to reduce high bias that is
common in machine learning models.
Computational efficiency

Boosting algorithms prioritize features that increase predictive accuracy during


training. They can help to reduce data attributes and handle large datasets efficiently.

What are the challenges of boosting?

The following are common limitations of boosting models:

Vulnerability to outlier data

Boosting models are vulnerable to outliers or data values that are different from the
rest of the dataset. Because each model attempts to correct the faults of its
predecessor, outliers can skew results significantly.

Real-time implementation

You might also find it challenging to use boosting for real-time implementation
because the algorithm is more complex than other processes. Boosting methods have
high adaptability, so you can use a wide variety of model parameters that
immediately affect the model's performance.

Gradient Boosting Machines

Gradient Boosting is a popular boosting algorithm. In gradient boosting, each
predictor corrects its predecessor’s error. In contrast to AdaBoost, the weights of the
training instances are not tweaked; instead, each predictor is trained using the
residual errors of the predecessor as labels.
There is a technique called the Gradient Boosted Trees whose base learner is CART
(Classification and Regression Trees).
The below diagram explains how gradient boosted trees are trained for regression
problems.
Gradient Boosted Trees for Regression

The ensemble consists of N trees. Tree1 is trained using the feature matrix X and the
labels y. The predictions labelled y1(hat) are used to determine the training set
residual errors r1. Tree2 is then trained using the feature matrix X and the residual
errors of Tree1 as labels. The predicted results r1(hat) are then used to determine the
residual r2. The process is repeated until all the N trees forming the ensemble are
trained.
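The residual-fitting loop described above can be sketched directly with decision
trees; the toy regression data, tree depth and learning rate are assumptions made for
illustration and are not part of the original material:

import numpy as np
from sklearn.tree import DecisionTreeRegressor

# A minimal sketch of gradient boosted trees for regression: each new tree
# is fit on the residual errors left by the current ensemble.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=200)

learning_rate = 0.1
prediction = np.zeros_like(y)       # start from a zero prediction
trees = []

for _ in range(50):                 # N trees in the ensemble
    residual = y - prediction       # errors of the predecessor become the labels
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residual)
    prediction += learning_rate * tree.predict(X)
    trees.append(tree)

print(round(np.mean((y - prediction) ** 2), 4))   # training error shrinks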
Deep Learning
Deep learning is a subset of machine learning, which is essentially a neural network
with three or more layers. These neural networks attempt to simulate the behavior of
the human brain—albeit far from matching its ability—allowing it to “learn” from
large amounts of data.
Architectures:
1. Deep Neural Network – It is a neural network with a certain level of
complexity (having multiple hidden layers between the input and output
layers). Such networks are capable of modeling and processing non-linear
relationships.
2. Deep Belief Network (DBN) – It is a class of Deep Neural Network; it is a
multi-layer belief network. Steps for training a DBN:
a. Learn a layer of features from the visible units using the Contrastive
Divergence algorithm.
b. Treat the activations of the previously trained features as visible units and
then learn features of features.
c. Finally, the whole DBN is trained when the learning for the final hidden
layer is achieved.
3. Recurrent Neural Network – Performs the same task for every element of a
sequence. It allows for parallel and sequential computation, similar to the
human brain (a large feedback network of connected neurons). Such networks
are able to remember important things about the input they received, which
enables them to be more precise.
