
Module 5

Combining Inductive and Analytical Learning
Reinforcement learning
Reinforcement learning is an area of machine learning concerned with taking suitable actions to maximize reward in a particular situation. It is employed by various software systems and machines to find the best possible behavior or path to take in a specific situation. Reinforcement learning differs from supervised learning: in supervised learning the training data carries the answer key, so the model is trained with the correct answers, whereas in reinforcement learning there is no answer key and the reinforcement agent decides what to do to perform the given task. In the absence of a training dataset, it is bound to learn from its own experience.
Example: We have an agent and a reward, with many hurdles in between. The agent is supposed to find the best possible path to reach the reward. Picture a grid containing a robot, a diamond, and patches of fire. The goal of the robot is to get the reward, the diamond, while avoiding the hurdles, the fire. The robot learns by trying all the possible paths and then choosing the path that reaches the reward with the fewest hurdles. Each right step gives the robot a reward and each wrong step subtracts from its reward. The total reward is calculated when the robot reaches the final reward, the diamond.
Main points in Reinforcement learning –

• Input: The input is an initial state from which the model starts.
• Output: There are many possible outputs, since there are a variety of solutions to a particular problem.
• Training: The training is based on the input; the model returns a state and the user decides whether to reward or punish the model based on its output (see the sketch below).
• The model keeps learning continuously.
• The best solution is decided based on the maximum reward.
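The loop below is a minimal illustrative sketch of this interaction. The SimpleEnv class, its reward scheme, and the random placeholder policy are invented here for demonstration and are not part of the original example.

# Minimal sketch of the reinforcement learning loop described above.
# SimpleEnv is a hypothetical stand-in for a real environment.
import random

class SimpleEnv:
    def reset(self):
        return 0                              # Input: the initial state

    def step(self, action):
        reward = 1 if action == 1 else -1     # reward or punish the chosen move
        next_state = random.randint(0, 3)
        done = next_state == 3                # episode ends at the goal state
        return next_state, reward, done

env = SimpleEnv()
state = env.reset()
total_reward = 0
done = False
while not done:                               # the model keeps interacting step by step
    action = random.choice([0, 1])            # placeholder policy; a learner would improve this
    state, reward, done = env.step(action)
    total_reward += reward                    # the best solution maximizes accumulated reward
print("Total reward:", total_reward)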
Types of Reinforcement: There are two types of Reinforcement:

1. Positive –
Positive reinforcement occurs when an event, triggered by a particular behavior, increases the strength and frequency of that behavior. In other words, it has a positive effect on behavior.
Advantages:
• Maximizes performance
• Sustains change for a long period of time
However, too much reinforcement can lead to an overload of states, which can diminish the results.
2. Negative –
Negative reinforcement is the strengthening of a behavior because a negative condition is stopped or avoided.
Advantages:
• Increases behavior
• Provides a minimum standard of performance
However, it only provides enough to meet the minimum required behavior.

Learning Task
Machine learning is of two types: supervised learning and unsupervised learning.

Supervised Machine Learning:


Supervised learning is a machine learning method in which models are trained using labeled data. In supervised learning, the model needs to find the mapping function that maps the input variable (X) to the output variable (Y).

Supervised learning needs supervision to train the model, similar to how a student learns in the presence of a teacher. Supervised learning can be used for two types of problems: Classification and Regression.

Example: Suppose we have images of different types of fruits. The task of our supervised learning model is to identify the fruits and classify them accordingly. To identify the images, we give the model the input data as well as the corresponding output; that is, we train the model on the shape, size, color, and taste of each fruit. Once training is complete, we test the model with a new set of fruits. The model identifies the fruit and predicts the output using a suitable algorithm.
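As an illustration only, the snippet below sketches this idea with scikit-learn; the numeric fruit features (weight and diameter) and the labels are invented for the example and are not part of the original text.

# Minimal supervised-learning sketch (assumed illustrative data, scikit-learn).
from sklearn.tree import DecisionTreeClassifier

# Hypothetical features: [weight_in_grams, diameter_in_cm]; labels are the fruit names.
X_train = [[150, 7], [170, 8], [120, 6], [300, 10], [330, 11]]
y_train = ["apple", "apple", "apple", "melon", "melon"]

model = DecisionTreeClassifier()
model.fit(X_train, y_train)        # training with labeled (input, output) pairs

# Test the trained model on a new, unseen fruit.
print(model.predict([[160, 7]]))   # expected to predict "apple"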

Unsupervised Machine Learning:


Unsupervised learning is another machine learning method in which patterns are inferred from unlabeled input data. The goal of unsupervised learning is to find the structure and patterns in the input data. Unsupervised learning does not need any supervision; instead, it finds patterns in the data on its own.

Unsupervised learning can be used for two types of problems: Clustering and Association.

Example: To understand unsupervised learning, we reuse the fruit example given above. Unlike supervised learning, here we do not provide any supervision to the model. We simply provide the input dataset and allow the model to find patterns in the data. With the help of a suitable algorithm, the model trains itself and divides the fruits into different groups according to their most similar features.
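A matching sketch for the unsupervised case, again with invented feature values and no labels; the model is left to group the fruits on its own.

# Minimal unsupervised-learning sketch (same assumed features, no labels).
from sklearn.cluster import KMeans

X = [[150, 7], [170, 8], [120, 6], [300, 10], [330, 11]]

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)     # the model groups similar fruits by itself
print(labels)                      # e.g. [0 0 0 1 1]: two clusters found from the data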

Q-Learning Explanation:
o Q-learning is a popular model-free reinforcement learning algorithm based on the Bellman equation.
o The main objective of Q-learning is to learn a policy that tells the agent which actions should be taken, under which circumstances, to maximize the reward.
o It is an off-policy RL algorithm that attempts to find the best action to take in the current state.
o The goal of the agent in Q-learning is to maximize the value of Q.
o The value used in Q-learning can be derived from the Bellman equation. Consider the Bellman equation given below:

V(s) = max_a [ R(s, a) + γ Σ_s' P(s' | s, a) V(s') ]

In the equation we have various components: the reward R(s, a), the discount factor (γ), the transition probability P(s' | s, a), and the next states s'. No Q-value appears yet, so first consider the following situation.
Suppose the agent has three value options, V(s1), V(s2), and V(s3). Since this is an MDP, the agent only cares about the current state and the future state. The agent can move in any direction (up, left, or right), so it needs to decide where to go for the optimal path. The agent moves on a probabilistic basis and changes its state, but if we want exact moves we need to reformulate things in terms of a Q-value, as follows.

Q represents the quality of the actions at each state. So instead of using a value at each state, we use a pair of state and action, i.e., Q(s, a). The Q-value specifies which action is more lucrative than the others, and according to the best Q-value the agent takes its next move. The Bellman equation can be used to derive the Q-value. On performing an action the agent receives a reward R(s, a) and ends up in some state s', so the Q-value equation becomes:

Q(s, a) = R(s, a) + γ Σ_s' P(s' | s, a) max_a' Q(s', a')

Hence, we can say that V(s) = max_a [Q(s, a)].

The above formula is used to estimate the Q-values in Q-learning.
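For concreteness, the following is a minimal tabular Q-learning sketch on a tiny chain environment. The environment, rewards, and hyperparameter values are illustrative assumptions, not part of the original text.

# Minimal tabular Q-learning sketch for a tiny chain environment (illustrative only).
import random
from collections import defaultdict

n_states, actions = 5, [0, 1]                 # action 0 = left, 1 = right
alpha, gamma, epsilon, episodes = 0.1, 0.9, 0.2, 500
Q = defaultdict(float)                        # Q[(state, action)], initialised to 0

def step(s, a):
    s2 = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
    r = 1.0 if s2 == n_states - 1 else 0.0    # reward only at the goal state
    return s2, r, s2 == n_states - 1

for _ in range(episodes):
    s, done = 0, False
    while not done:
        if random.random() < epsilon:
            a = random.choice(actions)                                # explore
        else:
            a = max(actions, key=lambda x: (Q[(s, x)], random.random()))  # exploit, random tie-break
        s2, r, done = step(s, a)
        # Q(s, a) <- Q(s, a) + alpha * [ R(s, a) + gamma * max_a' Q(s', a') - Q(s, a) ]
        best_next = max(Q[(s2, x)] for x in actions)
        Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
        s = s2

print({st: max(Q[(st, x)] for x in actions) for st in range(n_states)})  # V(s) = max_a Q(s, a)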

Non-Deterministic Algorithm
A non-deterministic algorithm can produce different outputs for the same input on different executions. Unlike a deterministic algorithm, which produces a single output for the same input even on different runs, a non-deterministic algorithm can travel various routes and arrive at different outcomes.

Non-deterministic algorithms are useful for finding approximate solutions when an exact solution is difficult or expensive to derive using a deterministic algorithm.

One example of a non-deterministic algorithm is the execution of concurrent algorithms with race conditions, which can exhibit different outputs on different runs. Unlike a deterministic algorithm, which travels a single path from input to output, a non-deterministic algorithm can take many paths, with some arriving at the same outputs and others arriving at different outputs. This feature is used mathematically in non-deterministic computation models such as the non-deterministic finite automaton.

A non-deterministic algorithm can be thought of as executing on a deterministic computer with an unlimited number of parallel processors. A non-deterministic algorithm usually has two phases plus an output step. The first phase is the guessing phase, which produces an arbitrary candidate solution (a guess) for the problem.

The second phase is the verifying phase, which returns true or false for the chosen candidate. There are many problems that can be conceptualized with the help of non-deterministic algorithms, including the unresolved problem of P vs NP in computing theory.
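As an illustration of the two phases, the sketch below applies guess-and-verify to a small subset-sum instance. A true non-deterministic machine would explore all guesses at once; this sketch merely samples guesses at random, and the items and target are invented for the example.

# Illustrative guess-and-verify sketch (subset sum), mimicking the two phases described above.
import random

def guess(items):
    """Guessing phase: pick an arbitrary subset of the items."""
    return [x for x in items if random.random() < 0.5]

def verify(subset, target):
    """Verifying phase: return True or False for the chosen candidate."""
    return sum(subset) == target

items, target = [3, 34, 4, 12, 5, 2], 9
for _ in range(1000):                   # different runs may succeed with different subsets
    candidate = guess(items)
    if verify(candidate, target):
        print("Found:", candidate)      # e.g. [4, 5] or [3, 4, 2]
        break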

Non-deterministic algorithms are used to solve problems that allow multiple outcomes. Every outcome the non-deterministic algorithm produces is valid, regardless of the choices made by the algorithm during execution.

Action (A):
A is the set of all possible moves the agent can make. An action is almost self-explanatory, but it should be noted that agents usually choose from a list of discrete, possible actions. In video games, the list might include running right or left, jumping high or low, crouching or standing still. In the stock markets, the list might include buying, selling or holding any one of an array of securities and their derivatives. When handling aerial drones, the alternatives would include many different velocities and accelerations in 3D space.

Reward (R):
A reward is the feedback by which we measure the success or failure of an agent’s
actions in a given state. For example, in a video game, when Mario touches a coin, he
wins points. From any given state, an agent sends output in the form of actions to the
environment, and the environment returns the agent’s new state (which resulted from
acting on the previous state) as well as rewards, if there are any. Rewards can be
immediate or delayed. They effectively evaluate the agent’s action.

Temporal difference learning


What is temporal difference learning?
Temporal Difference Learning (also known as TD Learning) is an unsupervised learning technique that is very commonly used in reinforcement learning to predict the total reward expected over the future. TD methods can, however, be used to predict other quantities as well. It is essentially a way to learn to predict a quantity that depends on the future values of a given signal. Temporal difference learning is a method used to compute the long-term utility of a pattern of behavior from a series of intermediate rewards.

There are also continuous-time temporal difference learning algorithms that have been
developed.
Essentially, TD Learning focuses on predicting a variable's future value in a sequence of states. Temporal difference learning was a major breakthrough in solving the problem of reward prediction. You could say that it employs a mathematical trick that allows it to replace complicated reasoning with a simple learning procedure that generates the very same results.

The trick is that rather than attempting to calculate the total future reward, temporal difference learning simply attempts to predict the combination of the immediate reward and its own reward prediction at the next moment in time. When the next moment comes and brings fresh information with it, the new prediction is compared with the expected one. If the two predictions differ, the TD algorithm calculates how different they are and uses this temporal difference to adjust the old prediction toward the new prediction.


The temporal difference algorithm always aims to bring the expected prediction and the new
prediction together, thus matching expectations with reality and gradually increasing the
accuracy of the entire chain of prediction.

Temporal Difference Learning aims to predict a combination of the immediate reward and its
own reward prediction at the next moment in time.

In TD Learning, the training signal for a prediction is a future prediction. This method is a
combination of the Monte Carlo (MC) method and the Dynamic Programming (DP) method.
Monte Carlo methods adjust their estimates only after the final outcome is known, whereas temporal difference methods adjust predictions to match later, more accurate predictions for the future, well before the final outcome is clear and known. This is essentially a type of bootstrapping.
Temporal difference learning got its name from the way it uses changes, or differences, in
predictions over successive time steps for the purpose of driving the learning process.

The prediction at any particular time step gets updated to bring it nearer to the prediction of
the same quantity at the next time step.
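Written out as code, the update described above looks roughly like the following sketch; the state names, reward, and parameter values are illustrative assumptions.

# TD(0) prediction sketch: nudge the old prediction toward reward + discounted next prediction.
def td_update(V, s, r, s_next, alpha=0.1, gamma=0.9):
    """One temporal-difference update of the value estimate for state s."""
    td_target = r + gamma * V[s_next]   # immediate reward + next prediction
    td_error = td_target - V[s]         # the "temporal difference"
    V[s] = V[s] + alpha * td_error      # adjust the old prediction toward the new one
    return V

V = {"A": 0.0, "B": 0.0}
V = td_update(V, s="A", r=1.0, s_next="B")  # after observing a transition A -> B with reward 1
print(V["A"])                                # 0.1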

These TD methods bear relations to the temporal difference model of animal learning.

What are the parameters used in temporal difference learning?
• Alpha (α): learning rate
It shows how much our estimates should be adjusted, based on the error.
This rate varies between 0 and 1.
• Gamma (γ): the discount rate
This indicates how much future rewards are valued. A larger discount rate
signifies that future rewards are valued to a greater extent. The discount
rate also varies between 0 and 1.
• e: the exploration rate, reflecting the exploration vs. exploitation trade-off.
The agent explores new options with probability e and stays with the current maximum with probability 1 - e. A larger e means more exploration is carried out during training (see the sketch below).
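A minimal sketch of how the parameter e is typically used for action selection; the action names and values here are invented for illustration.

# Epsilon-greedy action selection sketch: explore with probability e, exploit otherwise.
import random

def select_action(q_values, e=0.1):
    """q_values: dict mapping action -> estimated value."""
    if random.random() < e:
        return random.choice(list(q_values))   # explore: pick a random action
    return max(q_values, key=q_values.get)     # exploit: pick the current best action

print(select_action({"left": 0.2, "right": 0.7}, e=0.1))   # usually "right"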

What is Generalization?
The term ‘generalization’ refers to a model’s ability to adapt and react
appropriately to previously unseen, fresh data chosen from the same
distribution as the model’s initial input. In other words, generalization
assesses a model’s ability to process new data and generate accurate
predictions after being trained on a training set.

A model's ability to generalize is critical to its success. Over-training on the training data will prevent a model from generalizing: when new data is supplied, it will make inaccurate predictions. Even if the model is capable of making accurate predictions on the training data set, it will be rendered ineffective. This is referred to as overfitting. The contrary, underfitting, occurs when a model is trained with insufficient data. An under-fitted model fails to produce correct predictions even on the training data, which renders it just as ineffective as an overfitted one.

Reasoning about Generalization

Overfitting occurs when a network performs well on the training set but performs poorly in general. If the training set contains unintentional regularities, the network may overfit. Suppose the job is to categorize handwritten digits, for example; it is possible that all images of 9s in the training set have pixel number 122 on, while all other samples have it off. The network may elect to take advantage of this coincidental regularity, accurately identifying all of the training samples of 9s without having to learn the true regularities. The network will not generalize well if this property does not hold on the test set.

[Figure: training and generalization error as functions of training set size and model capacity]

Consider how the training and generalization error fluctuate as a function of the number of training examples and the number of parameters, in order to reason qualitatively about generalization. More training data should only aid generalization: the larger the training set, the more likely it is that any given test case has a closely related training example. Furthermore, as the training set grows larger, the number of accidental regularities decreases, forcing the network to focus on the real regularities.
As a result, generalization error should decrease as more training instances are added. Small training sets, on the other hand, are easier to memorize than big ones, so training error tends to grow as we add more examples. The two eventually meet as the training set grows in size. This is depicted qualitatively in the figure above.

Now consider the model's capacity. The more parameters we add, the easier it is to fit both the accidental and the real regularities of the training data. As a result, as we add more parameters, the training error decreases. The influence on generalization error is subtler. If the network has insufficient capacity, it generalizes poorly because it fails to detect the regularities (whether true or accidental) in the data. If it has too much capacity, it will memorize the training set and fail to generalize.

As a result, capacity has a non-monotonic influence on test error: it decreases and then increases. We would like to create network topologies that are powerful enough to learn the actual regularities in the training data, but not so powerful that they merely memorize the training set or exploit accidental regularities. This is depicted qualitatively in the figure above.

Measuring Generalization
In most scenarios we focus on the training set and try to optimize performance on it. But this is not the correct way to build a model: there is a lot of uncertainty (such as noise) in unseen data drawn from the same distribution as the training set. So we should also aim for a model that generalizes well to that unseen data.

Fortunately, there is a simple method for assessing a model's generalization performance. Simply put, we divide our data into three subsets:

• A training set: a collection of training examples on which the network is trained.
• A validation set: used to fine-tune hyperparameters such as the number of hidden units and the learning rate.
• A test set: designed to evaluate generalization performance.
The losses on these subsets are referred to as the training, validation, and test loss, respectively. It should be evident why we need distinct training and test sets: if we train on the test data, we have no way of knowing whether the model is genuinely generalizing or merely memorizing the training examples.
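A minimal sketch of such a split using scikit-learn; the 60/20/20 proportions and the placeholder data are assumptions for illustration.

# Minimal sketch of a train / validation / test split (assumed 60/20/20 proportions).
from sklearn.model_selection import train_test_split

X = list(range(100))                   # placeholder inputs
y = [i % 2 for i in X]                 # placeholder labels

# First carve off 20% for the test set, then split the rest into train and validation.
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=0)

print(len(X_train), len(X_val), len(X_test))   # 60 20 20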

There are other variations on this basic method, including what is known as cross-validation. These options are typically employed with tiny datasets, i.e. fewer than a few thousand examples. Most advanced machine learning applications involve datasets large enough to be divided into training, validation, and test sets.

Apart from all these techniques, there is one more called regularization. Regularization does not improve the algorithm's performance on the data set used to learn the model parameters (the feature weights); it may even reduce it slightly. It can, however, increase generalization performance, i.e., performance on new, previously unseen data, which is exactly what we want.
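As one common example of regularization, the sketch below compares plain linear regression with ridge (L2) regression; the data and the penalty strength are invented for illustration.

# Sketch of L2 regularization with ridge regression (alpha value chosen arbitrarily).
from sklearn.linear_model import Ridge, LinearRegression

X = [[0.0], [1.0], [2.0], [3.0]]
y = [0.1, 1.1, 1.9, 3.2]

plain = LinearRegression().fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)     # penalises large feature weights

print(plain.coef_, ridge.coef_)        # the regularized coefficient is shrunk toward zero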

Dynamic Programming
Dynamic programming is a technique that breaks a problem into sub-problems and saves their results for future use so that we do not need to compute them again. The fact that the subproblems are optimized in order to optimize the overall solution is known as the optimal substructure property. The main use of dynamic programming is to solve optimization problems, i.e., problems where we are trying to find the minimum or the maximum solution. Dynamic programming guarantees finding the optimal solution of a problem if such a solution exists.

The definition of dynamic programming says that it is a technique for solving a complex problem by first breaking it into a collection of simpler subproblems, solving each subproblem just once, and then storing their solutions to avoid repetitive computations.

Let's understand this approach through an example.

Consider an example of the Fibonacci series. The following series is the Fibonacci
series:

0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144, …


The numbers in the above series are not randomly chosen. Mathematically, we can write each of the terms using the formula below:

F(n) = F(n-1) + F(n-2),

with the base values F(0) = 0 and F(1) = 1. To calculate the other numbers, we follow the above relationship. For example, F(2) is the sum of F(0) and F(1), which is equal to 1.

How can we calculate F(20)?


The F(20) term is calculated using the recurrence above, by expanding it recursively into subproblems.

As we can observe, F(20) is calculated as the sum of F(19) and F(18). In the dynamic programming approach, we try to divide the problem into similar subproblems; here F(20) is divided into the similar subproblems F(19) and F(18). Recall that the definition of dynamic programming says that a similar subproblem should not be computed more than once. Yet in the naive recursive expansion the subproblems are computed repeatedly: F(18) is calculated twice, and similarly F(17) is also calculated twice. The technique of splitting into similar subproblems is still very useful, but we must be careful to store each result the first time we compute it; otherwise the repeated computation leads to a waste of resources.

In the above example, recomputing F(18) in the right subtree leads to tremendous usage of resources and decreases the overall performance.
The solution to the above problem is to save the computed results in an array. First, we calculate F(16) and F(17) and save their values in the array. F(18) is then calculated by summing the values of F(17) and F(16), which are already saved, and the computed value of F(18) is saved as well. F(19) is calculated as the sum of F(18) and F(17), whose values are already saved, and the result is stored in turn. Finally, the value of F(20) is calculated by adding the values of F(19) and F(18), both of which are already in the array, and the final computed value of F(20) is stored.
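The array-saving idea described above corresponds to top-down memoization; here is a minimal sketch for the Fibonacci example.

# Top-down (memoized) Fibonacci: each F(n) is computed once and then reused.
def fib(n, memo=None):
    if memo is None:
        memo = {0: 0, 1: 1}            # base values F(0) = 0, F(1) = 1
    if n not in memo:
        memo[n] = fib(n - 1, memo) + fib(n - 2, memo)   # F(n) = F(n-1) + F(n-2)
    return memo[n]

print(fib(20))   # 6765, with F(18), F(17), ... each computed only once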

How does the dynamic programming approach work?

The following are the steps that dynamic programming follows:

o It breaks down the complex problem into simpler subproblems.
o It finds the optimal solution to these sub-problems.
o It stores the results of the subproblems; this process of storing results is known as memoization.
o It reuses the stored results so that the same sub-problem is not calculated more than once.
o Finally, it calculates the result of the complex problem.

The above five steps are the basic steps of dynamic programming. Dynamic programming is applicable to problems that have the following properties: overlapping subproblems and optimal substructure. Here, optimal substructure means that the solution of the optimization problem can be obtained by simply combining the optimal solutions of all the subproblems.

In the case of dynamic programming, the space complexity is increased because we store the intermediate results, but the time complexity is decreased.

Approaches of dynamic programming


There are two approaches to dynamic programming:

o Top-down approach
o Bottom-up approach

Top-down approach
The top-down approach follows the memoization technique, while the bottom-up approach follows the tabulation method. Here memoization is equal to the sum of recursion and caching: recursion means calling the function itself, while caching means storing the intermediate results.

Advantages

o It is very easy to understand and implement.
o It solves the subproblems only when required.
o It is easy to debug.

Disadvantages

It uses the recursion technique, which occupies more memory in the call stack. Sometimes, when the recursion is too deep, a stack overflow condition will occur.

It occupies more memory, which degrades the overall performance.
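For contrast with the memoized (top-down) sketch shown earlier, the following is a minimal bottom-up (tabulation) sketch for the same Fibonacci example; it fills a table iteratively and avoids recursion, so it does not suffer from the stack problems noted above.

# Bottom-up (tabulation) Fibonacci: fill a table from the base cases upward, no recursion.
def fib_bottom_up(n):
    table = [0, 1] + [0] * (n - 1)     # table[i] will hold F(i)
    for i in range(2, n + 1):
        table[i] = table[i - 1] + table[i - 2]
    return table[n]

print(fib_bottom_up(20))   # 6765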
