ML Mod 5 SEM
• Input: The input should be an initial state from which the model will start.
• Output: There are many possible outputs, as there are a variety of solutions to a particular problem.
• Training: The training is based upon the input. The model will return a state, and the user will decide whether to reward or punish the model based on its output.
• The model continues to learn.
• The best solution is decided based on the maximum reward.
Types of Reinforcement: There are two types of Reinforcement:
1. Positive –
Positive reinforcement is defined as when an event, occurring because of a particular behavior, increases the strength and frequency of that behavior. In other words, it has a positive effect on the behavior.
Advantages of positive reinforcement:
• Maximizes performance
• Sustains change for a long period of time
Disadvantage:
• Too much reinforcement can lead to an overload of states, which can diminish the results
2. Negative –
Negative reinforcement is defined as the strengthening of a behavior because a negative condition is stopped or avoided.
Advantages of negative reinforcement:
• Increases behavior
• Provides defiance to a minimum standard of performance
Disadvantage:
• It only provides enough to meet the minimum behavior
Learning Task
Machine Learning is of two types: Supervised Learning and Unsupervised Learning.
Example: Suppose we have an image of different types of fruits. The task of our
supervised learning model is to identify the fruits and classify them accordingly. So to
identify the image in supervised learning, we will give the input data as well as output
for that, which means we will train the model by the shape, size, color, and taste of
each fruit. Once the training is completed, we will test the model by giving it a new set of fruits. The model will identify the fruit and predict the output using a suitable algorithm.
Example: To understand unsupervised learning, we will use the example given
above. So unlike supervised learning, here we will not provide any supervision to the
model. We will just provide the input dataset to the model and allow the model to find
the patterns from the data. With the help of a suitable algorithm, the model will train
itself and divide the fruits into different groups according to the most similar features
between them.
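To make the contrast concrete, here is a rough Python sketch (assuming scikit-learn is available); the fruit features (weight in grams and a color score) and the labels are made up for illustration, not taken from the text. The classifier is trained with labels (supervised), while the clustering model receives only the inputs (unsupervised).

from sklearn.tree import DecisionTreeClassifier
from sklearn.cluster import KMeans

# Hypothetical fruit features: [weight in grams, color score 0-1]
X = [[150, 0.9], [170, 0.8], [120, 0.3], [130, 0.2]]
y = ["apple", "apple", "lemon", "lemon"]  # labels, used only in supervised learning

# Supervised learning: the model is trained with both inputs and outputs (labels).
clf = DecisionTreeClassifier().fit(X, y)
print(clf.predict([[160, 0.85]]))  # predicts a label for a new, unseen fruit

# Unsupervised learning: only the inputs are given; the model groups similar fruits itself.
km = KMeans(n_clusters=2, n_init=10).fit(X)
print(km.labels_)  # cluster assignments, with no label names attached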
Q-Learning Explanation:
o Q-learning is a popular model-free reinforcement learning algorithm based on the
Bellman equation.
o The main objective of Q-learning is to learn a policy that can inform the agent what actions should be taken, and under what circumstances, in order to maximize the reward.
o It is an off-policy RL algorithm that attempts to find the best action to take in the current state.
o The goal of the agent in Q-learning is to maximize the value of Q.
o The Q-value can be derived from the Bellman equation. Consider the Bellman equation given below:

V(s) = max over a [ R(s, a) + γ Σ over s' P(s, a, s') V(s') ]

In the equation, we have various components, including the reward R(s, a), the discount factor (γ), the transition probability P(s, a, s'), and the next state s'. But no Q-value appears yet, so first consider the following situation:
Suppose the agent has three value options, V(s1), V(s2), and V(s3). Since this is an MDP, the agent only cares about the current state and the future state. The agent can move in any direction (up, left, or right), so it needs to decide where to go for the optimal path. Here the agent will move on a probability basis and change its state. But if we want exact moves, we need to make some changes in terms of the Q-value.
Q represents the quality of the action taken at each state. So instead of using a value at each state, we use a pair of state and action, i.e., Q(s, a). The Q-value specifies which action is more lucrative than the others, and according to the best Q-value, the agent takes its next move. The Bellman equation can be used for deriving the Q-value. Performing an action gives the agent a reward R(s, a) and moves it to a new state s', so the Q-value equation becomes:

Q(s, a) = R(s, a) + γ max over a' Q(s', a')
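The update above can be applied repeatedly from experience. The following Python sketch shows tabular Q-learning in its common incremental form, Q(s, a) <- Q(s, a) + α (R(s, a) + γ max Q(s', a') - Q(s, a)); the tiny four-state corridor environment, the learning rate α, and the exploration rate are assumptions made only for illustration, not part of the text.

import random

states = [0, 1, 2, 3]           # hypothetical corridor; state 3 is the goal
actions = ["left", "right"]
gamma, alpha, epsilon = 0.9, 0.5, 0.2

Q = {(s, a): 0.0 for s in states for a in actions}

def step(s, a):
    # Hypothetical environment: move left or right, reward 1 for reaching state 3.
    s_next = max(0, s - 1) if a == "left" else min(3, s + 1)
    reward = 1.0 if s_next == 3 else 0.0
    return s_next, reward

for episode in range(200):
    s = 0
    while s != 3:
        # Epsilon-greedy: mostly take the best-known action, sometimes explore.
        if random.random() < epsilon:
            a = random.choice(actions)
        else:
            a = max(actions, key=lambda act: Q[(s, act)])
        s_next, r = step(s, a)
        # Q-learning update toward R(s, a) + gamma * max over a' of Q(s', a')
        best_next = max(Q[(s_next, act)] for act in actions)
        Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
        s = s_next

print(Q)  # the learned Q-values favour moving right in every state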
Non-Deterministic Algorithm
A non-deterministic algorithm can produce different outputs for the same input on different executions. Unlike a deterministic algorithm, which produces the same single output for a given input on every run, a non-deterministic algorithm may follow different routes and arrive at different outcomes. Conceptually, such an algorithm has two phases: the first is the guessing phase, in which a candidate solution (such as a string) is chosen non-deterministically; the second is the verifying phase, which returns true or false for the chosen candidate. Many problems can be conceptualized with the help of non-deterministic algorithms, including the unresolved problem of P vs NP in computing theory.
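As a rough sketch of the two phases, the following Python snippet simulates a non-deterministic algorithm for a subset-sum question (a hypothetical example, not from the text): the guessing phase picks a candidate subset at random, and the verifying phase deterministically checks it. Different runs can return different valid subsets, for example [4, 5] or [3, 4, 2], illustrating different outputs for the same input.

import random

def verify(subset, target):
    # Deterministic verifying phase: is the guessed subset actually a solution?
    return sum(subset) == target

def nondeterministic_subset_sum(numbers, target, attempts=1000):
    # Simulated guessing phase: choose a candidate subset non-deterministically.
    for _ in range(attempts):
        guess = [x for x in numbers if random.random() < 0.5]
        if verify(guess, target):
            return guess
    return None

print(nondeterministic_subset_sum([3, 34, 4, 12, 5, 2], target=9))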
Action (A):
A is the set of all possible moves the agent can make. An action is almost self-
explanatory, but it should be noted that agents usually choose from a list of discrete,
possible actions. In video games, the list might include running right or left, jumping
high or low, crouching or standing still. In the stock markets, the list might include
buying, selling or holding any one of an array of securities and their derivatives. When
handling aerial drones, alternatives would include many different velocities and
accelerations in 3D space.
Reward (R):
A reward is the feedback by which we measure the success or failure of an agent’s
actions in a given state. For example, in a video game, when Mario touches a coin, he
wins points. From any given state, an agent sends output in the form of actions to the
environment, and the environment returns the agent’s new state (which resulted from
acting on the previous state) as well as rewards, if there are any. Rewards can be
immediate or delayed. They effectively evaluate the agent’s action.
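A minimal sketch of this state-action-reward loop, loosely inspired by the coin example; the environment, its states, and the reward values below are hypothetical and only illustrate how the agent sends actions and receives a new state plus a reward in return.

class CoinWorld:
    # Hypothetical environment: the agent walks along positions 0..4 and
    # receives a reward when it steps onto the position holding a coin.
    def __init__(self):
        self.position = 0
        self.coin_at = 3

    def step(self, action):
        # Apply the agent's action and return the environment's feedback.
        if action == "right":
            self.position += 1
        reward = 10 if self.position == self.coin_at else 0
        return self.position, reward

env = CoinWorld()
for _ in range(4):
    action = "right"                    # the agent's output to the environment
    state, reward = env.step(action)    # the new state and reward returned
    print("state:", state, "reward:", reward)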
Temporal Difference (TD) Learning
Essentially, TD learning focuses on predicting a variable's future value in a sequence of states. Temporal difference learning was a major breakthrough in solving the problem of reward prediction. You could say that it employs a mathematical trick that allows it to replace complicated reasoning with a simple learning procedure that can be used to generate the very same results. Continuous-time temporal difference learning algorithms have also been developed.
The trick is that rather than attempting to calculate the total future reward, temporal
difference learning just attempts to predict the combination of immediate reward and its own
reward prediction at the next moment in time. Now when the next moment comes and brings
fresh information with it, the new prediction is compared with the expected prediction. If
these two predictions are different from each other, the TD algorithm will calculate how
different the predictions are from each other and make use of this temporal difference to
adjust the old prediction toward the new prediction.
The temporal difference algorithm always aims to bring the expected prediction and the new
prediction together, thus matching expectations with reality and gradually increasing the
accuracy of the entire chain of prediction.
Temporal Difference Learning aims to predict a combination of the immediate reward and its
own reward prediction at the next moment in time.
In TD Learning, the training signal for a prediction is a future prediction. This method is a
combination of the Monte Carlo (MC) method and the Dynamic Programming (DP) method.
Monte Carlo methods adjust their estimates only after the final outcome is known, whereas temporal difference methods adjust predictions to match later, more accurate predictions for the future, well before the final outcome is clear and known. This is essentially a type of bootstrapping.
Temporal difference learning got its name from the way it uses changes, or differences, in
predictions over successive time steps for the purpose of driving the learning process.
The prediction at any particular time step gets updated to bring it nearer to the prediction of
the same quantity at the next time step.
These TD methods bear relations to the temporal difference model of animal learning.
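Written out, the TD(0) prediction rule updates a value estimate as V(s) <- V(s) + α (r + γ V(s') - V(s)), moving the old prediction toward the reward plus the next prediction. The Python sketch below applies this rule to a made-up five-state chain; the transition behaviour, reward, learning rate, and number of episodes are assumptions for illustration only.

import random

alpha, gamma = 0.1, 0.9
V = [0.0] * 5   # value prediction for each state; state 4 is terminal

for episode in range(500):
    s = 0
    while s != 4:
        # Hypothetical dynamics: drift right with probability 0.8, else left.
        s_next = min(4, s + 1) if random.random() < 0.8 else max(0, s - 1)
        r = 1.0 if s_next == 4 else 0.0
        # Temporal-difference update: move V(s) toward r + gamma * V(s_next)
        V[s] += alpha * (r + gamma * V[s_next] - V[s])
        s = s_next

print([round(v, 2) for v in V])  # predictions settle as expectations match reality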
What is Generalization?
The term ‘generalization’ refers to a model’s ability to adapt and react
appropriately to previously unseen, fresh data chosen from the same
distribution as the model’s initial input. In other words, generalization
assesses a model’s ability to process new data and generate accurate
predictions after being trained on a training set.
Measuring Generalization
In most scenarios, we generally focus on the training set, or try to optimize performance on the training set. But this is not the correct way to build a model: there is a lot of uncertainty (such as noise) in unseen data, even when it is taken from the same distribution as the training set. So we should also aim for a model that can generalize well to that unseen data.
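A common way to measure generalization is to hold out part of the data during training and compare performance on the seen and unseen portions. The sketch below assumes scikit-learn and uses a synthetic dataset; the particular model and settings are illustrative choices, not requirements.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Data drawn from one distribution, split into a training set and an unseen test set.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# The gap between training accuracy and test accuracy indicates how well
# the model generalizes to unseen data from the same distribution.
print("train accuracy:", model.score(X_train, y_train))
print("test accuracy:", model.score(X_test, y_test))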
Dynamic Programming
Dynamic programming is a technique that breaks a problem into sub-problems and saves their results for future use so that we do not need to compute them again. The fact that the subproblems are optimized in order to optimize the overall solution is known as the optimal substructure property. The main use of dynamic programming is to solve optimization problems, that is, problems where we are trying to find the minimum or the maximum solution. Dynamic programming guarantees finding the optimal solution of a problem if such a solution exists.
Consider the example of the Fibonacci series, which is defined by the relation

F(n) = F(n-1) + F(n-2)

with the base values F(0) = 0 and F(1) = 1. To calculate the other numbers, we follow the above relationship. For example, F(2) is the sum of F(0) and F(1), which is equal to 1.
In the recursion tree for F(20), F(20) is calculated as the sum of F(19) and F(18). In the dynamic programming approach, we try to divide the problem into similar subproblems; here, F(20) is divided into the similar subproblems F(19) and F(18). If we recall the definition of dynamic programming, it says that a similar subproblem should not be computed more than once. Yet in the plain recursive approach, subproblems are calculated repeatedly: F(18) is calculated two times, and similarly F(17) is also calculated twice. The technique is only useful if we are careful about storing results; if we are not particular about storing a result that we have already computed once, recomputing it can lead to a wastage of resources. In this example, recalculating F(18) in the right subtree leads to tremendous usage of resources and decreases the overall performance.
The solution to the above problem is to save the computed results in an array. First,
we calculate F(16) and F(17) and save their values in an array. The F(18) is calculated
by summing the values of F(17) and F(16), which are already saved in an array. The
computed value of F(18) is saved in an array. The value of F(19) is calculated using the
sum of F(18), and F(17), and their values are already saved in an array. The computed
value of F(19) is stored in an array. The value of F(20) can be calculated by adding the
values of F(19) and F(18), and the values of both F(19) and F(18) are stored in an array.
The final computed value of F(20) is stored in an array.
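A short Python sketch of this array-based procedure, computing each value once and reusing the stored results (the function name is only illustrative):

def fib_bottom_up(n):
    # Bottom-up dynamic programming: store each F(i) in an array and reuse it.
    table = [0] * (n + 1)
    if n >= 1:
        table[1] = 1
    for i in range(2, n + 1):
        table[i] = table[i - 1] + table[i - 2]  # each value is computed only once
    return table[n]

print(fib_bottom_up(20))  # 6765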
The above five steps are the basic steps of dynamic programming. Dynamic programming is applicable to problems that have the overlapping-subproblems and optimal-substructure properties. Here, optimal substructure means that the solution of an optimization problem can be obtained by simply combining the optimal solutions of all the subproblems. There are two approaches to dynamic programming:
o Top-down approach
o Bottom-up approach
Top-down approach
The top-down approach follows the memoization technique, while the bottom-up approach follows the tabulation method. Here, memoization is the combination of recursion and caching: recursion means the function calling itself, while caching means storing the intermediate results.
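A minimal Python sketch of the top-down approach, combining recursion with caching as described above (the function name is illustrative):

from functools import lru_cache

@lru_cache(maxsize=None)
def fib_memo(n):
    # Top-down dynamic programming: recursion plus a cache of intermediate results.
    if n < 2:
        return n                 # base cases F(0) = 0 and F(1) = 1
    return fib_memo(n - 1) + fib_memo(n - 2)

print(fib_memo(20))  # 6765, with each subproblem computed only once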
Advantages
Disadvantages
o It uses the recursion technique, which occupies more memory in the call stack.
o Sometimes, when the recursion is too deep, a stack overflow condition will occur.