ML Unit IV

Genetic Algorithms

A genetic algorithm is an adaptive heuristic search algorithm inspired by "Darwin's theory of evolution in Nature." It is used to solve optimization problems in machine learning, and it is an important technique because it helps solve complex problems that would otherwise take a very long time to solve.

Genetic algorithms are based on the ideas of natural selection and genetics. They are an intelligent exploitation of random search, provided with historical data, that directs the search into regions of better performance in the solution space. They are commonly used to generate high-quality solutions to optimization and search problems.

Genetic algorithms simulate the process of natural selection: those species that can adapt to changes in their environment survive, reproduce, and pass on to the next generation. In simple words, they simulate "survival of the fittest" among the individuals of consecutive generations to solve a problem. Each generation consists of a population of individuals, and each individual represents a point in the search space and a possible solution. Each individual is represented as a string of characters/integers/floats/bits. This string is analogous to the chromosome.

Genetic algorithms are widely used in real-world applications, for example designing electronic circuits, code-breaking, image processing, and artificial creativity.

Foundation of GAs
Genetic algorithms are based on an analogy with the genetic structure and behaviour of chromosomes in a population. The foundation of GAs rests on this analogy –

1. Individuals in the population compete for resources and mates.
2. Those individuals that are most successful (fittest) mate and create more offspring than others.
3. Genes from the "fittest" parents propagate throughout the generation; that is, sometimes parents create offspring that are better than either parent.
4. Thus each successive generation becomes better suited to its environment.

A genetic algorithm basically involves five phases for solving complex optimization problems, given below:

o Initialization
o Fitness Assignment
o Selection
o Reproduction
o Termination

1. Initialization
The process of a genetic algorithm starts by generating a set of individuals, called the population. Each individual is a candidate solution to the given problem and is characterized by a set of parameters called genes. Genes are joined into a string to form a chromosome, which encodes the solution to the problem. One of the most popular techniques for initialization is the use of random binary strings.

2) Fitness Score

A fitness score is given to each individual, indicating its ability to "compete". Individuals with optimal (or near-optimal) fitness scores are sought.

The GA maintains a population of n individuals (chromosomes/solutions) along with their fitness scores. Individuals with better fitness scores are given more chance to reproduce than others: they are selected to mate and produce better offspring by combining the chromosomes of the parents. Since the population size is static, room has to be created for the new arrivals, so some individuals die and are replaced by the newcomers, eventually creating a new generation once all the mating opportunities of the old population are exhausted. The hope is that, over successive generations, better solutions arrive while the least fit die out.

Each new generation has, on average, more "good genes" than the individuals (solutions) of previous generations; thus each new generation has better "partial solutions" than the previous ones. Once the offspring produced show no significant difference from the offspring produced by previous populations, the population has converged, and the algorithm is said to have converged to a set of solutions for the problem.

3) Selection

The selection phase involves selecting individuals for the reproduction of offspring. The selected individuals are arranged in pairs of two, and these individuals transfer their genes to the next generation.
4) Reproduction

Once the initial generation is created, the algorithm evolves it using the following operators –

1) Selection Operator: The idea is to give preference to the individuals with good
fitness scores and allow them to pass their genes to successive generations.

2) Crossover Operator: This represents mating between individuals. Two individuals are selected using the selection operator, and crossover sites are chosen randomly. The genes at these crossover sites are then exchanged, creating a completely new individual (offspring).

3) Mutation Operator: The key idea is to insert random genes into the offspring to maintain diversity in the population and avoid premature convergence.

The whole algorithm can be summarized as –


1) Randomly initialize the population p
2) Determine fitness of population
3) Until convergence repeat:
a) Select parents from population
b) Crossover and generate new population
c) Perform mutation on new population
d) Calculate fitness for new population
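To make the loop concrete, here is a minimal Python sketch of this summary for individuals encoded as fixed-length bit strings. The helper names (random_individual, fitness, select, crossover, mutate) and the parameter values are illustrative assumptions, not part of any standard library, and the toy fitness function (count of 1-bits) stands in for a real objective.

import random

POP_SIZE, GENOME_LEN, MUT_RATE, GENERATIONS = 20, 16, 0.05, 100

def random_individual():
    return [random.randint(0, 1) for _ in range(GENOME_LEN)]

def fitness(ind):
    # Toy fitness: number of 1-bits (replace with the problem's own measure).
    return sum(ind)

def select(population):
    # Fitness-proportionate (roulette wheel) selection of two parents.
    weights = [fitness(i) + 1 for i in population]   # +1 avoids an all-zero wheel
    return random.choices(population, weights=weights, k=2)

def crossover(p1, p2):
    site = random.randint(1, GENOME_LEN - 1)         # random crossover site
    return p1[:site] + p2[site:]

def mutate(ind):
    return [1 - g if random.random() < MUT_RATE else g for g in ind]

population = [random_individual() for _ in range(POP_SIZE)]   # 1) initialize
for _ in range(GENERATIONS):                                  # 3) repeat until done
    parents = [select(population) for _ in range(POP_SIZE)]   # a) select parents
    population = [mutate(crossover(p1, p2))                   # b) crossover, c) mutation
                  for p1, p2 in parents]
best = max(population, key=fitness)                           # d) evaluate fitness
print(best, fitness(best))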

5. Termination

After the reproduction phase, a stopping criterion is applied to decide when to terminate. The algorithm terminates once a solution reaching the threshold fitness is found, and the best solution in the population is reported as the final solution.
Example problem and solution using Genetic Algorithms
Given a target string, the goal is to produce the target string starting from a random string of the same length. In the following implementation, the following analogies are made –
 Characters A-Z, a-z, 0-9, and other special symbols are considered genes
 A string generated from these characters is considered a chromosome/solution/individual

The fitness score is the number of characters that differ from the characters of the target string at the corresponding index, so an individual with a lower fitness value is given more preference.
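A hedged sketch of the fitness computation for this target-string setup is shown below (Python); the gene pool, the target string, and the mutation rate are illustrative assumptions.

import random

GENES = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789 .,!?"
TARGET = "Hello World"          # illustrative target string

def random_chromosome():
    # A random string of the same length as the target.
    return "".join(random.choice(GENES) for _ in range(len(TARGET)))

def fitness(chromosome):
    # Number of positions where the chromosome differs from the target;
    # lower is better, and 0 means the target has been reproduced exactly.
    return sum(1 for c, t in zip(chromosome, TARGET) if c != t)

def mutate(chromosome, rate=0.1):
    # Replace each character with a random gene with probability `rate`.
    return "".join(random.choice(GENES) if random.random() < rate else c
                   for c in chromosome)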

Why use Genetic Algorithms

 They are robust.
 They provide optimisation over large search spaces.
 Unlike traditional AI systems, they do not break on slight changes in the input or in the presence of noise.

Application of Genetic Algorithms


Genetic algorithms have many applications, some of them are –
 Recurrent Neural Network
 Mutation testing
 Code breaking
 Filtering and signal processing
 Learning fuzzy rule bases, etc.

Limitations of Genetic Algorithms


o Genetic algorithms are not efficient for solving simple problems.
o They do not guarantee the quality of the final solution to a problem.
o Repetitive calculation of fitness values may create computational challenges.
Genetic Algorithms: An Illustrative Example
Let us understand genetic algorithms better through an example. We will solve a simple optimization problem step by step to understand how the algorithm works.
Suppose we want to find, using a genetic algorithm, the optimal values of a and b that satisfy the expression:
2a^2 + b = 57.

To compare candidate solutions, we rewrite the expression as an objective function:

f(a, b) = 2a^2 + b - 57.

From our prior knowledge, we know that the value of this function should be zero. This function is called the objective function and is used to estimate values of a and b such that its value decreases to zero.
To begin with, we guess initial sets of values for a and b, which may or may not include the optimal values. Each candidate pair of a and b, [a, b], is a chromosome, and generating these candidates is called "population initialization"; the collection of chromosomes is referred to as the population. For this optimization problem, we take six pairs of a and b values generated between 1 and 10.
The next step, step 2, is to compute the value of the objective function for each chromosome; this stage is known as "selection", because we select the fittest chromosomes from the population for the subsequent operations.

The fitness values are used to discard the chromosomes with poor fitness, so that the fitter ones survive into succeeding generations.
The following selection method is widely used:
Roulette wheel method
The roulette wheel can be pictured as a pie plot in which the size of each slice is the chromosome's fitness probability. For this minimization problem, chromosomes that produce a low objective (fitness) value are given a high fitness probability; the two quantities are inversely related.
The chromosomes that have a higher fitness probability have a greater chance of being selected.
Note that the sum of all fitness probabilities always equals one.
Let us assume a scenario where we choose six random probabilities to generate six
values between 0 and 1, let us say the six probabilities are as follows:
For chromosome 1, Probability1 = 0.02.
For chromosome 2, Probability2 = 0.13.
For chromosome 3, Probability3 = 0.40.
For chromosome 4, Probability4 = 0.60.
For chromosome 5, Probability5 = 0.85.
For chromosome 6, Probability6 = 0.96.

These probability values are placed on the roulette wheel as cumulative fitness probabilities, and each segment of the wheel represents the corresponding chromosome. A chromosome is selected by drawing a random number between 0 and 1 and picking the segment into which it falls.
The third step is known as crossover. In this step, the chromosomes are expressed in terms of genes: we convert the values of a and b into bit strings.
One parameter we need to keep in mind here is the crossover rate; it decides which of the six chromosomes will be able to produce offspring.
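The selection and crossover steps just described can be sketched as follows; this is a minimal, illustrative Python version for the f(a, b) = 2a^2 + b - 57 problem, in which the inverse-error fitness probability and the 4-bit-per-variable encoding are assumptions made only for the example.

import random

def objective(a, b):
    return 2 * a ** 2 + b - 57

def fitness_probabilities(population):
    # Lower |f(a, b)| is better, so use 1 / (1 + |f|) and normalise to sum to 1.
    scores = [1.0 / (1.0 + abs(objective(a, b))) for a, b in population]
    total = sum(scores)
    return [s / total for s in scores]

def roulette_select(population, probs):
    # Spin the wheel: pick the segment the random draw falls into.
    r, cumulative = random.random(), 0.0
    for individual, p in zip(population, probs):
        cumulative += p
        if r <= cumulative:
            return individual
    return population[-1]

def crossover(parent1, parent2):
    # Encode each (a, b) chromosome as an 8-bit string and swap tails.
    def encode(a, b):
        return format(a, "04b") + format(b, "04b")
    def decode(bits):
        return int(bits[:4], 2), int(bits[4:], 2)
    bits1, bits2 = encode(*parent1), encode(*parent2)
    site = random.randint(1, 7)                     # random crossover site
    return decode(bits1[:site] + bits2[site:])

population = [(random.randint(1, 10), random.randint(1, 10)) for _ in range(6)]
probs = fitness_probabilities(population)
child = crossover(roulette_select(population, probs), roulette_select(population, probs))
print(child, objective(*child))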
Hypotheses Space Search
As seen in our illustrative example, genetic algorithms employ a randomized beam search method to seek maximally fit hypotheses. Comparing hypothesis space search methods, the gradient descent search used in back-propagation moves smoothly from one hypothesis to another, whereas the genetic algorithm search can move much more abruptly: it replaces a parent hypothesis with an offspring that may be very different from the parent. For this reason, genetic algorithm search is less likely to fall into the same kind of local minima that plague gradient descent methods.
One practical difficulty often encountered in genetic algorithms is crowding. Crowding is the phenomenon in which individuals that are fitter than others reproduce quickly, so that copies of these individuals and their close relatives take over a large fraction of the population, reducing diversity. Most of the strategies used to reduce crowding are inspired by biological evolution. One such strategy is fitness sharing, in which the measured fitness of an individual is decreased by the presence of other, similar individuals in the population. Another is to restrict which individuals are allowed to recombine to form offspring: by allowing only the most similar individuals to recombine, clusters of similar individuals (multiple subspecies) form within the population. Yet another method is to spatially distribute individuals and allow only nearby individuals to combine.

Population evolution and schema theorem.


Holland's schema theorem is used to mathematically characterize the evolution of the population over time. It is based on the concept of a schema. So, what is a schema? A schema is any string composed of 0s, 1s, and *s, where * is a wildcard ("don't care") symbol that matches either 0 or 1; thus the schema 0*10 matches both 0010 and 0110. The schema theorem characterizes the evolution within a genetic algorithm in terms of the number of instances representing each schema. Let m(s, t) denote the number of instances of schema s in the population at time t; the schema theorem describes the expected value of m(s, t+1) in terms of m(s, t) and the other properties of the schema, the population, and the GA parameters.
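For reference, the schema theorem is commonly stated as the following lower bound on the expected number of instances of a schema in the next generation; the notation below follows the usual textbook presentation and is supplied here as an assumption, since the original expression is not reproduced in the text above.

E[m(s, t+1)] >= m(s, t) * ( u(s, t) / f_avg(t) ) * [ 1 - p_c * d(s) / (l - 1) ] * (1 - p_m)^o(s)

where u(s, t) is the average fitness of the instances of schema s at time t, f_avg(t) is the average fitness of the whole population, p_c and p_m are the crossover and mutation probabilities, d(s) is the defining length of s, o(s) is its order (the number of non-* positions), and l is the length of the bit strings.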

In a genetic algorithm, the evolution of the population depends on the selection step, the recombination (crossover) step, and the mutation step. The schema theorem is one of the most widely used results for characterizing population evolution within a genetic algorithm. However, because it accounts only for the disruptive effects of crossover and mutation and fails to consider their positive effects, it is in a way incomplete. More recent theoretical analyses have been proposed, many of which are based on models such as Markov chain models and the statistical-mechanics model.

Genetic programming
In artificial intelligence, genetic programming (GP) is a technique for evolving programs that are fit for a particular task, starting from a population of unfit (usually random) programs, by applying operations analogous to natural genetic processes to the population of programs.
The operations are: selection of the fittest programs for reproduction (crossover)
and mutation according to a predefined fitness measure, usually proficiency at the
desired task. The crossover operation involves swapping random parts of selected
pairs (parents) to produce new and different offspring that become part of the new
generation of programs. Mutation involves substitution of some random part of a
program with some other random part of a program. Some programs not selected
for reproduction are copied from the current generation to the new generation.
Then the selection and other operations are recursively applied to the new
generation of programs.
Typically, members of each new generation are on average more fit than the
members of the previous generation, and the best-of-generation program is often
better than the best-of-generation programs from previous generations.
Termination of the evolution usually occurs when some individual program reaches
a predefined proficiency or fitness level.
It may and often does happen that a particular run of the algorithm results in
premature convergence to some local maximum which is not a globally optimal or
even good solution. Multiple runs (dozens to hundreds) are usually necessary to
produce a very good result. It may also be necessary to have a large starting
population size and variability of the individuals to avoid pathologies.
GP evolves computer programs, traditionally represented in memory as tree
structures. Trees can be easily evaluated in a recursive manner. Every tree node
has an operator function and every terminal node has an operand, making
mathematical expressions easy to evolve and evaluate. Thus traditionally GP
favors the use of programming languages that naturally embody tree structures (for
example, Lisp; other functional programming languages are also suitable).
Non-tree representations have been suggested and successfully implemented, such
as linear genetic programming which suits the more traditional imperative
languages. The commercial GP software Discipulus uses automatic induction
of binary machine code ("AIM") to achieve better performance. µGP uses directed
multigraphs to generate programs that fully exploit the syntax of a given assembly
language. Multi expression programming uses Three-address code for encoding
solutions. Other program representations on which significant research and
development have been conducted include programs for stack-based virtual
machines, and sequences of integers that are mapped to arbitrary programming
languages via grammars. Cartesian genetic programming is another form of GP,
which uses a graph representation instead of the usual tree based representation to
encode computer programs.
Most representations have structurally non-effective code (introns). Such non-
coding genes may seem to be useless because they have no effect on the
performance of any one individual. However, they alter the probabilities of
generating different offspring under the variation operators, and thus alter the
individual's variational properties. Experiments seem to show faster convergence
when using program representations that allow such non-coding genes, compared
to program representations that do not have any non-coding genes.

Crossover

In genetic programming, two fit individuals are chosen from the population to be parents for one or two children. In tree genetic programming, these parents are represented as inverted Lisp-like trees, with their root nodes at the top. In subtree crossover, a subtree is randomly chosen in each parent. In the root-donating parent the chosen subtree is removed and replaced with a copy of the randomly chosen subtree from the other parent, to give a new child tree.
Sometimes two-child crossover is used, in which case the removed subtree is not simply deleted but is copied into a copy of the second parent, replacing (in that copy) its randomly chosen subtree. Thus this type of subtree crossover takes two fit trees and generates two child trees.

Sequential Covering Algorithms


Sequential covering is a popular rule-based classification algorithm used for learning a disjunctive set of rules. The basic idea is to learn one rule, remove the data that it covers, and then repeat the same process; in this way it covers all the rules in a sequential manner during the training phase.
The sequential covering algorithm addresses, to some extent, the low-coverage problem of the Learn-One-Rule algorithm by learning the rules one after another over the remaining data.
Working on the Algorithm:
The algorithm involves a set of ‘ordered rules’ or ‘list of decisions’ to be made.
Step 1 – create an empty decision list, ‘R’.
Step 2 – ‘Learn-One-Rule’ Algorithm
It extracts the best rule for a particular class 'y', where a rule takes the general form IF <conditions> THEN class = y. (Fig. 2: general form of a rule)

In the beginning,
Step 2.a – if all training examples ∈ class 'y', then each is treated as a positive example.
Step 2.b – else, if training examples ∉ class 'y', they are treated as negative examples.
Step 3 – The rule becomes 'desirable' when it covers a majority of the positive examples.
Step 4 – When this rule is obtained, delete all the training data covered by that rule (i.e. when the rule is applied to the dataset, the examples it covers are removed from further consideration).
Step 5 – The new rule is added to the bottom of the decision list, 'R'. (Fig. 3)

Below is a visual representation describing the working of the algorithm.

 Let us understand step by step how the algorithm works in the example shown in Fig. 4.
 First, we create an empty decision list. In step 1, we see that there are three sets of positive examples present in the dataset. So, as per the algorithm, we consider the one with the maximum number of positive examples. Once we cover these 6 positive examples, we get our first rule R1, which is then pushed into the decision list, and those positive examples are removed from the dataset.
 Next, we take the next-largest group of positive examples and follow the same process until we get rule R2 (and likewise for R3).
 In the end, we obtain our final decision list with all the desirable rules.
Sequential covering is a powerful algorithm for generating rule-based classifiers in machine learning. It uses the 'Learn-One-Rule' algorithm as its base to learn a sequence of disjunctive rules. A minimal sketch of the overall loop is given below.
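As promised, a minimal Python-style sketch of the sequential covering loop follows; learn_one_rule, performance, and covers are hypothetical helpers assumed for illustration.

def sequential_covering(target_class, examples, threshold):
    # Learn a disjunctive set of rules, one rule at a time.
    decision_list = []
    rule = learn_one_rule(target_class, examples)                # best single rule
    while performance(rule, examples) > threshold:
        decision_list.append(rule)                               # push rule into list R
        examples = [e for e in examples if not covers(rule, e)]  # drop covered data
        rule = learn_one_rule(target_class, examples)            # learn the next rule
    return decision_list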

First-Order Inductive Learner (FOIL) Algorithm

First-Order Logic:
All expressions in first-order logic are composed of the following attributes:
1. constants — e.g. tyler, 23, a
2. variables — e.g. A, B, C
3. predicate symbols — e.g. male, father (True or False values only)
4. function symbols — e.g. age (can take on any constant as a value)
5. connectives — e.g. ∧, ∨, ¬, →, ←
6. quantifiers — e.g. ∀, ∃
Term: It can be defined as any constant, variable or function applied to any term.
e.g. age(bob)
Literal: It can be defined as any predicate or negated predicate applied to any
terms. e.g. female(sue), father(X, Y)

First Order Inductive Learner (FOIL)


In machine learning, first-order inductive learner (FOIL) is a rule-based learning algorithm. It is a natural extension of the SEQUENTIAL-COVERING and LEARN-ONE-RULE algorithms, and it follows a greedy approach.
Inductive Learning:
Inductive learning means analyzing and understanding the evidence and then using it to determine the outcome. It is based on inductive logic.

Fig 1: Inductive Logic

Algorithm Involved
FOIL(Target_predicate, Predicates, Examples)

• Pos ← positive Examples
• Neg ← negative Examples
• Learned_rules ← {}
• while Pos is not empty, do
    // Learn a NewRule
    – NewRule ← the rule that predicts Target_predicate with no preconditions
    – NewRuleNeg ← Neg
    – while NewRuleNeg is not empty, do
        // Add a new literal to specialize NewRule
        1. Candidate_literals ← generate candidate literals for NewRule, based on Predicates
        2. Best_literal ← argmax over L in Candidate_literals of Foil_Gain(L, NewRule)
        3. Add Best_literal to the preconditions of NewRule
        4. NewRuleNeg ← subset of NewRuleNeg that satisfies the preconditions of NewRule
    – Learned_rules ← Learned_rules + NewRule
    – Pos ← Pos − {members of Pos covered by NewRule}
• Return Learned_rules


Working of the Algorithm:
In the algorithm, the inner loop is used to generate a new best rule. Let us consider an
example and understand the step-by-step working of the algorithm.

Fig 2 : FOIL Example

Say we are trying to learn the target predicate GrandDaughter(x,y).


We perform the following steps: [Refer Fig 2]

Step 1 - NewRule = GrandDaughter(x,y)

Step 2 -
2.a - Generate the candidate literals:
(Female(x), Female(y), Father(x,y), Father(y,x),
Father(x,z), Father(z,x), Father(y,z), Father(z,y))

2.b - Generate the respective candidate literal negations:
(¬Female(x), ¬Female(y), ¬Father(x,y), ¬Father(y,x),
¬Father(x,z), ¬Father(z,x), ¬Father(y,z), ¬Father(z,y))

Step 3 - FOIL might greedily select Father(y,z) as the most promising literal, so that
NewRule = GrandDaughter(x,y) ← Father(y,z) [greedy approach]

Step 4 - FOIL now considers all the literals from the previous step as well as:
(Female(z), Father(z,w), Father(w,z), etc.) and their negations.

Step 5 - FOIL might select Father(z,x), and on the next step Female(y), leading to
NewRule = GrandDaughter(x,y) ← Father(y,z) ∧ Father(z,x) ∧ Female(y)

Step 6 - If at this point the rule covers only positive examples, the search for further specialization terminates.

FOIL now removes all positive examples covered by this new rule.
If positive examples remain, the outer while loop continues.
FOIL: Performance Evaluation Measure
The performance of a new rule is not measured by its entropy (as in the PERFORMANCE method of the Learn-One-Rule algorithm). Instead, FOIL uses a gain measure to decide which specialized rule to adopt. Each candidate literal's utility is estimated by the reduction in the number of bits required to encode the classification of the positive bindings:

Foil_Gain(L, R) = t * ( log2( p1 / (p1 + n1) ) - log2( p0 / (p0 + n0) ) )

where,
L is the candidate literal to add to rule R
p0 = number of positive bindings of R
n0 = number of negative bindings of R
p1 = number of positive bindings of R + L
n1 = number of negative bindings of R + L
t = number of positive bindings of R also covered by R + L
The FOIL algorithm is thus another rule-based learning algorithm that extends the Sequential Covering and Learn-One-Rule algorithms and uses a different performance metric (other than entropy/information gain) to determine the best possible rule.
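A small sketch of this gain computation is shown below (Python); it is a direct transcription of the formula above, and the example numbers are made up.

from math import log2

def foil_gain(p0, n0, p1, n1, t):
    # Information-based gain of adding literal L to rule R:
    # p0, n0 - positive/negative bindings of R
    # p1, n1 - positive/negative bindings of R + L
    # t      - positive bindings of R still covered by R + L
    return t * (log2(p1 / (p1 + n1)) - log2(p0 / (p0 + n0)))

# Example: a literal that keeps 6 of 8 positive bindings
# while cutting the negative bindings from 10 down to 2.
print(foil_gain(p0=8, n0=10, p1=6, n1=2, t=6))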

Inverting Resolution

Resolution is a theorem-proving technique that proceeds by building refutation proofs, i.e., proofs by contradiction. It was invented by the mathematician John Alan Robinson in 1965.

Resolution is used when various statements are given and we need to prove a conclusion from those statements. Unification is a key concept in proofs by resolution. Resolution is a single inference rule which can efficiently operate on the conjunctive normal form or clausal form.

Clause: A disjunction of literals (atomic sentences) is called a clause. A clause containing a single literal is known as a unit clause.

Conjunctive Normal Form: A sentence represented as a conjunction of clauses is said to be in conjunctive normal form or CNF.

The resolution inference rule:
The resolution rule for first-order logic is simply a lifted version of the propositional rule. Resolution can resolve two clauses if they contain complementary literals, which are assumed to be standardized apart so that they share no variables:

    l1 ∨ ... ∨ lk,        m1 ∨ ... ∨ mn
    -----------------------------------------------------------------------------------
    SUBST(θ, l1 ∨ ... ∨ l(i-1) ∨ l(i+1) ∨ ... ∨ lk ∨ m1 ∨ ... ∨ m(j-1) ∨ m(j+1) ∨ ... ∨ mn)

where li and mj are complementary literals, i.e. UNIFY(li, ¬mj) = θ.

This rule is also called the binary resolution rule because it only resolves exactly two literals.

Example:
We can resolve two clauses which are given below:

[Animal(g(x)) V Loves(f(x), x)] and [¬ Loves(a, b) V ¬ Kills(a, b)]

where the two complementary literals are: Loves(f(x), x) and ¬ Loves(a, b)

These literals can be unified with the unifier θ = [a/f(x), b/x], and this generates the resolvent clause:

[Animal(g(x)) V ¬ Kills(f(x), x)].

Steps for Resolution:


1. Conversion of facts into first-order logic.
2. Convert FOL statements into CNF
3. Negate the statement to be proved (proof by contradiction).
4. Draw resolution graph (unification).

To better understand all the above steps, we will take an example in which we will apply
resolution.

Example:

a. John likes all kinds of food.
b. Apples and vegetables are food.
c. Anything anyone eats and is not killed by is food.
d. Anil eats peanuts and is still alive.
e. Harry eats everything that Anil eats.
Prove by resolution that:
f. John likes peanuts.

Step-1: Conversion of Facts into FOL

In the first step, we convert all the given statements into first-order logic.

Step-2: Conversion of FOL into CNF

In first-order logic resolution, it is required to convert the FOL statements into CNF, as the CNF form makes resolution proofs easier.

o Eliminate all implication (→) and rewrite

a. ∀x ¬ food(x) V likes(John, x)
b. food(Apple) Λ food(vegetables)
c. ∀x ∀y ¬ [eats(x, y) Λ ¬ killed(x)] V food(y)
d. eats (Anil, Peanuts) Λ alive(Anil)
e. ∀x ¬ eats(Anil, x) V eats(Harry, x)
f. ∀x¬ [¬ killed(x) ] V alive(x)
g. ∀x ¬ alive(x) V ¬ killed(x)
h. likes(John, Peanuts).
o Move negation (¬)inwards and rewrite

a. ∀x ¬ food(x) V likes(John, x)
b. food(Apple) Λ food(vegetables)
c. ∀x ∀y ¬ eats(x, y) V killed(x) V food(y)
d. eats (Anil, Peanuts) Λ alive(Anil)
e. ∀x ¬ eats(Anil, x) V eats(Harry, x)
f. ∀x killed(x) V alive(x)
g. ∀x ¬ alive(x) V ¬ killed(x)
h. likes(John, Peanuts).
o Rename variables or standardize variables

a. ∀x ¬ food(x) V likes(John, x)
b. food(Apple) Λ food(vegetables)
c. ∀y ∀z ¬ eats(y, z) V killed(y) V food(z)
d. eats (Anil, Peanuts) Λ alive(Anil)
e. ∀w¬ eats(Anil, w) V eats(Harry, w)
f. ∀g killed(g) V alive(g)
g. ∀k ¬ alive(k) V ¬ killed(k)
h. likes(John, Peanuts).
o Eliminate the existential quantifier (∃).
In this step we would eliminate any existential quantifiers; this process is known as Skolemization. In this example problem there is no existential quantifier, so all the statements remain the same in this step.
o Drop universal quantifiers.
In this step we drop all universal quantifiers, since all the statements are implicitly universally quantified and the quantifiers do not need to be written explicitly.

a) ¬ food(x) V likes(John, x)
b) food(Apple)
c) food(vegetables)
d) ¬ eats(y, z) V killed(y) V food(z)
e) eats(Anil, Peanuts)
f) alive(Anil)
g) ¬ eats(Anil, w) V eats(Harry, w)
h) killed(g) V alive(g)
i) ¬ alive(k) V ¬ killed(k)
j) likes(John, Peanuts)

o Distribute conjunction ∧ over disjunction V.
This step will not make any change in this problem.

Step-3: Negate the statement to be proved

In this statement, we will apply negation to the conclusion statements, which will be written as
¬likes(John, Peanuts)

Step-4: Draw Resolution graph:

Now in this step, we solve the problem using a resolution tree with substitutions. For the above set of clauses, the resolution proceeds as follows:

Hence the negation of the conclusion leads to a contradiction with the given set of statements, which completes the proof.

Explanation of Resolution graph:


o In the first step of resolution graph, ¬likes(John, Peanuts) , and likes(John, x) get
resolved(canceled) by substitution of {Peanuts/x}, and we are left with ¬ food(Peanuts)
o In the second step of the resolution graph, ¬ food(Peanuts) , and food(z) get resolved (canceled)
by substitution of { Peanuts/z}, and we are left with ¬ eats(y, Peanuts) V killed(y) .
o In the third step of the resolution graph, ¬ eats(y, Peanuts) and eats (Anil, Peanuts) get resolved
by substitution {Anil/y}, and we are left with Killed(Anil) .
o In the fourth step of the resolution graph, Killed(Anil) and ¬ killed(k) get resolved by
substitution {Anil/k}, and we are left with ¬ alive(Anil) .
o In the last step of the resolution graph ¬ alive(Anil) and alive(Anil) get resolved.
Reinforcement learning
Reinforcement learning is an area of machine learning concerned with taking suitable actions to maximize reward in a particular situation. It is employed by various software systems and machines to find the best possible behaviour or path to take in a specific situation. Reinforcement learning differs from supervised learning in that, in supervised learning, the training data carries the answer key, so the model is trained with the correct answers; in reinforcement learning there is no answer key, and the reinforcement agent decides what to do to perform the given task. In the absence of a training dataset, it is bound to learn from its own experience.
Example: The problem is as follows. We have an agent and a reward, with many hurdles in between. The agent is supposed to find the best possible path to reach the reward. The following example illustrates the problem.

The image above shows a robot, a diamond, and fire. The goal of the robot is to get the reward, the diamond, while avoiding the hurdles, the fire. The robot learns by trying all the possible paths and then choosing the path that reaches the reward with the fewest hurdles. Each right step gives the robot a reward, and each wrong step subtracts from the robot's reward. The total reward is calculated when it reaches the final reward, the diamond.
Main points in Reinforcement learning –

Input: The input is an initial state from which the model starts.
Output: There are many possible outputs, as there are a variety of solutions to a particular problem.
Training: The training is based on the input; the model returns a state and the user decides whether to reward or punish the model based on its output.
The model continues to learn.
The best solution is decided based on the maximum reward.
The Learning Task :
Now that we understand the basic terminology, let’s talk about formalising this whole
process using a concept called a Markov Decision Process or MDP.

A Markov Decision Process (MDP) model contains:

 A set of possible world states S
 A set of possible actions A
 A real-valued reward function R(s, a)
 A description T of each action's effects in each state
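As an illustration, a tiny MDP can be written down directly in Python; the two-state example below is made up purely for demonstration.

# States, actions, rewards R(s, a) and transition probabilities T(s, a, s').
states = ["s0", "s1"]
actions = ["stay", "move"]

R = {("s0", "stay"): 0.0, ("s0", "move"): 1.0,
     ("s1", "stay"): 2.0, ("s1", "move"): 0.0}

# T[(s, a)] maps each possible successor state to its probability.
T = {("s0", "stay"): {"s0": 1.0},
     ("s0", "move"): {"s1": 0.9, "s0": 0.1},
     ("s1", "stay"): {"s1": 1.0},
     ("s1", "move"): {"s0": 0.8, "s1": 0.2}}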

Now, let us understand the Markov or 'memoryless' property.

Any random process in which the probability of being in a given state depends only on the previous state is a Markov process.

In other words, in the Markov decision process setup, the environment's response at time t+1 depends only on the state and action representations at time t, and is independent of whatever happened in the past.

 St: state of the agent at time t
 At: action taken by the agent at time t
 Rt: reward obtained at time t

At each time step, the agent takes an action At in state St, receives a reward Rt+1, and ends up in state St+1. The overall goal of the agent is to maximise the cumulative reward it receives in the long run. The total reward at any time instant t is given by:

Gt = Rt+1 + Rt+2 + Rt+3 + ... + RT

where T is the final time step of the episode. In the above equation, all future rewards have equal weight, which might not be desirable. That is where the additional concept of discounting comes into the picture. Basically, we define γ as a discounting factor, and each reward after the immediate reward is discounted by this factor as follows:

Gt = Rt+1 + γ Rt+2 + γ^2 Rt+3 + ... = Σ (k = 0 to ∞) γ^k Rt+k+1

For a discount factor < 1, rewards further in the future are diminished. The discount factor can be understood as a tuning parameter that controls how much one wants to weight the long term (γ close to 1) versus the short term (γ close to 0).
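For instance, the discounted return can be computed from a list of future rewards with a small helper like the one below (illustrative only).

def discounted_return(rewards, gamma=0.9):
    # rewards[k] is the reward received k+1 steps ahead (R_{t+1}, R_{t+2}, ...).
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

# With gamma = 0.9, four unit rewards are worth 1 + 0.9 + 0.81 + 0.729 = 3.439.
print(discounted_return([1, 1, 1, 1], gamma=0.9))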

State Value Function: How good is it to be in a given state?

Can we use the reward defined at each time step to quantify how good it is to be in a given state under a given policy? The value function v(s) under a policy π represents how good a state is for an agent to be in; in other words, it is the average return that the agent will get starting from the current state and following policy π:

vπ(s) = Eπ[ Gt | St = s ],  for all s ∈ S

E in the above equation represents the expected return at each state if the agent follows policy π, and S represents the set of all possible states.

Policy, as discussed earlier, is the mapping that gives the probability of taking each possible action at each state (π(a|s)). The policy may also be deterministic, in which case it tells you exactly what to do at each state rather than giving probabilities.

Now, it is intuitive that the optimal policy is reached when the value function is maximised at every state. This optimal policy is then given by:

π* = argmax over π of vπ(s),  for all s ∈ S
Q-Learning

Let’s say that a robot has to cross a maze and reach the end point. There are mines, and the
robot can only move one tile at a time. If the robot steps onto a mine, the robot is dead. The
robot has to reach the end point in the shortest time possible.
The scoring/reward system is as below:

1. The robot loses 1 point at each step. This is done so that the robot takes the shortest path and
reaches the goal as fast as possible.

2. If the robot steps on a mine, the point loss is 100 and the game ends.

3. If the robot gets a power-up, it gains 1 point.

4. If the robot reaches the end goal, the robot gets 100 points.
Now, the obvious question is: How do we train a robot to reach the end goal with the
shortest path without stepping on a mine?

So, how do we solve this?

Introducing the Q-Table


Q-Table is just a fancy name for a simple lookup table where we calculate the maximum expected
future rewards for action at each state. Basically, this table will guide us to the best action at each
state.
There are four possible actions at each non-edge tile: when the robot is at a state it can move up, down, right, or left.

So, let’s model this environment in our Q-Table.

In the Q-Table, the columns are the actions and the rows are the states.

Each Q-table score will be the maximum expected future reward that the robot will get if it takes
that action at that state. This is an iterative process, as we need to improve the Q-Table at each
iteration.

But the questions are:

 How do we calculate the values of the Q-table?

 Are the values available or predefined?


To learn each value of the Q-table, we use the Q-Learning algorithm.
Mathematics: the Q-Learning algorithm
Q-function
The Q-function uses the Bellman equation and takes two inputs: state (s) and action (a). The Q-value of a state-action pair is updated as:

Q(s, a) ← Q(s, a) + α [ R(s, a) + γ max over a' of Q(s', a') − Q(s, a) ]

where α is the learning rate, γ is the discount factor, and s' is the state reached after taking action a. Using this update, we obtain the values of Q for the cells in the table. When we start, all the values in the Q-table are zeros.

There is an iterative process of updating the values. As we start to explore the environment, the Q-function gives us better and better approximations by continuously updating the Q-values in the table.
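A minimal sketch of the Q-table update implied by the equation above is given below (Python); the state encoding, the action set, and the values of the learning rate and discount factor are illustrative assumptions.

from collections import defaultdict

Q = defaultdict(float)            # Q[(state, action)] starts at 0 for every cell
ALPHA, GAMMA = 0.1, 0.9           # learning rate and discount factor (assumed)
ACTIONS = ["up", "down", "left", "right"]

def update_q(state, action, reward, next_state):
    # Move Q(s, a) toward the target r + gamma * max_a' Q(s', a').
    best_next = max(Q[(next_state, a)] for a in ACTIONS)
    Q[(state, action)] += ALPHA * (reward + GAMMA * best_next - Q[(state, action)])

# Example: one learning step after the robot moved right and lost 1 point.
update_q(state=(0, 0), action="right", reward=-1, next_state=(0, 1))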

Temporal difference (TD) learning

Temporal difference learning refers to a class of model-free reinforcement learning methods which learn by bootstrapping from the current estimate of the value function. These methods sample from the environment, like Monte Carlo methods, and perform updates based on current estimates, like dynamic programming methods.
While Monte Carlo methods only adjust their estimates once the final outcome is known, TD methods adjust predictions to match later, more accurate predictions about the future before the final outcome is known. This is a form of bootstrapping, as illustrated by the following example:
"Suppose you wish to predict the weather for Saturday, and you have some model that predicts Saturday's weather, given the weather of each day in the week. In the standard case, you would wait until Saturday and then adjust all your models. However, when it is, for example, Friday, you should have a pretty good idea of what the weather will be on Saturday – and thus be able to change, say, Saturday's model before Saturday arrives."
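The simplest TD method, TD(0), updates the value of a state from a single observed transition; the sketch below is a hedged illustration in Python, with alpha and gamma as assumed hyperparameters.

def td0_update(V, state, reward, next_state, alpha=0.1, gamma=0.9):
    # Bootstrapped update: move V(s) toward the TD target r + gamma * V(s').
    td_error = reward + gamma * V[next_state] - V[state]
    V[state] += alpha * td_error
    return td_error   # the prediction error discussed in the next paragraphs

# Example: revise Friday's estimate as soon as Saturday's estimate is available.
V = {"friday": 0.0, "saturday": 0.5}
td0_update(V, "friday", reward=0.0, next_state="saturday")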
Temporal difference methods are related to the temporal difference model of animal learning.
The TD algorithm has also received attention in the field of neuroscience. Researchers discovered that the firing rate of dopamine neurons in the ventral tegmental area (VTA) and substantia nigra (SNc) appears to mimic the error function in the algorithm. The error function reports back
the difference between the estimated reward at any given state or time step and the actual
reward received. The larger the error function, the larger the difference between the expected and
actual reward. When this is paired with a stimulus that accurately reflects a future reward, the
error can be used to associate the stimulus with the future reward.
Dopamine cells appear to behave in a similar manner. In one experiment measurements of
dopamine cells were made while training a monkey to associate a stimulus with the reward of
juice.[11] Initially the dopamine cells increased firing rates when the monkey received juice,
indicating a difference in expected and actual rewards. Over time this increase in firing back
propagated to the earliest reliable stimulus for the reward. Once the monkey was fully trained,
there was no increase in firing rate upon presentation of the predicted reward. Subsequently, the
firing rate for the dopamine cells decreased below normal activation when the expected reward
was not produced. This closely mimics how the error function in TD is used for reinforcement learning.

Dynamic Programming Method

Important: to be able to apply DP to solve the RL problem, we need to know the transition probability matrix as well as the reward system. This might not always be the case in real-world problems!

As noted, we will proceed with an iterative solution because the analytical one is hard to obtain.
We start with an initial random value at each state, then we pick a random
policy to follow. The reason to have a policy is simply because in order to
compute any state-value function we need to know how the agent is behaving.
(If you are wondering aren’t we supposed to compute the v* and 𝜋* ?! be
patient)
So we start by randomly initializing the values of all states and we call the
resulting state-value function v0(s), (we call it a function because after
assigning the values to all states, the v0(s) will return the value at any state s).

Now we follow the policy 𝜋, and on each iteration we update the values of all
the states. After updating all the states (roughly speaking) we obtain a new
state-value function. The values of the states after each iteration will be closer
to the theoretical values given by v𝜋(s) (remember that we are doing this
because the theoretical/analytical solution v𝜋(s) is hard to get).

As the iterations go on we obtain the functions v1(s), v2(s), v3(s), …, vk(s). We keep iterating until the absolute difference |vk(s) - vk-1(s)| is less than a small number θ for every state s. This θ is the threshold below which we consider that the function vk(s) has converged closely enough towards v𝜋(s).
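A condensed Python sketch of this iterative policy evaluation is given below; it assumes the transition probabilities T(s, a, s'), the rewards R(s, a, s'), and a fixed policy pi(s, a) are available as functions, all of which are hypothetical placeholders here.

def policy_evaluation(states, actions, T, R, pi, gamma=0.9, theta=1e-6):
    V = {s: 0.0 for s in states}                 # v0(s): arbitrary initial values
    while True:
        delta = 0.0
        for s in states:
            # Expected one-step return under policy pi, bootstrapping from V.
            new_v = sum(pi(s, a) * sum(T(s, a, s2) * (R(s, a, s2) + gamma * V[s2])
                                       for s2 in states)
                        for a in actions)
            delta = max(delta, abs(new_v - V[s]))
            V[s] = new_v
        if delta < theta:                        # |vk(s) - vk-1(s)| < θ for all s
            return V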

To run such an iteration in practice, define the reward obtained when moving from one state to the other and the discount factor gamma, then run the update one step at a time (or several steps at once, to speed up processing). As you will notice, after a certain number of steps the value at each state varies only a little; this is when you know vk(s) has become close to v𝜋(s).

State values are updated after each step. The action values show the worthiness of each possible action, and finally the policy grid shows the best action to take at each state. It is not hard to see that at some states (or cells) there is more than one equally good action to take.

Conclusion

Dynamic programming is an iterative alternative to a hard-to-obtain analytical solution. However, it suffers from a major flaw: it requires knowing the transition matrix as well as the rewards, which is not always possible in a real-life problem.
