
ST. SOLDIER INSTITUTE OF ENGINEERING & TECHNOLOGY
SESSION – (2019-2023)
PRACTICAL FILE OF ARTIFICIAL INTELLIGENCE LAB
(BTCS 605-18)
INDEX
1. Write a program to conduct uninformed and informed search.
2. Write a program to conduct game search.
3. Write a program to construct a Bayesian network from given data.
4. Write a program to infer from the Bayesian network.
5. Write a program to run value and policy iteration in a grid world.
6. Write a program to do reinforcement learning in a grid world.
EXPERIMENT NO. 1
AIM : Write a program to conduct uninformed and informed search.
THEORY :
Uninformed Search Algorithms
Uninformed search is a class of general-purpose search algorithms that operate in a brute-force way. Uninformed search algorithms have no additional information about the state or search space other than how to traverse the tree, so they are also called blind search.
The following are the main types of uninformed search algorithms:
1. Breadth-first Search
2. Depth-first Search
1. Breadth-first Search:
Breadth-first search is the most common search strategy for traversing a tree or graph. This algorithm searches breadthwise in a tree or graph, so it is called breadth-first search. The BFS algorithm starts searching from the root node of the tree and expands all successor nodes at the current level before moving to the nodes of the next level. The breadth-first search algorithm is an example of a general-graph search algorithm. Breadth-first search is implemented using a FIFO queue data structure.
Code for BFS :
#include<bits/stdc++.h>
using namespace std;
// This class represents a directed graph using
// adjacency list representation
class Graph
{
int V; // No. of vertices
// Adjacency lists
vector<list<int>> adj;
public:
Graph(int V); // Constructor
// function to add an edge to graph
void addEdge(int v, int w);
// prints BFS traversal from a given source s
void BFS(int s);
};
Graph::Graph(int V)
{
this->V = V;
adj.resize(V);
}
void Graph::addEdge(int v, int w)
{
adj[v].push_back(w); // Add w to v's list.
}
void Graph::BFS(int s)
{
// Mark all the vertices as not visited
vector<bool> visited;
visited.resize(V,false);
// Create a queue for BFS
list<int> queue;
// Mark the current node as visited and enqueue it
visited[s] = true;
queue.push_back(s);
while(!queue.empty())
{
// Dequeue a vertex from queue and print it
s = queue.front();
cout << s << " ";
queue.pop_front();
// Get all adjacent vertices of the dequeued
// vertex s. If an adjacent vertex has not been
// visited, mark it visited and enqueue it
for (auto adjacent: adj[s])
{
if (!visited[adjacent])
{
visited[adjacent] = true;
queue.push_back(adjacent);
}
}
}
}

// Driver program to test methods of the graph class
int main()
{
// Create a graph given in the above diagram
Graph g(4);
g.addEdge(0, 1);
g.addEdge(0, 2);
g.addEdge(1, 2);
g.addEdge(2, 0);
g.addEdge(2, 3);
g.addEdge(3, 3);

cout << "Following is Breadth First Traversal "


<< "(starting from vertex 2) \n";
g.BFS(2);

return 0;
}
2. Depth-first Search
Depth-first search is a recursive algorithm for traversing a tree or graph data structure. It is called depth-first search because it starts from the root node and follows each path to its greatest depth node before moving to the next path. DFS uses a stack data structure for its implementation. The process of the DFS algorithm is similar to the BFS algorithm.
EXPERIMENT NO. 1
AIM : Write a program to conduct uninformed and informed search
Output:
Following is Breadth First Traversal (starting from vertex 2)
2 0 3 1
Code for DFS:
#include <bits/stdc++.h>
using namespace std;
// Graph class represents a directed graph
// using adjacency list representation
class Graph {
public:
map<int, bool> visited;
map<int, list<int> > adj;
// function to add an edge to graph
void addEdge(int v, int w);
// DFS traversal of the vertices
// reachable from v
void DFS(int v);
};
void Graph::addEdge(int v, int w)
{
adj[v].push_back(w); // Add w to v's list.
}
void Graph::DFS(int v)
{
// Mark the current node as visited and
// print it
visited[v] = true;
cout << v << " ";
// Recurse for all the vertices adjacent
// to this vertex
list<int>::iterator i;
for (i = adj[v].begin(); i != adj[v].end(); ++i)
if (!visited[*i])
DFS(*i);
}
// Driver code
int main()
{
// Create a graph given in the above diagram
Graph g;
g.addEdge(0, 1);
g.addEdge(0, 2);
g.addEdge(1, 2);
g.addEdge(2, 0);
g.addEdge(2, 3);
g.addEdge(3, 3);

cout << "Following is Depth First Traversal"


" (starting from vertex 2) \n";
g.DFS(2);

return 0;
}
Informed Search:
Informed search algorithms have information about the goal state which helps in more efficient searching. This information is obtained by a heuristic function that estimates how close a state is to the goal state.
The main informed search algorithm is given below:
1. A* Search Algorithm
A* search is the most commonly known form of best-first search. It uses the heuristic function h(n) and the cost to reach node n from the start state, g(n). It combines features of UCS and greedy best-first search, which lets it solve problems efficiently. The A* search algorithm finds the shortest path through the search space using the heuristic function, expands a smaller search tree and provides the optimal result faster. A* is similar to UCS except that it evaluates nodes using f(n) = g(n) + h(n) instead of g(n) alone.
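Before the library-based example below, a rough self-contained sketch of the A* idea can be written in Python. The grid map, start and goal here are assumed purely for illustration and are not part of the lab's C++ program; the heuristic used is the Manhattan distance.

import heapq

def astar(grid, start, goal):
    # grid: 2D list, 0 = free cell, 1 = obstacle; start/goal: (row, col) tuples
    rows, cols = len(grid), len(grid[0])
    def h(p):
        # Manhattan distance heuristic h(n)
        return abs(p[0] - goal[0]) + abs(p[1] - goal[1])
    open_heap = [(h(start), 0, start)]   # entries are (f = g + h, g, node)
    came_from = {}
    best_g = {start: 0}
    while open_heap:
        f, g, current = heapq.heappop(open_heap)
        if current == goal:
            # Reconstruct the path from goal back to start
            path = [current]
            while current in came_from:
                current = came_from[current]
                path.append(current)
            return path[::-1]
        if g > best_g.get(current, float('inf')):
            continue   # stale heap entry
        for dr, dc in [(-1, 0), (1, 0), (0, -1), (0, 1)]:
            nxt = (current[0] + dr, current[1] + dc)
            if 0 <= nxt[0] < rows and 0 <= nxt[1] < cols and grid[nxt[0]][nxt[1]] == 0:
                ng = g + 1   # uniform step cost g(n)
                if ng < best_g.get(nxt, float('inf')):
                    best_g[nxt] = ng
                    came_from[nxt] = current
                    heapq.heappush(open_heap, (ng + h(nxt), ng, nxt))
    return None   # no path exists

# Assumed map for illustration only
grid = [[0, 0, 0, 0],
        [1, 1, 0, 1],
        [0, 0, 0, 0],
        [0, 1, 1, 0]]
print(astar(grid, (0, 0), (3, 3)))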
Code for the A* algorithm (this example depends on an external open-source AStar library that provides "source/AStar.hpp"):
#include <iostream>
#include "source/AStar.hpp"
int main()
{
AStar::Generator generator;
// Set 2d map size.
generator.setWorldSize({25, 25});
// You can use a few heuristics : manhattan, euclidean or octagonal.
generator.setHeuristic(AStar::Heuristic::euclidean);
generator.setDiagonalMovement(true);
std::cout << "Generate path ... \n";
// This method returns vector of coordinates from target to source.
auto path = generator.findPath({0, 0}, {20, 20});
for(auto& coordinate : path) {
std::cout << coordinate.x << " " << coordinate.y << "\n";
}
}
OUTPUT OF A* algorithm
OUTPUT (DFS):
Following is Depth First Traversal (starting from vertex 2)
2 0 1 3
EXPERIMENT NO. 2
AIM : Write a program to conduct game search.
THEORY : Game playing was one of the first tasks undertaken in Artificial Intelligence. Game theory in AI dates back to around 1950, almost from the days when computers became programmable. The very first game tackled in AI was chess. Pioneers in the field of game theory in AI were Konrad Zuse (the inventor of the first programmable computer and the first programming language), Claude Shannon (the inventor of information theory), Norbert Wiener (the creator of modern control theory), and Alan Turing. Since then, there has been steady progress in the standard of play, to the point that machines have defeated human champions (although not every time) in chess and backgammon, and are competitive in many other games.
Types of Game
1. Perfect Information Game: the player knows all the possible moves of himself and the opponent and their results, e.g. chess.
2. Imperfect Information Game: the player does not know all the possible moves of the opponent, e.g. bridge, since all the cards are not visible to the player.
Mini-Max Algorithm in Artificial Intelligence:
The mini-max algorithm is a recursive or backtracking algorithm used in decision-making and game theory. It provides an optimal move for the player, assuming that the opponent is also playing optimally. The mini-max algorithm uses recursion to search through the game tree and is mostly used for game playing in AI, such as chess, checkers, tic-tac-toe, Go, and various other two-player games. The algorithm computes the minimax decision for the current state. In this algorithm two players play the game; one is called MAX and the other is called MIN. The two players are opponents of each other: MAX selects the maximized value while MIN selects the minimized value, so each tries to give the opponent the minimum benefit while getting the maximum benefit itself. The minimax algorithm performs a depth-first search to explore the complete game tree. It proceeds all the way down to the terminal nodes of the tree and then backs the values up the tree as the recursion unwinds.
Code for the minimax algorithm :
// A simple C++ program to find the
// maximum score that the
// maximizing player can get.
#include<bits/stdc++.h>
using namespace std;

// Returns the optimal value a maximizer can obtain.
// depth is the current depth in the game tree.
// nodeIndex is the index of the current node in scores[].
// isMax is true if the current move is
// the maximizer's, else false.
// scores[] stores the leaves of the game tree.
// h is the maximum height of the game tree.
int minimax(int depth, int nodeIndex, bool isMax,
int scores[], int h)
{
// Terminating condition, i.e. a leaf node is reached
if (depth == h)
return scores[nodeIndex];

// If the current move is the maximizer's,
// find the maximum attainable value
if (isMax)
return max(minimax(depth+1, nodeIndex*2, false, scores, h),
minimax(depth+1, nodeIndex*2 + 1, false, scores, h));

// Else (if the current move is the minimizer's),
// find the minimum attainable value
else
return min(minimax(depth+1, nodeIndex*2, true, scores, h),
minimax(depth+1, nodeIndex*2 + 1, true, scores, h));
}

// A utility function to find log n in base 2
int log2(int n)
{
return (n==1)? 0 : 1 + log2(n/2);
}

// Driver code
int main()
{
// The number of elements in scores must be
// a power of 2.
int scores[] = {3, 5, 2, 9, 12, 5, 23, 23};
int n = sizeof(scores)/sizeof(scores[0]);
int h = log2(n);
int res = minimax(0, 0, true, scores, h);
cout << "The optimal value is : " << res << endl;
return 0;
}
EXPERIMENT NO. 2
AIM : Write a program to conduct game search
Output:
The optimal value is: 12
EXPERIMENT NO. 3
AIM : Write a program to construct a Bayesian network from given data
THEORY: A Bayesian network is a directed acyclic graph in which each
edge corresponds to a conditional dependency, and each node
corresponds to a unique random variable. A Bayesian network consists of two major parts: a directed acyclic graph and a set of conditional probability distributions.
1. The directed acyclic graph is a set of random variables represented
by nodes.
2. The conditional probability distribution of a node (random variable)
is defined for every possible outcome of the preceding causal
node(s).
For illustration, consider the following example. Suppose we attempt to
turn on our computer, but the computer does not start
(observation/evidence). We would like to know which of the possible
causes of computer failure is more likely. In this simplified illustration, we
assume only two possible causes of this misfortune: electricity failure and
computer malfunction.
The corresponding directed acyclic graph has two cause nodes, "Electricity failure" and "Computer malfunction", each with a directed edge into the evidence node "Computer failure".
The goal is to calculate the posterior conditional probability distribution
of each of the possible unobserved causes given the observed evidence,
i.e. P [Cause | Evidence].
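To make the computer-failure illustration concrete, the network can be sketched with pgmpy as below. This is only an illustrative sketch: all CPD values are assumed, not taken from any data, and depending on the pgmpy version the model class may be named BayesianNetwork instead of BayesianModel.

# Sketch of the computer-failure network (all CPD values are assumed for illustration)
from pgmpy.models import BayesianModel
from pgmpy.factors.discrete import TabularCPD
from pgmpy.inference import VariableElimination

# Structure: both causes have a directed edge into the observed effect
model = BayesianModel([('Electricity', 'ComputerFailure'),
                       ('Malfunction', 'ComputerFailure')])

# Prior probabilities of the causes (state 0 = ok, state 1 = failure/broken)
cpd_elec = TabularCPD('Electricity', 2, [[0.9], [0.1]])
cpd_malf = TabularCPD('Malfunction', 2, [[0.95], [0.05]])

# P(ComputerFailure | Electricity, Malfunction); columns follow the evidence order
cpd_fail = TabularCPD('ComputerFailure', 2,
                      [[1.0, 0.1, 0.0, 0.0],   # computer starts
                       [0.0, 0.9, 1.0, 1.0]],  # computer does not start
                      evidence=['Electricity', 'Malfunction'],
                      evidence_card=[2, 2])

model.add_cpds(cpd_elec, cpd_malf, cpd_fail)
model.check_model()

# P(Cause | Evidence): which cause is more likely, given the computer did not start?
infer = VariableElimination(model)
print(infer.query(['Electricity'], evidence={'ComputerFailure': 1}))
print(infer.query(['Malfunction'], evidence={'ComputerFailure': 1}))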
Data Set:
Title: Heart Disease Databases
The Cleveland database contains 76 attributes, but all published
experiments refer to using a
subset of 14 of them. In particular, the Cleveland database is the only one
that has been used
by ML researchers to this date. The "Heartdisease" field refers to the
presence of heart disease in the patient. It is integer valued from 0 (no
presence) to 4.
Database: 0 1 2 3 4 Total
Cleveland: 164 55 36 35 13 303
Attribute Information:
1. age: age in years
2. sex: sex (1 = male; 0 = female)
3. cp: chest pain type (1 = typical angina; 2 = atypical angina; 3 = non-anginal pain; 4 = asymptomatic)
4. trestbps: resting blood pressure (in mm Hg on admission to the hospital)
5. chol: serum cholesterol in mg/dl
6. fbs: fasting blood sugar > 120 mg/dl (1 = true; 0 = false)
7. restecg: resting electrocardiographic results (0 = normal; 1 = ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV); 2 = probable or definite left ventricular hypertrophy by Estes' criteria)
8. thalach: maximum heart rate achieved
9. exang: exercise induced angina (1 = yes; 0 = no)
10. oldpeak: ST depression induced by exercise relative to rest
11. slope: the slope of the peak exercise ST segment (1 = upsloping; 2 = flat; 3 = downsloping)
12. ca: number of major vessels (0-3) colored by fluoroscopy
13. thal: 3 = normal; 6 = fixed defect; 7 = reversable defect
14. heartdisease: diagnosis of heart disease (angiographic disease status), integer valued from 0 (no presence) to 4
Some instances from the dataset:
age sex cp trestbps chol fbs restecg thalach exang oldpeak slope ca thal heartdisease
63  1   1  145      233  1   2       150     0     2.3     3     0  6    0
67  1   4  160      286  0   2       108     1     1.5     2     3  3    2
67  1   4  120      229  0   2       129     1     2.6     2     2  7    1
41  0   2  130      204  0   2       172     0     1.4     1     0  3    0
62  0   4  140      268  0   2       160     0     3.6     3     2  3    3
60  1   4  130      206  0   2       132     1     2.4     2     2  7    4
Code :
import numpy as np
import csv
import pandas as pd
from pgmpy.models import BayesianModel
from pgmpy.estimators import MaximumLikelihoodEstimator
from pgmpy.inference import VariableElimination
#read Cleveland Heart Disease data
heartDisease = pd.read_csv('heart.csv')
heartDisease = heartDisease.replace('?',np.nan)
#display the data
print('Few examples from the dataset are given below')
print(heartDisease.head())
#Model Bayesian Network
model = BayesianModel([('age','trestbps'),('age','fbs'),
('sex','trestbps'),('exang','trestbps'),('trestbps','heartdisease'),
('fbs','heartdisease'),('heartdisease','restecg'),
('heartdisease','thalach'),('heartdisease','chol')])
#Learning CPDs using Maximum Likelihood Estimators
print('\n Learning CPD using Maximum likelihood estimators')
model.fit(heartDisease,estimator=MaximumLikelihoodEstimator)
# Inferencing with Bayesian Network
print('\n Inferencing with Bayesian Network:')
HeartDisease_infer = VariableElimination(model)
#computing the Probability of HeartDisease given Age
print('\n 1. Probability of HeartDisease given Age=28')
q = HeartDisease_infer.query(variables=['heartdisease'], evidence={'age':28})
print(q['heartdisease'])
#computing the Probability of HeartDisease given cholesterol
print('\n 2. Probability of HeartDisease given cholesterol=100')
q = HeartDisease_infer.query(variables=['heartdisease'], evidence={'chol':100})
print(q['heartdisease'])
EXPERIMENT NO. 3
AIM : Write a program to construct a Bayesian network from given data
Output:
Few examples from the dataset are given below
age sex cp trestbps ... slope ca thal heartdisease
0 63 1 1 145 … 3 0 6 0
1 67 1 4 160 … 2 3 3 2
2 67 1 4 120 … 2 2 7 1
3 37 1 3 130 … 3 0 3 0
4 41 0 2 130 … 1 0 3 0
EXPERIMENT NO. 4
AIM: Write a program to infer from the Bayesian network.
THEORY : An acyclic directed graph is used to create a Bayesian network,
which is a probability model. It’s factored by utilizing a single conditional
probability distribution for each variable in the model, whose distribution
is based on the parents in the graph. The simple principle of probability
underpins Bayesian models. So, first, let’s define conditional probability
and joint probability distribution.
Conditional Probability
Conditional probability is a measure of the likelihood of an event occurring given that another event has already occurred (through assumption, supposition, statement, or evidence). If A is the event of interest and B is known or assumed to have occurred, the conditional probability of A given B is generally written P(A|B) or, less frequently, PB(A). It can also be expressed as the fraction of the probability of B that intersects with A:

P(A|B) = P(A ∩ B) / P(B)
Joint Probability
The probability of two (or more) events occurring together is known as the joint probability. The joint probability distribution describes the probability of two or more random variables taking values together. For example, the joint probability of events A and B is written P(A and B), where the upside-down capital "U" operator (∧) or, in some situations, a comma "," represents the conjunction:
P(A ∧ B)
P(A, B)
The joint probability of A and B is obtained by multiplying the probability of event A by the probability of event B given A, i.e. P(A, B) = P(B|A) · P(A).
Posterior Probability
In Bayesian statistics, the posterior probability of a random event or an uncertain proposition is the conditional probability that is assigned after the relevant evidence or background is taken into account. "Posterior" here means "after taking into account the evidence pertinent to the particular case being examined." The posterior probability distribution is the probability distribution of an unknown quantity, treated as a random variable, conditional on the data obtained from an experiment or survey.
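As a quick numeric illustration of how the three quantities fit together (the numbers below are assumed purely for illustration), Bayes' rule gives the posterior as P(A|B) = P(B|A) · P(A) / P(B):

# Worked example with assumed numbers:
# A = "computer malfunction", B = "computer does not start"
p_A = 0.05           # prior P(A)
p_B_given_A = 0.90   # likelihood P(B|A)
p_B_given_notA = 0.10

# Joint probability: P(A, B) = P(B|A) * P(A)
p_A_and_B = p_B_given_A * p_A

# Total probability of the evidence: P(B)
p_B = p_B_given_A * p_A + p_B_given_notA * (1 - p_A)

# Posterior (conditional) probability: P(A|B) = P(A, B) / P(B)
p_A_given_B = p_A_and_B / p_B
print(round(p_A_given_B, 3))   # prints 0.321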
Inferencing with Bayesian Network
In this demonstration, we'll use a Bayesian network to solve the well-known Monty Hall problem. Let me explain the Monty Hall problem to those of you who are unfamiliar with it:
The problem involves a game show in which a contestant must choose one of three doors, one of which conceals a prize. After the contestant has chosen a door, the show's host (Monty) opens an empty door and asks the contestant whether he wants to switch to the other door. The decision is whether to keep the current door or switch to the remaining one. It is preferable to switch, because the probability of winning the prize behind the other door is higher. To resolve this ambiguity, let's model the problem with a Bayesian network.
CODE :
from pgmpy.models import BayesianNetwork
from pgmpy.factors.discrete import TabularCPD
import networkx as nx
import pylab as plt
# Defining Bayesian Structure
model = BayesianNetwork([('Guest', 'Host'), ('Price', 'Host')])
# Defining the CPDs:
cpd_guest = TabularCPD('Guest', 3, [[0.33], [0.33], [0.33]])
cpd_price = TabularCPD('Price', 3, [[0.33], [0.33], [0.33]])
cpd_host = TabularCPD('Host', 3, [[0, 0, 0, 0, 0.5, 1, 0, 1, 0.5],
[0.5, 0, 1, 0, 0, 0, 1, 0, 0.5],
[0.5, 1, 0, 1, 0.5, 0, 0, 0, 0]],
evidence=['Guest', 'Price'], evidence_card=[3, 3])
# Associating the CPDs with the network structure.
model.add_cpds(cpd_guest, cpd_price, cpd_host)
model.check_model()
# Inferring the posterior probability
from pgmpy.inference import VariableElimination
infer = VariableElimination(model)
posterior_p = infer.query(['Host'], evidence={'Guest': 2, 'Price': 2})
print(posterior_p)
EXPERIMENT NO. 4
AIM: Write a program to infer from the Bayesian network
OUTPUT :
EXPERIMENT NO. 5
AIM: Write a program to run value and policy iteration in a grid world
THEORY : Value Iteration
With the tools we have explored until now, a new question arises: why do we need to consider an initial policy at all? The idea of the value iteration algorithm is that we can compute the value function without a policy. Instead of letting the policy, π, dictate which actions are selected, we select the actions that maximize the expected reward:

V(s) = maxₐ ∑s′ Pss′a [Rss′a + γ V(s′)]
CODE FOR VALUE ITERATION :

def valueIteration(self, gridWorld, gamma = 1):
    self.resetPolicy() # ensure empty policy before calling evaluatePolicy
    V_old = None
    V_new = np.repeat(0, gridWorld.size())
    convergedCellIndices = np.zeros(0)
    while len(convergedCellIndices) != len(V_new):
        V_old = V_new
        V_new = self.evaluatePolicySweep(gridWorld, V_old, gamma,
                                         convergedCellIndices)
        convergedCellIndices = self.findConvergedCells(V_old, V_new)
    greedyPolicy = findGreedyPolicy(V_new, gridWorld, self.gameLogic)
    self.setPolicy(greedyPolicy)
    self.setWidth(gridWorld.getWidth())
    self.setHeight(gridWorld.getHeight())
    return(V_new)
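The method above relies on helper classes (the grid world, evaluatePolicySweep, findConvergedCells, findGreedyPolicy) that are not listed in this file. As a self-contained sketch of the same idea, the value iteration update can be run with plain NumPy on the 4x4 grid world used in Experiment 6; the grid size, rewards and terminal states below are assumptions chosen to match that experiment.

import numpy as np

gridSize = 4
gamma = 1.0
theta = 1e-4
terminals = [(0, 0), (gridSize - 1, gridSize - 1)]
actions = [(-1, 0), (1, 0), (0, 1), (0, -1)]   # up, down, right, left

def step(state, action):
    # Terminal states absorb with zero reward; moves off the grid leave the state unchanged
    if state in terminals:
        return state, 0
    nxt = (state[0] + action[0], state[1] + action[1])
    if not (0 <= nxt[0] < gridSize and 0 <= nxt[1] < gridSize):
        nxt = state
    return nxt, -1

V = np.zeros((gridSize, gridSize))
while True:
    delta = 0.0
    for i in range(gridSize):
        for j in range(gridSize):
            # Value iteration update: V(s) <- max over a of [ r + gamma * V(s') ]
            candidates = []
            for a in actions:
                nxt, r = step((i, j), a)
                candidates.append(r + gamma * V[nxt])
            best = max(candidates)
            delta = max(delta, abs(best - V[i, j]))
            V[i, j] = best
    if delta < theta:
        break

print(V)   # each value is minus the number of steps to the nearest terminal corner

Note that these optimal values differ from the uniform-policy evaluation output shown in Experiment 6, because value iteration takes the maximum over actions instead of averaging them.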
POLICY ITERATION :
A simple strategy for this is a greedy algorithm that iterates over all the
cells in the grid and then chooses the action that maximizes the expected
reward according to the value function.
This approach implicitly determines the action-value function, which is defined as

Qπ(s,a) = ∑s′ Pss′a [Rss′a + γ Vπ(s′)]

The improvePolicy function determines the value function of a policy (if it is not available yet) and then calls findGreedyPolicy to identify the optimal action for every state:
def improvePolicy(policy, gridWorld, gamma = 1):
    policy = copy.deepcopy(policy) # don't modify the old policy
    if len(policy.values) == 0:
        # policy needs to be evaluated first
        policy.evaluatePolicy(gridWorld)
    greedyPolicy = findGreedyPolicy(policy.getValues(), gridWorld,
                                    policy.gameLogic, gamma)
    policy.setPolicy(greedyPolicy)
    return policy

def findGreedyPolicy(values, gridWorld, gameLogic, gamma = 1):
    # create a greedy policy based on the values param
    stateGen = StateGenerator()
    greedyPolicy = [Action(Actions.NONE)] * len(values)
    for (i, cell) in enumerate(gridWorld.getCells()):
        gridWorld.setActor(cell)
        if not cell.canBeEntered():
            continue
        maxPair = (Actions.NONE, -np.inf)
        for actionType in Actions:
            if actionType == Actions.NONE:
                continue
            proposedCell = gridWorld.proposeMove(actionType)
            if proposedCell is None:
                # action is nonsensical in this state
                continue
            Q = 0.0 # action-value function
            proposedStates = stateGen.generateState(gridWorld, actionType, cell)
            for proposedState in proposedStates:
                actorPos = proposedState.getIndex()
                transitionProb = gameLogic.getTransitionProbability(cell, proposedState,
                                                                    actionType, gridWorld)
                reward = gameLogic.R(cell, proposedState, actionType)
                expectedValue = transitionProb * (reward + gamma * values[actorPos])
                Q += expectedValue
            if Q > maxPair[1]:
                maxPair = (actionType, Q)
        gridWorld.unsetActor(cell) # reset state
        greedyPolicy[i] = Action(maxPair[0])
    return greedyPolicy
EXPERIMENT NO. 5
AIM : Write a program to run value and policy iteration in a grid world.
OUTPUT :
EXPERIMENT NO. 6
AIM : Write a program to do reinforcement learning in a grid world.

THEORY : Reinforcement Learning (RL) involves decision making under uncertainty which tries to maximize the return over successive states. There are four main elements of a Reinforcement Learning system: a policy, a reward signal, a value function, and (optionally) a model of the environment. The policy is a mapping from states to actions or a probability distribution over actions. Every action the agent takes results in a numerical reward. The agent's sole purpose is to maximize the reward in the long run. The reward indicates the immediate return, while a value function specifies the return in the long run: the value of a state is the expected reward that an agent can accrue. The agent/robot takes an action At in state St, moves to state St+1 and gets a reward Rt+1.
An agent will seek to maximize the overall return as it transitions across states.
The expected return can be expressed as

Gt = Rt+1 + γGt+1

where Gt is the expected return at time t and γGt+1 is the discounted expected return from time t+1.

A policy is a mapping from states to probabilities of selecting each possible action. If the agent is following policy π at time t, then π(a|s) is the probability that At = a if St = s.

The value function of a state s under a policy π, denoted vπ(s), is the expected return when starting in s and following π thereafter.

This can be written as

vπ(s) = Eπ[Gt | St = s]

Similarly, the action value function gives the expected return when taking an action a in state s:

qπ(s,a) = Eπ[Gt | St = s, At = a]

This is Bellman's equation for the state value function:

vπ(s) = ∑a π(a|s) ∑s′ Pss′a [Rss′a + γ vπ(s′)]

The Bellman equations give one such equation for each state.

The Bellman optimality equations give the optimal policy of choosing specific actions in specific states to achieve the maximum reward and reach the goal efficiently. They are given as

v∗(s) = maxₐ ∑s′ Pss′a [Rss′a + γ v∗(s′)]
The Bellman equations cannot be used directly in goal-directed problems, and dynamic programming is used instead, where the value functions are computed iteratively.
In the problem below the maze has 2 end states, shown in the corners. There are four possible actions in each state: up, down, right and left. If an action in a state takes the agent out of the grid, then the agent remains in the same state. All actions have a reward of -1 while the end states have a reward of 0.
The reward for any transition is Rt = −1, except transitions into the end states at the corners, which have a reward of 0. The policy is a uniform policy with all actions being equi-probable, with a probability of 1/4 or 0.25.

1. Gridworld-1
In [1]:
import numpy as np
import random
In [2]:
gamma = 1 # discounting rate
gridSize = 4
rewardValue = -1
terminationStates = [[0,0], [gridSize-1, gridSize-1]]
actions = [[-1, 0], [1, 0], [0, 1], [0, -1]]
numIterations = 1000

The action value provides the next state for a given action in a state and the accrued reward
In [3]:
def actionValue(initialPosition, action):
    if initialPosition in terminationStates:
        finalPosition = initialPosition
        reward = 0
    else:
        # Compute final position
        finalPosition = np.array(initialPosition) + np.array(action)
        reward = rewardValue
    # If the action moves the finalPosition out of the grid, stay in the same cell
    if -1 in finalPosition or gridSize in finalPosition:
        finalPosition = initialPosition
        reward = rewardValue
    #print(finalPosition)
    return finalPosition, reward

1a. Bellman Update


In [4]:
# Initialize valueMap and valueMap1
valueMap = np.zeros((gridSize, gridSize))
valueMap1 = np.zeros((gridSize, gridSize))
states = [[i, j] for i in range(gridSize) for j in range(gridSize)]
In [5]:
def policy_evaluation(numIterations, gamma, theta, valueMap):
    for i in range(numIterations):
        delta = 0
        for state in states:
            weightedRewards = 0
            for action in actions:
                finalPosition, reward = actionValue(state, action)
                weightedRewards += 1/4 * (reward + gamma * valueMap[finalPosition[0], finalPosition[1]])
            valueMap1[state[0], state[1]] = weightedRewards
            delta = max(delta, abs(weightedRewards - valueMap[state[0], state[1]]))
        valueMap = np.copy(valueMap1)
        if (delta < 0.01):
            print(valueMap)
            break
In [6]:
valueMap = np.zeros((gridSize, gridSize))
valueMap1 = np.zeros((gridSize, gridSize))
states = [[i, j] for i in range(gridSize) for j in range(gridSize)]
policy_evaluation(1000,1,0.001,valueMap)
[[ 0. -13.89528403 -19.84482978 -21.82635535]
[-13.89528403 -17.86330422 -19.84586777 -19.84482978]
[-19.84482978 -19.84586777 -17.86330422 -13.89528403]
[-21.82635535 -19.84482978 -13.89528403 0. ]]
In [7]:
valueMap = np.zeros((gridSize, gridSize))
valueMap1 = np.zeros((gridSize, gridSize))
states = [[i, j] for i in range(gridSize) for j in range(gridSize)]
pi = np.ones((gridSize,gridSize))/4
pi1 = np.chararray((gridSize, gridSize))
pi1[:] = 'a'
In [8]:
# Compute the value state function for the Grid
def policy_evaluate(states, actions, gamma, valueMap):
    for state in states:
        weightedRewards = 0
        for action in actions:
            finalPosition, reward = actionValue(state, action)
            weightedRewards += 1/4 * (reward + gamma * valueMap[finalPosition[0], finalPosition[1]])
        # Set the computed weighted rewards to valueMap1
        valueMap1[state[0], state[1]] = weightedRewards
    # Copy to original valueMap
    valueMap = np.copy(valueMap1)
    return(valueMap)
In [9]:
def argmax(q_values):
    idx = np.argmax(q_values)
    # Break ties between equally good actions at random
    return(np.random.choice(np.where(q_values == q_values[idx])[0].tolist()))

# Compute the best action in each state
def greedify_policy(state, pi, pi1, gamma, valueMap):
    q_values = np.zeros(len(actions))
    for idx, action in enumerate(actions):
        finalPosition, reward = actionValue(state, action)
        q_values[idx] += 1/4 * (reward + gamma * valueMap[finalPosition[0], finalPosition[1]])
    # Find the index of the action for which the q_value is maximum
    idx = q_values.argmax()
    pi[state[0], state[1]] = idx
    if (idx == 0):
        pi1[state[0], state[1]] = 'u'
    elif (idx == 1):
        pi1[state[0], state[1]] = 'd'
    elif (idx == 2):
        pi1[state[0], state[1]] = 'r'
    elif (idx == 3):
        pi1[state[0], state[1]] = 'l'

In [10]:
def improve_policy(pi, pi1, gamma, valueMap):
    policy_stable = True
    for state in states:
        old = pi[state].copy()
        # Greedify policy for state
        greedify_policy(state, pi, pi1, gamma, valueMap)
        if not np.array_equal(pi[state], old):
            policy_stable = False
    print(pi)
    print(pi1)
    return pi, pi1, policy_stable
In [11]:
def policy_iteration(gamma, theta):
    valueMap = np.zeros((gridSize, gridSize))
    pi = np.ones((gridSize, gridSize)) / 4
    pi1 = np.chararray((gridSize, gridSize))
    pi1[:] = 'a'
    policy_stable = False
    while not policy_stable:
        valueMap = policy_evaluate(states, actions, gamma, valueMap)
        pi, pi1, policy_stable = improve_policy(pi, pi1, gamma, valueMap)
    return valueMap, pi, pi1
In [12]:
theta=0.1
valueMap, pi,pi1 = policy_iteration(gamma, theta)
[[0. 3. 0. 0.]
[0. 0. 0. 0.]
[0. 0. 0. 1.]
[0. 0. 2. 0.]]
[[b'u' b'l' b'u' b'u']
[b'u' b'u' b'u' b'u']
[b'u' b'u' b'u' b'd']
[b'u' b'u' b'r' b'u']]
[[0. 3. 3. 0.]
[0. 0. 0. 1.]
[0. 0. 1. 1.]
[0. 2. 2. 0.]]
[[b'u' b'l' b'l' b'u']
[b'u' b'u' b'u' b'd']
[b'u' b'u' b'd' b'd']
[b'u' b'r' b'r' b'u']]
[[0. 3. 3. 1.]
[0. 0. 1. 1.]
[0. 0. 1. 1.]
[0. 2. 2. 0.]]
[[b'u' b'l' b'l' b'd']
[b'u' b'u' b'd' b'd']
[b'u' b'u' b'd' b'd']
[b'u' b'r' b'r' b'u']]
[[0. 3. 3. 1.]
[0. 0. 1. 1.]
[0. 0. 1. 1.]
[0. 2. 2. 0.]]
[[b'u' b'l' b'l' b'd']
[b'u' b'u' b'd' b'd']
[b'u' b'u' b'd' b'd']
[b'u' b'r' b'r' b'u']]
EXPERIMENT NO. 6
AIM : Write a program to do reinforcement learning in a grid world
Output:
The final valueMap and policy (pi1) show the optimal action to take from any state.
