Reinforcement Learning: Markov Decision Process
For discrete spaces, the random variable may take a countable number of
states.
Example: rolling a fair die, where the random variable X, representing the
side of the die facing upwards, can take the discrete values 1, 2, 3, 4, 5, 6, thus:
$$p(X = 1) = p(X = 2) = \dots = p(X = 6) = \frac{1}{6}$$
For continuous spaces, the random variable may take any of infinitely
many values from a continuum. Example: a Gaussian distribution.
Given two random variables X and Y , the joint probability describes the
probability that X takes on the value x and Y takes the value y. It is
denoted as:
p(x, y) = p(X = x and Y = y)
If the random variables are independent (e.g., rolling two dice, rolling one
die twice),
p(x, y) = p(x)p(y).
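The product rule for independent variables can be checked by enumeration. The following is a minimal sketch, assuming two fair dice whose 36 outcome pairs are equally likely; the names `p_single`, `p_joint`, and `is_independent` are illustrative and do not come from the slides.

```python
from fractions import Fraction
from itertools import product

FACES = range(1, 7)

# Marginal distribution of a single fair die: p(x) = 1/6 for every face.
p_single = {face: Fraction(1, 6) for face in FACES}

# Joint distribution of two dice, built by counting equally likely outcomes.
outcomes = list(product(FACES, FACES))
p_joint = {xy: Fraction(1, len(outcomes)) for xy in outcomes}


def is_independent(p_joint, p_x, p_y):
    """Return True if p(x, y) == p(x) * p(y) for every pair in the joint table."""
    return all(p_joint[(x, y)] == p_x[x] * p_y[y] for (x, y) in p_joint)


print(is_independent(p_joint, p_single, p_single))
```

Exact fractions are used instead of floats so that the equality test is not affected by rounding.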
The conditional probability describes the probability that X takes the
value x, given that Y is known to have taken the value y. For p(y) > 0,
the conditional probability is defined as:
$$p(x \mid y) = \frac{p(x, y)}{p(y)}$$
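As an illustration of the definition, the sketch below computes a conditional probability from a joint table. The choice of variables is an assumption made for the example only: X is the first of two fair dice and Y is the sum of both; the helper `conditional` is hypothetical.

```python
from fractions import Fraction
from itertools import product


def conditional(p_joint, x, y):
    """Return p(x | y) = p(x, y) / p(y) for a joint table keyed by (x, y)."""
    # Marginal p(y): sum the joint probabilities over all x.
    p_y = sum(p for (_, y_val), p in p_joint.items() if y_val == y)
    if p_y == 0:
        raise ValueError("p(y) must be positive")
    return p_joint.get((x, y), Fraction(0)) / p_y


# Joint table for X = first die, Y = sum of both dice (assumed example).
p_joint = {}
for d1, d2 in product(range(1, 7), repeat=2):
    key = (d1, d1 + d2)
    p_joint[key] = p_joint.get(key, Fraction(0)) + Fraction(1, 36)

# Probability that the first die shows 6, given that the sum is 8.
print(conditional(p_joint, 6, 8))
```

Five of the 36 outcomes give a sum of 8, and exactly one of them has a 6 on the first die, so the result is 1/5.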
Example: the throw of a fair die: each value (1, ..., 6) has a probability
of 1/6. The expected value is:
$$E(X) = \sum_{i=1}^{N} p(x_i)\, x_i = \sum_{i=1}^{6} \frac{1}{6}\, i = \frac{1}{6} \sum_{i=1}^{6} i = \frac{1 + \dots + 6}{6} = \frac{21}{6} = 3.5$$
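The same computation can be written as a short function. This is a minimal sketch, assuming the distribution is given as a mapping from value to probability; `expected_value` and `fair_die` are names chosen for illustration.

```python
from fractions import Fraction


def expected_value(dist):
    """Return sum_i p(x_i) * x_i for a distribution given as {x_i: p(x_i)}."""
    return sum(p * x for x, p in dist.items())


# Fair die: every face 1..6 has probability 1/6.
fair_die = {x: Fraction(1, 6) for x in range(1, 7)}

print(expected_value(fair_die))         # 7/2
print(float(expected_value(fair_die)))  # 3.5
```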
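[Figures omitted: plots over the die faces 1 to 6; a partial probability table for a die (faces 3, 4, 5, 6 with probabilities 0.20, 0.15, 0.10, 0.10; total 1.0); a state-transition diagram with states s0 and s1, action a2, and an edge labeled 1.0, −1.]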
S0 , A0 , R1 , S1 , A1 , R2 , S2 , . . .
Recall that the expected value of a random variable X with a discrete
probability distribution is
$$E(X) = \sum_{i=1}^{N} x_i\, p(x_i)$$
$$r(s, a, s') = E(R_t \mid S_{t-1} = s, A_{t-1} = a, S_t = s') = \sum_{r \in R} r\, \frac{p(s', r \mid s, a)}{p(s' \mid s, a)}$$
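The expected reward for a state, action, next-state triple can be computed directly from the dynamics p(s′, r | s, a). The sketch below assumes the dynamics are stored as a nested dictionary; the states, action, rewards, and probabilities are made-up numbers for illustration and are not taken from the slides.

```python
from fractions import Fraction

# Hypothetical dynamics: dynamics[(s, a)] maps (s_next, r) to p(s_next, r | s, a).
dynamics = {
    ("s0", "a1"): {
        ("s1", 1): Fraction(3, 5),
        ("s1", -1): Fraction(1, 5),
        ("s0", 0): Fraction(1, 5),
    },
}


def transition_prob(dynamics, s, a, s_next):
    """Return p(s_next | s, a) by summing p(s_next, r | s, a) over all rewards r."""
    return sum(p for (sn, _), p in dynamics[(s, a)].items() if sn == s_next)


def expected_reward(dynamics, s, a, s_next):
    """Return r(s, a, s_next) = sum_r r * p(s_next, r | s, a) / p(s_next | s, a)."""
    p_next = transition_prob(dynamics, s, a, s_next)
    if p_next == 0:
        raise ValueError("transition has zero probability")
    weighted = sum(r * p for (sn, r), p in dynamics[(s, a)].items() if sn == s_next)
    return weighted / p_next


print(expected_reward(dynamics, "s0", "a1", "s1"))
```

With these numbers, p(s1 | s0, a1) = 4/5 and the weighted reward sum is 2/5, so the expected reward is 1/2.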
π(a|s) = Pr{At = a | St = s}
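A stochastic policy of this form can be represented as a table of action probabilities per state. The following is a minimal sketch, assuming a small tabular policy with hypothetical states, actions, and probabilities; `sample_action` is an illustrative helper.

```python
import random

# Hypothetical tabular policy: policy[s][a] = pi(a | s).
policy = {
    "s0": {"a1": 0.25, "a2": 0.75},
    "s1": {"a1": 1.0, "a2": 0.0},
}


def sample_action(policy, state, rng):
    """Draw an action A_t with probability pi(a | S_t = state)."""
    actions = list(policy[state].keys())
    weights = list(policy[state].values())
    return rng.choices(actions, weights=weights, k=1)[0]


rng = random.Random(0)  # fixed seed so repeated runs give the same draws
print([sample_action(policy, "s0", rng) for _ in range(5)])
```

In state s1 the policy is deterministic, so every draw returns a1.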
The value function of state s under policy π, denoted vπ (s) and also
called state-value function for policy π, is the expected return when
starting in s and following the policy π thereafter:
$$v_\pi(s) = E_\pi(G_t \mid S_t = s) = E_\pi\left( \sum_{k=0}^{T} \gamma^k R_{t+k+1} \,\Big|\, S_t = s \right), \quad \forall\, s \in S$$
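The quantity inside the expectation is the discounted return. The sketch below computes it for a finite reward sequence and then averages the returns of several episodes that start in the same state. Averaging sampled returns is a simple Monte Carlo estimate, a technique not described in the slides; the reward sequences and the discount factor are made-up values.

```python
def discounted_return(rewards, gamma):
    """Return sum_k gamma**k * rewards[k], where rewards[k] stands for R_{t+k+1}."""
    g = 0.0
    # Work backwards: G = R_{t+1} + gamma * (R_{t+2} + gamma * (...)).
    for reward in reversed(rewards):
        g = reward + gamma * g
    return g


def monte_carlo_value(episodes, gamma):
    """Estimate v_pi(s) as the mean return over episodes that all start in s."""
    returns = [discounted_return(rewards, gamma) for rewards in episodes]
    return sum(returns) / len(returns)


# Hypothetical reward sequences observed after starting in the same state s.
episodes = [[1, 0, 2], [0, 1]]
print(discounted_return(episodes[0], gamma=0.5))  # 1.5
print(monte_carlo_value(episodes, gamma=0.5))     # 1.0
```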
Similarly, the action-value function for policy π, denoted qπ (s, a), is the
expected return when starting in s, taking action a, and following the
policy π thereafter:
qπ (s, a) = Eπ (Gt | St = s, At = a)
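The two functions are linked by a standard identity: the state value is the policy-weighted average of the action values, vπ(s) = Σ_a π(a|s) qπ(s, a). The sketch below evaluates this identity for one state, assuming tabular values; the policy probabilities and action values are hypothetical numbers.

```python
# Hypothetical tabular policy and action values for a single state "s0".
policy = {"s0": {"a1": 0.25, "a2": 0.75}}
q_values = {("s0", "a1"): 2.0, ("s0", "a2"): 4.0}


def state_value(policy, q_values, state):
    """Return v_pi(state) = sum_a pi(a | state) * q_pi(state, a)."""
    return sum(prob * q_values[(state, action)]
               for action, prob in policy[state].items())


print(state_value(policy, q_values, "s0"))  # 3.5
```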