
Assignment 12

Introduction to Machine Learning


Prof. B. Ravindran
1. Statement 1: Empirical error is always greater than generalisation error.
Statement 2: Training data and test data have different underlying(true) distributions.
Choose the correct option:
(a) Statement 1 is true. Statement 2 is true. Statement 2 is the correct reason for statement 1.
(b) Statement 1 is true. Statement 2 is true. Statement 2 is not the correct reason for statement 1.
(c) Statement 1 is true. Statement 2 is false.
(d) Both statements are false.
Sol. (d)
Empirical error can be greater than or less than the generalisation error. In fact, it is typically less than the generalisation error, since the model tends to perform better on the data it has seen during training. Statement 2 is also false: training and test data are assumed to be drawn from the same underlying (true) distribution.
2. Let P(Ai) = 2^(-i) for i = 1, ..., 5. Calculate the upper bound for P(A1 ∪ A2 ∪ A3 ∪ A4 ∪ A5) using the union bound (rounded to 3 decimal places).
(a) 0.937
(b) 0.984
(c) 0.969
(d) 1
Sol. (c)
P(A1 ∪ A2 ∪ A3 ∪ A4 ∪ A5) ≤ P(A1) + P(A2) + P(A3) + P(A4) + P(A5)
= 1/2 + 1/4 + 1/8 + 1/16 + 1/32
= 0.5 + 0.25 + 0.125 + 0.0625 + 0.03125
= 0.96875
≈ 0.969
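As a quick numerical check, here is a minimal Python sketch of the same arithmetic (just the union bound sum, nothing assignment-specific):

    probs = [2 ** -i for i in range(1, 6)]   # P(A_i) = 2^(-i) for i = 1..5
    upper_bound = sum(probs)                 # union bound: sum of the individual probabilities
    print(round(upper_bound, 3))             # 0.969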

3. Which of the following is/are the shortcomings of TD Learning that Q-learning resolves?
(a) TD learning cannot provide values for (state, action) pairs, limiting the ability to extract
an optimal policy directly.
(b) TD learning requires knowledge of the reward and transition functions, which is not
always available.
(c) TD learning is computationally expensive and slow compared to Q-learning.

(d) TD learning often suffers from high variance in value estimation, leading to unstable
learning.
(e) TD learning cannot handle environments with continuous state and action spaces effectively.
Sol. (a), (b), (d)
Refer to the lectures.

4. Given 100 hypothesis functions, each trained with 10^6 samples, what is the lower bound on
the probability that there does not exist a hypothesis function with error greater than 0.1?
(a) 1 − 200e^(−2·10^4)
(b) 1 − 100e^(10^4)
(c) 1 − 200e^(10^2)
(d) 1 − 200e^(−2·10^2)

Sol. (a)
k = 100, m = 10^6, γ = 0.1
Applying Hoeffding's inequality with the union bound over the k hypotheses:
P(∄ hi s.t. |E(hi) − Ẽ(hi)| > 0.1) ≥ 1 − 2k·e^(−2γ²m)
= 1 − 2·100·e^(−2·(0.1)²·10^6)
= 1 − 200e^(−2·10^4)
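A short Python sketch of the same bound, with the symbols k, m and γ mirroring the solution above (the exponential term underflows to zero numerically, so the lower bound is effectively 1):

    import math

    k = 100          # number of hypothesis functions
    m = 10 ** 6      # training samples per hypothesis
    gamma = 0.1      # allowed gap between empirical and generalisation error

    failure_bound = 2 * k * math.exp(-2 * gamma ** 2 * m)   # union bound over Hoeffding tails
    lower_bound = 1 - failure_bound                         # option (a): 1 - 200e^(-2*10^4)
    print(lower_bound)                                      # prints 1.0 (the exp term underflows to 0)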

5. The VC dimension of a pair of squares is:


(a) 3
(b) 4
(c) 5
(d) 6
Sol. (c)
For binary classification on 5 samples, there will exist at least one class for which the number
of samples is less than or equal to two. Use the two squares to bound these.

For 6 samples, consider the following labelled arrangement on a grid:

x o x
o x o

Two squares cannot classify these correctly, so 6 points cannot be shattered.

COMPREHENSION:
For the rest of the questions, we will follow a simplistic game and see how a Reinforcement
Learning agent can learn to behave optimally in it.
This is our game: the states are arranged in a line as LE - X1 - X2 - Start - X3 - X4 - RE.

At the start of the game, the agent is in the Start state and can choose to move left or right
at each turn. If it reaches the right end (RE), it wins, and if it reaches the left end (LE), it loses.

Because we love maths so much, instead of saying the agent wins or loses, we will say that the
agent gets a reward of +1 at RE and a reward of −1 at LE. The objective of the agent is then
simply to maximise the reward it obtains!
For each state, we define a variable that will store its value. The value of a state will help
the agent determine how to behave later. First we will learn this value.

Let V be the mapping from state to its value.


Initially,
V(LE) = -1
V(X1) = V(X2) = V(X3) = V(X4) = V(Start) = 0
V(RE) = +1
For each state S ∈ {X1, X2, X3, X4, Start}, with SL being the state to its immediate left and
SR being the state to its immediate right, repeatedly apply

V (S) = 0.9 × max(V (SL), V (SR))

till V converges (does not change for any state).
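To make the update concrete, here is a minimal Python sketch of one left-to-right sweep of this rule over the chain described above (an illustration only, not part of the assignment; the dictionary V mirrors the initialisation given):

    # One application of V(S) = 0.9 * max(V(SL), V(SR)) to each non-terminal state,
    # sweeping the chain LE - X1 - X2 - Start - X3 - X4 - RE from left to right.
    chain = ["LE", "X1", "X2", "Start", "X3", "X4", "RE"]
    V = {"LE": -1.0, "X1": 0.0, "X2": 0.0, "Start": 0.0, "X3": 0.0, "X4": 0.0, "RE": 1.0}

    for i, s in enumerate(chain):
        if s in ("LE", "RE"):                 # terminal values stay fixed
            continue
        left, right = chain[i - 1], chain[i + 1]
        V[s] = 0.9 * max(V[left], V[right])

    print(V["X4"], V["X1"])                   # 0.9 and 0.0 after one sweep (see Q6 and Q7)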

6. What is V(X4) after one application of the given formula?

(a) 1
(b) 0.9
(c) 0.81
(d) 0

Sol. (b)
V (X4) = 0.9 × max(V (X3), V (RE))
V (X4) = 0.9 × max(0, 1)
V (X4) = 0.9 × 1
V (X4) = 0.9

7. What is V(X1) after one application of the given formula?

(a) -1
(b) -0.9
(c) -0.81
(d) 0
Sol. (d)
V (X1) = 0.9 × max(V (LE), V (X2))
V (X1) = 0.9 × max(−1, 0)
V (X1) = 0.9 × 0
V (X1) = 0
8. What is V(X1) after V converges?
(a) 0.54
(b) -0.9
(c) 0.63
(d) 0
Sol. (a)
This is the sequence of changes in V:
V (X4) = 0.9 → V (X3) = 0.81 → V (Start) = 0.72 → V (X2) = 0.63 → V (X1) = 0.54
Final value for X1 is 0.54.
9. The behavior of an agent is called a policy. Formally, a policy is a mapping from states to
actions. In our case, we have two actions: left and right. We will denote the action for our
policy as A.
Clearly, the optimal policy would be to choose action right in every state. Which of the
following can we use to mathematically describe our optimal policy using the learnt V?

For options (c) and (d), T is the transition function defined as T(state, action) = next state.
(More than one option may apply.)
(a) A = Left if V(SL) > V(SR), Right otherwise
(b) A = Left if V(SR) > V(SL), Right otherwise
(c) A = arg max_a V(T(S, a))
(d) A = arg min_a V(T(S, a))
Sol. (a), (c)
The value function V is higher for states that are closer to the winning state. Therefore, we
take steps in the direction that maximises V.
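A minimal Python sketch of option (c), assuming we have the learnt values V (as a dictionary) and a transition function T(state, action) -> next state (both are hypothetical stand-ins for whatever the agent has learnt):

    def greedy_action(S, V, T, actions=("Left", "Right")):
        # Option (c): pick the action whose successor state has the highest learnt value.
        return max(actions, key=lambda a: V[T(S, a)])

    # e.g. greedy_action("Start", V, T) returns "Right" once V has been learnt,
    # since the value of the state to the right is higher.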

10. In games like Chess or Ludo, the transition function is known to us. But what about Counter-Strike
or Mortal Kombat or Super Mario? In games where we do not know T, we can only
query the game simulator with the current state and action, and it returns the next state. This
means we cannot directly take an argmax or argmin over V(T(S, a)). Therefore, learning the value
function V is not sufficient to construct a policy. Which of these could we do to overcome this?
(More than one may apply.)
Assume there exists a method to do each option. You have to judge whether doing it solves
the stated problem.
(a) Directly learn the policy.
(b) Learn a different function which stores value for state-action pairs (instead of only state
like V does).
(c) Learn T along with V.
(d) Run a random agent repeatedly till it wins. Use this as the winning policy.
Sol. (a), (b), (c)
(a) - If we learn the policy itself, the problem is solved.
(b) - Given a function Q(s, a), we can use the policy A = arg max_a Q(S, a), which needs no transition function.
(c) - If we have T and V, we can do what we saw in the previous question.
(d) - If the agent memorises a single sequence of actions as its policy, it will fail as soon as any
state it encounters deviates from that sequence, which can easily happen in a stochastic
environment (i.e. when transitions are probabilistic).
