Reinforcement Learning Exam
1. TD Learning
3. Dynamic Programming
4. Degree of bootstrapping
7. Monte Carlo
9. Exhaustive Search
QUESTION 4 [840] RLI Machine Learning ?Multiple-Choice[1] ► Grade: 10
What does the Markov Property state?
1. All the information needed to predict the future is contained in the state representation
2. All state-action-state transitions are deterministic
3. That stochastic policies are not required
4. That the state of the system is mathematically determined
QUESTION 5 [2170] RLI Terminology & Definitions ?Multiple-Choice[1] ► Grade: 10
When ______, we use the system dynamics (model) to generate simulated experience and use it to refit the value functions or the policy
1. Planning
2. Learning
3. Predicting
4. Controlling
Planning: We use the system dynamics (model) to generate simulated experience and use it to refit the value functions or the policy.
In a simplistic view, the difference between learning and planning is that learning works from real experience generated by the environment, while planning works from simulated experience generated by a model.
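The distinction can be sketched in code with a toy tabular setup (the two-action assumption and all names below are illustrative, not from the exam): learning updates Q from a real transition, while planning refits Q from transitions replayed out of a learned model, Dyna-style.

```python
import random

ACTIONS = (0, 1)
Q = {}        # (state, action) -> value
model = {}    # (state, action) -> (reward, next_state), filled from real experience

def learn(s, a, r, s_next, alpha=0.1, gamma=0.9):
    """Learning: update Q from a *real* transition, and record it in the model."""
    best_next = max(Q.get((s_next, b), 0.0) for b in ACTIONS)
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (r + gamma * best_next - Q.get((s, a), 0.0))
    model[(s, a)] = (r, s_next)

def plan(n_steps, alpha=0.1, gamma=0.9):
    """Planning: refit Q from *simulated* transitions replayed from the model."""
    for _ in range(n_steps):
        (s, a), (r, s_next) = random.choice(list(model.items()))
        best_next = max(Q.get((s_next, b), 0.0) for b in ACTIONS)
        Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
```

A typical loop would act in the environment, call `learn` on the real transition, then run a few `plan` steps on the model.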
QUESTION 6 [1500] RLI Reinforcement Learning ?Multiple-Choice[1] ► Grade: 10
In a Reinforcement Learning approach, the agent is trying to maximize:
1. the expected total discounted reward
2. the rewards
3. the return
4. the policy gradients
5. the options for actions
QUESTION 7 [1410] RLI Value Functions and Bellman Equations ?Multiple-Choice[1] ► Grade: 10
Select the equation that correctly relates q∗ to v∗ using the four-argument function p.
1. q∗(s, a) = ∑_{s′,r} p(s′, r|a, s)[r + v∗(s′)]
2. q∗(s, a) = ∑_{s′,r} p(s′, r|a, s) γ[r + v∗(s′)]
3. q∗(s, a) = ∑_{s′,r} p(s′, r|a, s)[r + γ v∗(s′)]
2. Policy-Based
3. Actor-Critic
4. Monte Carlo Agent
5. Temporal Differences Method
6. Dyna Model
7. Action-Based
QUESTION 11 [1950] RLI Terminology & Definitions ?Multiple-Choice[1] ► Grade: 10
ai.twilightparadox.com:2024/test_submit/ 1/6
4/16/24, 2:52 PM EXAM 2024-04-16 14:52:46.741374
Usually denoted as A(s, a), this function is a measure of how good or bad a certain action a is as a decision given a certain state. It is defined mathematically as:
A(s, a) = r(s, a) − r(s)
where r(s, a) is the expected reward of action a from state s, and r(s) is the expected reward of the entire state s, before an action was selected. It can also be viewed as:
A(a) = R(a) + c · √(ln t / N(a))
...where R(a) is the expected overall reward of action a, t is the number of steps taken (how many actions were selected overall), N(a) is the number of times action a was selected, and c is a configurable hyperparameter. This method is also sometimes referred to as “exploration through optimism”, as it gives less-explored actions a higher value, encouraging the model to select them.
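The second formula described above matches the standard upper-confidence-bound (UCB) selection rule; a minimal sketch, assuming a small discrete action set (the function and variable names are hypothetical):

```python
import math

def ucb_select(R, N, t, c=2.0):
    """R[a]: mean reward of action a so far; N[a]: times a was selected; t: total steps.
    Unvisited actions get an infinite bonus, so each is tried at least once."""
    def score(a):
        if N[a] == 0:
            return float("inf")
        return R[a] + c * math.sqrt(math.log(t) / N[a])
    return max(range(len(R)), key=score)
```

Less-explored actions (small N[a]) receive a larger optimism bonus, which is exactly the behaviour the definition above describes.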
QUESTION 15 [1945] RLI Terminology & Definitions ?Multiple-Choice[1] ► Grade: 10
When attempting to solve a Reinforcement Learning problem, there are two main methods one can choose from: calculating the Value Functions or Q-Values of each state and choosing actions according to those, or directly computing a policy, which defines the probability that each action should be taken given the current state, and acting according to it. The algorithms that combine the two methods in order to create a more robust
method are called...
1. Actor-Critic
2. Monte Carlo methods
3. Dynamic programming algorithms
4. Integrated algorithms
QUESTION 16 [1455] RLI Dynamic Programming ?Multiple-Choice[1] ► Grade: 10
The value of any state under an optimal policy is ______ the value of that state under a non-optimal policy.
1. Strictly greater than
2. Greater than or equal to
3. Strictly less than
4. Less than or equal to
QUESTION 17 [2010] RLI Terminology & Definitions ?Multiple-Choice[1] ► Grade: 10
As Reinforcement Learning tasks have no pre-generated training sets which they can learn from, the Agent must keep records of all the state-transitions it encountered so it can learn from them later. The memory-buffer
used to store this is often referred to as ...
1. Experience Replay
2. History
3. Trajectory
4. Eligibility Traces
The memory-buffer used to store this is often referred to as Experience Replay. There are several types and architectures of these memory buffers, but two very common ones are cyclic memory buffers (which ensure the Agent keeps training on its recent behavior rather than on experience that may no longer be relevant) and reservoir-sampling-based memory buffers (which guarantee that every recorded state-transition has an equal probability of being inserted into the buffer).
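Both buffer types can be sketched in a few lines (the class names below are made up for illustration):

```python
import random
from collections import deque

class CyclicReplay:
    """Cyclic buffer: once full, the oldest transition is overwritten first."""
    def __init__(self, capacity):
        self.buf = deque(maxlen=capacity)
    def add(self, transition):
        self.buf.append(transition)
    def sample(self, k):
        return random.sample(list(self.buf), k)

class ReservoirReplay:
    """Reservoir sampling: every transition ever seen has an equal
    probability of ending up in the buffer."""
    def __init__(self, capacity):
        self.buf, self.capacity, self.seen = [], capacity, 0
    def add(self, transition):
        self.seen += 1
        if len(self.buf) < self.capacity:
            self.buf.append(transition)
        else:
            j = random.randrange(self.seen)
            if j < self.capacity:
                self.buf[j] = transition
```

The cyclic buffer biases toward recency; the reservoir buffer preserves a uniform sample over the whole history.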
QUESTION 18 [1335] RLI MDPs ?Multiple-Choice[1] ► Grade: 10
How does the magnitude of the discount factor γ affect learning?
1. With a larger discount factor the agent is more far-sighted and considers rewards farther into the future.
2. The magnitude of the discount factor has no effect on the agent.
3. With a smaller discount factor the agent is more far-sighted and considers rewards farther into the future.
QUESTION 19 [1310] RLI Sequential Decision-Making ?Multiple-Choice[1] ► Grade: 10
If exploration is so great, why could an epsilon of 0.0 (a greedy agent) perform better than an epsilon of 0.4?
1. Epsilon of 0.0 is greedy, thus it will always choose the optimal arm.
2. Epsilon of 0.4 doesn’t explore often enough to find the optimal action.
3. Epsilon of 0.4 explores too often that it takes many sub-optimal actions causing it to do worse over the long term.
QUESTION 20 [1415] RLI Value Functions and Bellman Equations ?Multiple-Choice[1] ► Grade: 10
Write a policy π∗ in terms of q∗.
1. π∗(a|s) = q∗(s, a)
2. π∗(a|s) = max_{a′} q∗(s, a′)
3. π∗(a|s) = 1 if a = argmax_{a′} q∗(s, a′), else 0
QUESTION 21 [1400] RLI Value Functions and Bellman Equations ?Multiple-Choice[1] ► Grade: 10
Does adding a constant to all rewards change the set of optimal policies in continuing tasks?
1. Yes, adding a constant to all rewards changes the set of optimal policies.
2. No, as long as the relative differences between rewards remain the same, the set of optimal policies is the same.
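A quick numeric check of the answer, using a hypothetical finite reward sequence: shifting every reward by a constant C adds the same policy-independent bonus C·∑γ^k to every return (C/(1−γ) in the infinite-horizon limit), so the ranking of policies is unchanged.

```python
GAMMA, C = 0.9, 5.0

def ret(rewards, gamma):
    """Discounted return of a (finite prefix of a) reward sequence."""
    return sum(gamma ** k * r for k, r in enumerate(rewards))

rewards = [1.0, 0.0, 2.0, 1.0]       # hypothetical rewards under some policy
base = ret(rewards, GAMMA)
shifted = ret([r + C for r in rewards], GAMMA)
# every policy gains exactly the same bonus, so their ordering is unchanged
bonus = C * sum(GAMMA ** k for k in range(len(rewards)))
assert abs(shifted - (base + bonus)) < 1e-9
```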
QUESTION 22 [2325] RLI Model Free Learning ?Multiple-Choice[1] ► Grade: 10
Does MC use bootstrapping?
1. No
2. Yes
3. Yes, but only in "first visit" MC
4. Yes, but only in "every visit" MC
No, MC learns from complete episodes, no bootstrapping.
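A sketch of first-visit Monte Carlo prediction illustrating the point: values are averages of complete episode returns, never bootstrapped from other estimates (the episode encoding below is an assumption for illustration).

```python
from collections import defaultdict

def mc_first_visit(episodes, gamma=0.9):
    """episodes: list of trajectories, each a list of (state, reward) pairs,
    where reward is the one received after leaving that state."""
    returns = defaultdict(list)
    for episode in episodes:
        # compute the return following each time step, working backwards
        G = [0.0] * (len(episode) + 1)
        for t in range(len(episode) - 1, -1, -1):
            G[t] = episode[t][1] + gamma * G[t + 1]
        seen = set()
        for t, (s, _) in enumerate(episode):
            if s not in seen:            # first visit only
                seen.add(s)
                returns[s].append(G[t])
    return {s: sum(g) / len(g) for g_s, g in [] or returns.items() for s in [g_s]}
```

Note that every update waits for the episode to terminate: no estimate of V(s′) ever appears on the right-hand side.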
QUESTION 23 [2175] RLI Terminology & Definitions ?Multiple-Choice[1] ► Grade: 10
...or when no uncertainty exists (meaning, probabilities are either 1 or 0):
Q∗(s, a) = r(s, a) + γ max_{a′} Q∗(s′, a′)
formulates...
1. the policy gradient theorem
2. the expected likelihood of the loss function
3. the gradient of the optimal policy
4. the gradient-based formulation of the Bellman equations
5. None of these options
QUESTION 27 [1375] RLI Value Functions and Bellman Equations ?Multiple-Choice[1] ► Grade: 10
A policy is a function which maps ______ to ______.
1. Actions to probability distributions over values.
2. Actions to probabilities.
3. States to values.
4. States to probability distributions over actions.
5. States to actions.
QUESTION 28 [880] RLI General concepts ?Multiple-Choice[1] ► Grade: 10
In machine learning, we basically differentiate between ______ and ______ feedback
1. evaluative, instructive
2. evaluative, non-evaluative
3. instructive, declarative
4. informative, declarative
QUESTION 29 [1450] RLI Value Functions and Bellman Equations ?Multiple-Choice[1] ► Grade: 10
Consider an episodic MDP with one state and two actions (left and right). The left action has stochastic reward 1 with probability p and 3 with probability 1-p. The right action has stochastic reward 0 with probability q and
10 with probability 1-q. What relationship between p and q makes the actions equally optimal?
1. 7 + 3p = -10q
2. 7 + 3p = 10q
3. 7 + 2p = 10q
4. 13 + 3p = -10q
5. 13 + 2p = 10q
6. 13 + 2p = -10q
7. 13 + 3p = 10q
8. 7 + 2p = -10q
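Working the algebra behind this question: E[left] = p·1 + (1−p)·3 = 3 − 2p and E[right] = q·0 + (1−q)·10 = 10 − 10q; setting the two equal rearranges to 7 + 2p = 10q. A quick numeric check:

```python
def expected_left(p):
    """Left arm: reward 1 with probability p, 3 with probability 1 - p."""
    return 1 * p + 3 * (1 - p)        # = 3 - 2p

def expected_right(q):
    """Right arm: reward 0 with probability q, 10 with probability 1 - q."""
    return 0 * q + 10 * (1 - q)       # = 10 - 10q

# equality of the two expectations rearranges to 7 + 2p = 10q
for p in (0.0, 0.25, 1.0):
    q = (7 + 2 * p) / 10
    assert abs(expected_left(p) - expected_right(q)) < 1e-9
```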
QUESTION 30 [1980] RLI Terminology & Definitions ?Multiple-Choice[1] ► Grade: 10
The use of a Reinforcement Learning algorithm with a deep neural network as an approximator for the learning part. This is usually done in order to cope with problems where the number of
possible states and actions scales fast, and an exact solution is no longer feasible.
1. Deep Reinforcement Learning
2. DQN
3. Deep Q-Learning
4. DDPG (Deep Deterministic Policy Gradient)
QUESTION 31 [1430] RLI Value Functions and Bellman Equations ?Multiple-Choice[1] ► Grade: 10
Every finite Markov decision process has ______.
1. A stochastic optimal policy
2. A unique optimal policy
3. A deterministic optimal policy
4. An asymptotic optimal value function
The book by Sutton and Barto takes the existence of optimal policies for granted and leaves this question unanswered. The proof is not for the faint-hearted, however. To prove the existence of the optimal policy in finite MDPs, one must prove the following two statements:
(1) the set of Bellman optimality equations has solutions, and
(2) one of its solutions has values greater than or equal to the values of the other solutions in all states.
The proof involves showing that the Bellman optimality operator is a contraction mapping in the infinity norm and applying the Banach fixed-point theorem.
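The contraction property can be checked numerically on a made-up two-state MDP (the transition matrix and rewards below are arbitrary illustrations, using the fixed-policy Bellman operator for simplicity):

```python
GAMMA = 0.9
P = [[0.9, 0.1],
     [0.2, 0.8]]        # row-stochastic transition matrix (made up)
R = [1.0, 0.0]          # rewards (made up)

def bellman(v):
    """One application of the Bellman operator for this fixed dynamics."""
    return [R[s] + GAMMA * sum(P[s][t] * v[t] for t in range(2)) for s in range(2)]

def dist(u, v):
    """Infinity-norm distance."""
    return max(abs(a - b) for a, b in zip(u, v))

# the operator shrinks distances by at least a factor of GAMMA...
v1, v2 = [0.0, 0.0], [10.0, -5.0]
assert dist(bellman(v1), bellman(v2)) <= GAMMA * dist(v1, v2) + 1e-12

# ...so by the Banach fixed-point theorem, iterating converges to a unique fixed point
v = [0.0, 0.0]
for _ in range(300):
    v = bellman(v)
assert dist(v, bellman(v)) < 1e-6
```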
QUESTION 32 [2050] RLI Terminology & Definitions ?Multiple-Choice[1] ► Grade: 10
______ is an off-policy Reinforcement Learning algorithm
1. Q-Learning
2. SARSA
3. Monte Carlo Method
4. REINFORCE
Q-Learning is an off-policy Reinforcement Learning algorithm, considered one of the most basic. In its most simplified form, it uses a table to store the Q-Values of all possible state-action pairs. It updates this table using the Bellman equation, while action selection is usually made with an ε-greedy policy.
In its simplest form (no uncertainties in state-transitions and expected rewards), the update rule of Q-Learning is:
Q(s, a) = r(s, a) + γ max_{a′} Q(s′, a′)
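The simplified update rule can be run to convergence on a toy deterministic chain (the three-state environment below is invented for illustration):

```python
# Tabular Q-Learning with the simplified (deterministic) update rule, on a
# toy 3-state chain: 0 -> 1 -> 2, where reaching state 2 pays +1.
GAMMA = 0.9
ACTIONS = ["go"]
Q = {(s, a): 0.0 for s in range(3) for a in ACTIONS}

def step(s, a):
    """Invented deterministic dynamics for illustration."""
    s_next = min(s + 1, 2)
    reward = 1.0 if (s != 2 and s_next == 2) else 0.0
    return reward, s_next

for _ in range(10):                      # a few sweeps suffice on this chain
    for s in range(2):                   # states 0 and 1 (state 2 is terminal)
        r, s_next = step(s, "go")
        Q[(s, "go")] = r + GAMMA * max(Q[(s_next, a)] for a in ACTIONS)
# values propagate backwards: Q(1, go) = 1.0 and Q(0, go) = GAMMA * 1.0 = 0.9
```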
2. association of the actions with the situations in which they are best
3. exhaustive exploration of the search space
4. correlation between the states and their associated actions
QUESTION 35 [1360] RLI MDPs ?Multiple-Choice[1] ► Grade: 10
What is the reward hypothesis?
1. That all of what we mean by goals and purposes can be well thought of as the minimization of the expected value of the cumulative sum of a received scalar signal (called reward)
2. That all of what we mean by goals and purposes can be well thought of as the maximization of the expected value of the cumulative sum of a received scalar signal (called reward)
3. Ignore rewards and find other signals.
4. Always take the action that gives you the best reward at that point.
QUESTION 36 [925] RLI Reinforcement Learning ?Multiple-Choice[2] ► Grade: 20
For variance reduction we could use techniques such as (mark 2 answers):
1. Control variates
2. Importance sampling
3. Over-sampling
4. Under-sampling
QUESTION 37 [2035] RLI Terminology & Definitions ?Multiple-Choice[1] ► Grade: 10
Algorithms that take into account the policy which yielded past state-action decisions are referred to as ______ algorithms
1. On-Policy
2. Greedy
3. Iterative
4. model-based
Every Reinforcement Learning algorithm must follow some policy in order to decide which actions to perform at each state. Still, the learning procedure of the algorithm doesn’t have to take that policy into account while learning.
Algorithms that take into account the policy which yielded past state-action decisions are referred to as on-policy algorithms, while those ignoring it are known as off-policy.
A well-known off-policy algorithm is Q-Learning, as its update rule uses the action that will yield the highest Q-Value, while the actual policy used might restrict that action or choose another. The on-policy variation of Q-Learning is known as Sarsa, where the update rule uses the action chosen by the followed policy.
QUESTION 38 [2085] RLI Terminology & Definitions ?Multiple-Choice[1] ► Grade: 10
______ is a learning method which combines both Dynamic Programming and Monte Carlo principles; it learns “on the fly” similarly to Monte Carlo, yet updates its estimates like Dynamic Programming.
1. Temporal-Difference (TD)
2. Bootstrapping
3. Dynamic Random Sampling
4. Value iteration
Temporal Difference is a learning method which combines both Dynamic Programming and Monte Carlo principles; it learns “on the fly” similarly to Monte Carlo, yet updates its estimates like Dynamic Programming. One of the simplest Temporal Difference algorithms is known as one-step TD or TD(0). It updates the Value Function according to the following update rule:
V(s_t) = V(s_t) + α[r_{t+1} + γV(s_{t+1}) − V(s_t)]
where V is the Value Function, s is the state, r is the reward, γ is the discount factor, α is a learning rate, t is the time-step, and the ‘=’ sign is used as an update operator rather than equality. The term found in the square brackets is known as the temporal difference error.
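The TD(0) update rule translates directly to code (the state names and constants below are arbitrary):

```python
GAMMA, ALPHA = 0.9, 0.5                  # discount factor and learning rate
V = {"A": 0.0, "B": 0.0, "end": 0.0}     # hypothetical states

def td0_update(s, r, s_next):
    """One-step TD: the bracketed term in the rule above is the TD error."""
    td_error = r + GAMMA * V[s_next] - V[s]
    V[s] += ALPHA * td_error
    return td_error
```

Each call moves V(s) a fraction α of the way toward the bootstrapped target r + γV(s′), without waiting for the episode to end.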
QUESTION 39 [2955] RLI Markov decision problems and Dynamic programming ?Multiple-Choice[1] ► Grade: 10
Optimality criterion defines the criterion to select between ______
1. actions
2. discounted rewards
3. Q-values
4. Policy values
QUESTION 40 [2270] RLI MDP Planning ?Multiple-Choice[1] ► Grade: 10
Q-Learning is an example of _____ learning algorithm.
1. model-free
2. model-based
3. on-policy
4. dynamic programming
Q-Learning is an example of a model-free learning algorithm.
It does not assume that the agent knows anything about the state-transition and reward models; instead, the agent discovers which actions are good and bad by trial and error.
The basic idea of Q-Learning is to approximate the state-action Q-function from the samples of Q(s, a) observed during interaction with the environment. This approach is known as Temporal-Difference (sometimes called Time-Difference) Learning.
QUESTION 41 [810] RLI Reinforcement Learning ?Multiple-Choice[1] ► Grade: 10
The Bellman Optimality Equation is:
1. q∗(s, a) = ∑_{s′,r} p(s′, r|s, a)[r + γ max_{a′} q∗(s′, a′)]
2. v_π(s) = ∑_a π(a|s) ∑_{s′,r} p(s′, r|s, a)[r + γ v_π(s′)]
3. v∗(s) = max_a ∑_{s′,r} p(s′, r|s, a)[r + γ v∗(s′)]
4. neither of these
QUESTION 42 [2020] RLI Terminology & Definitions ?Multiple-Choice[1] ► Grade: 10
Reinforcement Learning tasks have no pre-generated training sets which they can learn from — they create their own experience and learn “on the fly”. To be able to do so, the Agent needs to try many different actions in many different states in order to learn all available possibilities and find the path which will maximize its overall reward, and it must use the information it learned to do so. The latter is known as ______, as the Agent uses its knowledge to maximize the rewards it receives.
1. Exploitation
2. Policy execution
3. Utilization
4. Reward cropping
Reinforcement Learning tasks have no pre-generated training sets which they can learn from. They create their own experience and learn on the fly.
To be able to do so, the Agent needs to try many different actions in many different states in order to try and learn all available possibilities and find the path which will maximize its overall reward; this is known as Exploration, as the
Agent explores the Environment.
On the other hand, if all the Agent does is explore, it will never maximize the overall reward; it must also use the information it learned. This is known as Exploitation, as the Agent exploits its knowledge to maximize the rewards it receives.
The trade-off between the two is one of the greatest challenges of Reinforcement Learning problems, as the two must be balanced in order to allow the Agent to both explore the environment enough, but also exploit what it learned and
repeat the most rewarding path it found.
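The exploration/exploitation balance described above is commonly implemented with an ε-greedy rule; a minimal sketch:

```python
import random

def epsilon_greedy(q_values, epsilon):
    """With probability epsilon, explore (random action); otherwise exploit."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))                   # explore
    return max(range(len(q_values)), key=lambda a: q_values[a])  # exploit
```

With ε = 0.0 this is the purely greedy agent; larger ε trades more exploration for less exploitation.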
QUESTION 43 [1330] RLI MDPs ?Multiple-Choice[1] ► Grade: 10
If the reward is always +1, what is the sum of the discounted infinite return when γ < 1?
G_t = ∑_{k=0}^{∞} γ^k R_{t+k+1}
1. G_t = 1 / (1 − γ)
2. G_t = γ / (1 − γ)
3. Infinity.
4. G_t = 1 ∗ γ^k
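With R = +1 always, the return is the geometric series ∑_{k≥0} γ^k = 1/(1 − γ). A quick numeric check of the closed form:

```python
GAMMA = 0.9
closed_form = 1 / (1 - GAMMA)                    # = 10 for gamma = 0.9
partial = sum(GAMMA ** k for k in range(1000))   # truncated series
assert abs(partial - closed_form) < 1e-6
```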
QUESTION 44 [1405] RLI Value Functions and Bellman Equations ?Multiple-Choice[1] ► Grade: 20
Select the equation that correctly relates v∗ to q∗. Assume π is the uniform random policy.
1. v∗(s) = max_a q∗(s, a)
2. v∗(s) = ∑_{a,r,s′} π(a|s)p(s′, r|s, a)[r + q∗(s′)]
3. v∗(s) = ∑_{a,r,s′} π(a|s)p(s′, r|s, a)[r + γ q∗(s′)]
4. v∗(s) = ∑_{a,r,s′} π(a|s)p(s′, r|s, a) q∗(s′)
© 2024 - José Manuel Rey