Reinforcement Learning Exam
1. TD Learning
3. Dynamic Programming
4. Degree of bootstrapping
7. Monte Carlo
9. Exhaustive Search
QUESTION 4 [840] RLI Machine Learning ?Multiple-Choice[1] ► Grade: 10
What does the Markov Property state?
1. All the information needed to predict the future is contained in the state representation
2. All state-action-state transitions are deterministic
3. That stochastic policies are not required
4. That the state of the system is mathematically determined
QUESTION 5 [2170] RLI Terminology & Definitions ?Multiple-Choice[1] ► Grade: 10
When ______, we use the system dynamics (model) to generate simulated experience and use it to refit the value functions or the policy
1. Planning
2. Learning
3. Predicting
4. Controlling
Planning: We use the system dynamics (model) to generate simulated experience and use it to refit the value functions or the policy.
In a simplistic view, the difference between learning and planning is that learning works from real experience generated by the environment, while planning works from simulated experience generated by a model.
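The distinction can be sketched in code with a toy tabular setup (the two-action assumption and all names below are illustrative, not from the exam): learning updates Q from a real transition, while planning refits Q from transitions replayed out of a learned model, Dyna-style.

```python
import random

ACTIONS = (0, 1)
Q = {}        # (state, action) -> value
model = {}    # (state, action) -> (reward, next_state), filled from real experience

def learn(s, a, r, s_next, alpha=0.1, gamma=0.9):
    """Learning: update Q from a *real* transition, and record it in the model."""
    best_next = max(Q.get((s_next, b), 0.0) for b in ACTIONS)
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (r + gamma * best_next - Q.get((s, a), 0.0))
    model[(s, a)] = (r, s_next)

def plan(n_steps, alpha=0.1, gamma=0.9):
    """Planning: refit Q from *simulated* transitions replayed from the model."""
    for _ in range(n_steps):
        (s, a), (r, s_next) = random.choice(list(model.items()))
        best_next = max(Q.get((s_next, b), 0.0) for b in ACTIONS)
        Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
```

A typical loop would act in the environment, call `learn` on the real transition, then run a few `plan` steps on the model.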
QUESTION 6 [1500] RLI Reinforcement Learning ?Multiple-Choice[1] ► Grade: 10
In a Reinforcement Learning approach, the agent is trying to maximize:
1. the expected total discounted reward
2. the rewards
3. the return
4. the policy gradients
5. the options for actions
QUESTION 7 [1410] RLI Value Functions and Bellman Equations ?Multiple-Choice[1] ► Grade: 10
Select the equation that correctly relates q∗ to v∗ using the four-argument function p.
1. q∗(s, a) = ∑_{s′,r} p(s′, r|a, s)[r + v∗(s′)]
2. q∗(s, a) = ∑_{s′,r} p(s′, r|a, s) γ[r + v∗(s′)]
3. q∗(s, a) = ∑_{s′,r} p(s′, r|a, s)[r + γ v∗(s′)]
2. Policy-Based
3. Actor-Critic
4. Monte Carlo Agent
5. Temporal Differences Method
6. Dyna Model
7. Action-Based
QUESTION 11 [1950] RLI Terminology & Definitions ?Multiple-Choice[1] ► Grade: 10
ai.twilightparadox.com:2024/test_submit/ 1/6
4/16/24, 2:52 PM EXAM 2024-04-16 14:52:46.741374
Usually denoted as A(s, a), this function is a measure of how good or bad a certain action a is as a decision given a certain state. It is defined mathematically as:
A(s, a) = r(s, a) − r(s)
where r(s, a) is the expected reward of action a from state s, and r(s) is the expected reward of the entire state s, before an action was selected. It can also be viewed as:
A(a) = R(a) + c · √(ln t / N(a))
...where R(a) is the expected overall reward of action a, t is the number of steps taken (how many actions were selected overall), N(a) is the number of times action a was selected, and c is a configurable hyperparameter. This method is also sometimes referred to as “exploration through optimism”, as it gives less-explored actions a higher value, encouraging the model to select them.
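The second formula described above matches the standard upper-confidence-bound (UCB) selection rule; a minimal sketch, assuming a small discrete action set (the function and variable names are hypothetical):

```python
import math

def ucb_select(R, N, t, c=2.0):
    """R[a]: mean reward of action a so far; N[a]: times a was selected; t: total steps.
    Unvisited actions get an infinite bonus, so each is tried at least once."""
    def score(a):
        if N[a] == 0:
            return float("inf")
        return R[a] + c * math.sqrt(math.log(t) / N[a])
    return max(range(len(R)), key=score)
```

Less-explored actions (small N[a]) receive a larger optimism bonus, which is exactly the behaviour the definition above describes.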
QUESTION 15 [1945] RLI Terminology & Definitions ?Multiple-Choice[1] ► Grade: 10
When attempting to solve a Reinforcement Learning problem, there are two main methods one can choose from: calculating the Value Functions or Q-Values of each state and choosing actions according to those, or directly computing a policy, which defines the probability that each action should be taken given the current state, and acting according to it. The algorithms that combine the two methods in order to create a more robust
method are called...
1. Actor-Critic
2. Monte Carlo methods
3. Dynamic programming algorithms
4. Integrated algorithms
QUESTION 16 [1455] RLI Dynamic Programming ?Multiple-Choice[1] ► Grade: 10
The value of any state under an optimal policy is ______ the value of that state under a non-optimal policy.
1. Strictly greater than
2. Greater than or equal to
3. Strictly less than
4. Less than or equal to
QUESTION 17 [2010] RLI Terminology & Definitions ?Multiple-Choice[1] ► Grade: 10
As Reinforcement Learning tasks have no pre-generated training sets which they can learn from, the Agent must keep records of all the state-transitions it encountered so it can learn from them later. The memory-buffer
used to store this is often referred to as ...
1. Experience Replay
2. History
3. Trajectory
4. Eligibility Traces
The memory-buffer used to store this is often referred to as Experience Replay. There are several types and architectures of these memory buffers, but two very common ones are cyclic memory buffers (which ensure the Agent keeps training on its recent behavior rather than on experience that may no longer be relevant) and reservoir-sampling-based memory buffers (which guarantee that every recorded state-transition has an equal probability of being inserted into the buffer).
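Both buffer types can be sketched in a few lines (the class names below are made up for illustration):

```python
import random
from collections import deque

class CyclicReplay:
    """Cyclic buffer: once full, the oldest transition is overwritten first."""
    def __init__(self, capacity):
        self.buf = deque(maxlen=capacity)
    def add(self, transition):
        self.buf.append(transition)
    def sample(self, k):
        return random.sample(list(self.buf), k)

class ReservoirReplay:
    """Reservoir sampling: every transition ever seen has an equal
    probability of ending up in the buffer."""
    def __init__(self, capacity):
        self.buf, self.capacity, self.seen = [], capacity, 0
    def add(self, transition):
        self.seen += 1
        if len(self.buf) < self.capacity:
            self.buf.append(transition)
        else:
            j = random.randrange(self.seen)
            if j < self.capacity:
                self.buf[j] = transition
```

The cyclic buffer biases toward recency; the reservoir buffer preserves a uniform sample over the whole history.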
QUESTION 18 [1335] RLI MDPs ?Multiple-Choice[1] ► Grade: 10
How does the magnitude of the discount factor γ affect learning?
1. With a larger discount factor the agent is more far-sighted and considers rewards farther into the future.
2. The magnitude of the discount factor has no effect on the agent.
3. With a smaller discount factor the agent is more far-sighted and considers rewards farther into the future.
QUESTION 19 [1310] RLI Sequential Decision-Making ?Multiple-Choice[1] ► Grade: 10
If exploration is so great, why could an epsilon of 0.0 (a greedy agent) perform better than an epsilon of 0.4?
1. Epsilon of 0.0 is greedy, thus it will always choose the optimal arm.
2. Epsilon of 0.4 doesn’t explore often enough to find the optimal action.
3. Epsilon of 0.4 explores too often that it takes many sub-optimal actions causing it to do worse over the long term.
QUESTION 20 [1415] RLI Value Functions and Bellman Equations ?Multiple-Choice[1] ► Grade: 10
Write a policy π∗ in terms of q∗.
1. π∗(a|s) = q∗(s, a)
2. π∗(a|s) = max_{a′} q∗(s, a′)
3. π∗(a|s) = 1 if a = argmax_{a′} q∗(s, a′), else 0
QUESTION 21 [1400] RLI Value Functions and Bellman Equations ?Multiple-Choice[1] ► Grade: 10
Does adding a constant to all rewards change the set of optimal policies in continuing tasks?
1. Yes, adding a constant to all rewards changes the set of optimal policies.
2. No, as long as the relative differences between rewards remain the same, the set of optimal policies is the same.
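A quick numeric check of the answer, using a hypothetical finite reward sequence: shifting every reward by a constant C adds the same policy-independent bonus C·∑γ^k to every return (C/(1−γ) in the infinite-horizon limit), so the ranking of policies is unchanged.

```python
GAMMA, C = 0.9, 5.0

def ret(rewards, gamma):
    """Discounted return of a (finite prefix of a) reward sequence."""
    return sum(gamma ** k * r for k, r in enumerate(rewards))

rewards = [1.0, 0.0, 2.0, 1.0]       # hypothetical rewards under some policy
base = ret(rewards, GAMMA)
shifted = ret([r + C for r in rewards], GAMMA)
# every policy gains exactly the same bonus, so their ordering is unchanged
bonus = C * sum(GAMMA ** k for k in range(len(rewards)))
assert abs(shifted - (base + bonus)) < 1e-9
```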
QUESTION 22 [2325] RLI Model Free Learning ?Multiple-Choice[1] ► Grade: 10
Does MC use bootstrapping?
1. No
2. Yes
3. Yes, but only in "first visit" MC
4. Yes, but only in "every visit" MC
No, MC learns from complete episodes, no bootstrapping.
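A sketch of first-visit Monte Carlo prediction illustrating the point: values are averages of complete episode returns, never bootstrapped from other estimates (the episode encoding below is an assumption for illustration).

```python
from collections import defaultdict

def mc_first_visit(episodes, gamma=0.9):
    """episodes: list of trajectories, each a list of (state, reward) pairs,
    where reward is the one received after leaving that state."""
    returns = defaultdict(list)
    for episode in episodes:
        # compute the return following each time step, working backwards
        G = [0.0] * (len(episode) + 1)
        for t in range(len(episode) - 1, -1, -1):
            G[t] = episode[t][1] + gamma * G[t + 1]
        seen = set()
        for t, (s, _) in enumerate(episode):
            if s not in seen:            # first visit only
                seen.add(s)
                returns[s].append(G[t])
    return {s: sum(g) / len(g) for g_s, g in [] or returns.items() for s in [g_s]}
```

Note that every update waits for the episode to terminate: no estimate of V(s′) ever appears on the right-hand side.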
QUESTION 23 [2175] RLI Terminology & Definitions ?Multiple-Choice[1] ► Grade: 10
...or when no uncertainty exists (meaning, probabilities are either 1 or 0):
Q∗(s, a) = r(s, a) + γ max_{a′} Q∗(s′, a′)
formulates...
1. the policy gradient theorem
2. the expected likelihood of the loss function
3. the gradient of the optimal policy
4. the gradient-based formulation of the Bellman equations
5. None of these options
QUESTION 27 [1375] RLI Value Functions and Bellman Equations ?Multiple-Choice[1] ► Grade: 10
A policy is a function which maps ______ to ______.
1. Actions to probability distributions over values.
2. Actions to probabilities.
3. States to values.
4. States to probability distributions over actions.
5. States to actions.
QUESTION 28 [880] RLI General concepts ?Multiple-Choice[1] ► Grade: 10
In machine learning, we basically differentiate between ______ and ______ feedback
1. evaluative, instructive
2. evaluative, non-evaluative
3. instructive, declarative
4. informative, declarative
QUESTION 29 [1450] RLI Value Functions and Bellman Equations ?Multiple-Choice[1] ► Grade: 10
Consider an episodic MDP with one state and two actions (left and right). The left action has stochastic reward 1 with probability p and 3 with probability 1-p. The right action has stochastic reward 0 with probability q and
10 with probability 1-q. What relationship between p and q makes the actions equally optimal?
1. 7 + 3p = -10q
2. 7 + 3p = 10q
3. 7 + 2p = 10q
4. 13 + 3p = -10q
5. 13 + 2p = 10q
6. 13 + 2p = -10q
7. 13 + 3p = 10q
8. 7 + 2p = -10q
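Working the algebra behind this question: E[left] = p·1 + (1−p)·3 = 3 − 2p and E[right] = q·0 + (1−q)·10 = 10 − 10q; setting the two equal rearranges to 7 + 2p = 10q. A quick numeric check:

```python
def expected_left(p):
    """Left arm: reward 1 with probability p, 3 with probability 1 - p."""
    return 1 * p + 3 * (1 - p)        # = 3 - 2p

def expected_right(q):
    """Right arm: reward 0 with probability q, 10 with probability 1 - q."""
    return 0 * q + 10 * (1 - q)       # = 10 - 10q

# equality of the two expectations rearranges to 7 + 2p = 10q
for p in (0.0, 0.25, 1.0):
    q = (7 + 2 * p) / 10
    assert abs(expected_left(p) - expected_right(q)) < 1e-9
```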
QUESTION 30 [1980] RLI Terminology & Definitions ?Multiple-Choice[1] ► Grade: 10
The use of a Reinforcement Learning algorithm with a deep neural network as an approximator for the learning part. This is usually done in order to cope with problems where the number of
possible states and actions scales fast, and an exact solution is no longer feasible.
1. Deep Reinforcement Learning
2. DQN
3. Deep Q-Learning
4. DDPG (Deep Deterministic Policy Gradient)
QUESTION 31 [1430] RLI Value Functions and Bellman Equations ?Multiple-Choice[1] ► Grade: 10
Every finite Markov decision process has ______.
1. A stochastic optimal policy
2. A unique optimal policy
3. A deterministic optimal policy
4. An asymptotic optimal value function
The book by Sutton and Barto takes the existence of optimal policies for granted and leaves this question unanswered. The proof is not for the faint-hearted, however. To prove the existence of the optimal policy in finite MDPs, one must prove the following two statements:
(1) the set of Bellman optimality equations has solutions, and
(2) one of its solutions has values greater than or equal to the values of the other solutions in all states.
The proof involves showing that the Bellman optimality operator is a contraction mapping in the infinity norm and applying the Banach fixed-point theorem.
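The contraction property can be checked numerically on a made-up two-state MDP (the transition matrix and rewards below are arbitrary illustrations, using the fixed-policy Bellman operator for simplicity):

```python
GAMMA = 0.9
P = [[0.9, 0.1],
     [0.2, 0.8]]        # row-stochastic transition matrix (made up)
R = [1.0, 0.0]          # rewards (made up)

def bellman(v):
    """One application of the Bellman operator for this fixed dynamics."""
    return [R[s] + GAMMA * sum(P[s][t] * v[t] for t in range(2)) for s in range(2)]

def dist(u, v):
    """Infinity-norm distance."""
    return max(abs(a - b) for a, b in zip(u, v))

# the operator shrinks distances by at least a factor of GAMMA...
v1, v2 = [0.0, 0.0], [10.0, -5.0]
assert dist(bellman(v1), bellman(v2)) <= GAMMA * dist(v1, v2) + 1e-12

# ...so by the Banach fixed-point theorem, iterating converges to a unique fixed point
v = [0.0, 0.0]
for _ in range(300):
    v = bellman(v)
assert dist(v, bellman(v)) < 1e-6
```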
QUESTION 32 [2050] RLI Terminology & Definitions ?Multiple-Choice[1] ► Grade: 10
______ is an off-policy Reinforcement Learning algorithm
1. Q-Learning
2. SARSA
3. Monte Carlo Method
4. REINFORCE
Q-Learning is an off-policy Reinforcement Learning algorithm, considered one of the most basic. In its most simplified form, it uses a table to store the Q-Values of all possible state-action pairs. It updates this table using the Bellman equation, while action selection is usually made with an ε-greedy policy.
In its simplest form (no uncertainties in state-transitions and expected rewards), the update rule of Q-Learning is:
Q(s, a) = r(s, a) + γ max_{a′} Q(s′, a′)
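The simplified update rule can be run to convergence on a toy deterministic chain (the three-state environment below is invented for illustration):

```python
# Tabular Q-Learning with the simplified (deterministic) update rule, on a
# toy 3-state chain: 0 -> 1 -> 2, where reaching state 2 pays +1.
GAMMA = 0.9
ACTIONS = ["go"]
Q = {(s, a): 0.0 for s in range(3) for a in ACTIONS}

def step(s, a):
    """Invented deterministic dynamics for illustration."""
    s_next = min(s + 1, 2)
    reward = 1.0 if (s != 2 and s_next == 2) else 0.0
    return reward, s_next

for _ in range(10):                      # a few sweeps suffice on this chain
    for s in range(2):                   # states 0 and 1 (state 2 is terminal)
        r, s_next = step(s, "go")
        Q[(s, "go")] = r + GAMMA * max(Q[(s_next, a)] for a in ACTIONS)
# values propagate backwards: Q(1, go) = 1.0 and Q(0, go) = GAMMA * 1.0 = 0.9
```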
2. association of the actions with the situations in which they are best
3. exhaustive exploration of the search space
4. correlation between the states and their associated actions
QUESTION 35 [1360] RLI MDPs ?Multiple-Choice[1] ► Grade: 10
What is the reward hypothesis?
1. That all of what we mean by goals and purposes can be well thought of as the minimization of the expected value of the cumulative sum of a received scalar signal (called reward)
2. That all of what we mean by goals and purposes can be well thought of as the maximization of the expected value of the cumulative sum of a received scalar signal (called reward)
3. Ignore rewards and find other signals.
4. Always take the action that gives you the best reward at that point.
QUESTION 36 [925] RLI Reinforcement Learning ?Multiple-Choice[2] ► Grade: 20
For variance reduction we could use techniques such as (mark 2 answers):
1. Control variates
2. Importance sampling
3. Over-sampling
4. Under-sampling
QUESTION 37 [2035] RLI Terminology & Definitions ?Multiple-Choice[1] ► Grade: 10
Algorithms that take into account the policy which yielded past state-action decisions are referred to as ______ algorithms
1. On-Policy
2. Greedy
3. Iterative
4. model-based
Every Reinforcement Learning algorithm must follow some policy in order to decide which actions to perform at each state. Still, the learning procedure of the algorithm doesn’t have to take that policy into account while learning.
Algorithms that take into account the policy which yielded past state-action decisions are referred to as on-policy algorithms, while those ignoring it are known as off-policy.
A well-known off-policy algorithm is Q-Learning, as its update rule uses the action that will yield the highest Q-Value, while the actual policy used might restrict that action or choose another. The on-policy variation of Q-Learning is known as Sarsa, where the update rule uses the action chosen by the followed policy.
QUESTION 38 [2085] RLI Terminology & Definitions ?Multiple-Choice[1] ► Grade: 10
______ is a learning method which combines both Dynamic Programming and Monte Carlo principles; it learns “on the fly” similarly to Monte Carlo, yet updates its estimates like Dynamic Programming.
1. Temporal-Difference (TD)
2. Bootstrapping
3. Dynamic Random Sampling
4. Value iteration
Temporal Difference is a learning method which combines both Dynamic Programming and Monte Carlo principles; it learns “on the fly” similarly to Monte Carlo, yet updates its estimates like Dynamic Programming. One of the simplest Temporal Difference algorithms is known as one-step TD or TD(0). It updates the Value Function according to the following update rule:
V(s_t) = V(s_t) + α[r_{t+1} + γV(s_{t+1}) − V(s_t)]
where V is the Value Function, s is the state, r is the reward, γ is the discount factor, α is a learning rate, t is the time-step, and the ‘=’ sign is used as an update operator rather than equality. The term found in the square brackets is known as the temporal difference error.
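The TD(0) update rule translates directly to code (the state names and constants below are arbitrary):

```python
GAMMA, ALPHA = 0.9, 0.5                  # discount factor and learning rate
V = {"A": 0.0, "B": 0.0, "end": 0.0}     # hypothetical states

def td0_update(s, r, s_next):
    """One-step TD: the bracketed term in the rule above is the TD error."""
    td_error = r + GAMMA * V[s_next] - V[s]
    V[s] += ALPHA * td_error
    return td_error
```

Each call moves V(s) a fraction α of the way toward the bootstrapped target r + γV(s′), without waiting for the episode to end.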
QUESTION 39 [2955] RLI Markov decision problems and Dynamic programming ?Multiple-Choice[1] ► Grade: 10
Optimality criterion defines the criterion to select between ______
1. actions
2. discounted rewards
3. Q-values
4. Policy values
QUESTION 40 [2270] RLI MDP Planning ?Multiple-Choice[1] ► Grade: 10
Q-Learning is an example of _____ learning algorithm.
1. model-free
2. model-based
3. on-policy
4. dynamic programming
Q-Learning is an example of a model-free learning algorithm.
It does not assume that the agent knows anything about the state-transition and reward models; instead, the agent discovers which actions are good and bad by trial and error.
The basic idea of Q-Learning is to approximate the state-action Q-function from the samples of Q(s, a) observed during interaction with the environment. This approach is known as Temporal-Difference (sometimes called Time-Difference) Learning.
QUESTION 41 [810] RLI Reinforcement Learning ?Multiple-Choice[1] ► Grade: 10
The Bellman Optimality Equation is:
1. q∗(s, a) = ∑_{s′,r} p(s′, r|s, a)[r + γ max_{a′} q∗(s′, a′)]
2. v_π(s) = ∑_a π(a|s) ∑_{s′,r} p(s′, r|s, a)[r + γ v_π(s′)]
3. v∗(s) = max_a ∑_{s′,r} p(s′, r|s, a)[r + γ v∗(s′)]
4. neither of these
QUESTION 42 [2020] RLI Terminology & Definitions ?Multiple-Choice[1] ► Grade: 10
Reinforcement Learning tasks have no pre-generated training sets which they can learn from — they create their own experience and learn “on the fly”. To be able to do so, the Agent needs to try many different actions in many different states in order to learn all available possibilities and find the path which will maximize its overall reward, and it must use the information it learned to do so. The latter is known as ______, as the Agent uses its knowledge to maximize the rewards it receives.
1. Exploitation
2. Policy execution
3. Utilization
4. Reward cropping
Reinforcement Learning tasks have no pre-generated training sets which they can learn from. They create their own experience and learn on the fly.
To be able to do so, the Agent needs to try many different actions in many different states in order to try and learn all available possibilities and find the path which will maximize its overall reward; this is known as Exploration, as the
Agent explores the Environment.
On the other hand, if all the Agent does is explore, it will never maximize the overall reward; it must also use the information it learned. This is known as Exploitation, as the Agent exploits its knowledge to maximize the rewards it receives.
The trade-off between the two is one of the greatest challenges of Reinforcement Learning problems, as the two must be balanced in order to allow the Agent to both explore the environment enough, but also exploit what it learned and
repeat the most rewarding path it found.
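The exploration/exploitation balance described above is commonly implemented with an ε-greedy rule; a minimal sketch:

```python
import random

def epsilon_greedy(q_values, epsilon):
    """With probability epsilon, explore (random action); otherwise exploit."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))                   # explore
    return max(range(len(q_values)), key=lambda a: q_values[a])  # exploit
```

With ε = 0.0 this is the purely greedy agent; larger ε trades more exploration for less exploitation.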
QUESTION 43 [1330] RLI MDPs ?Multiple-Choice[1] ► Grade: 10
If the reward is always +1, what is the sum of the discounted infinite return when γ < 1?
G_t = ∑_{k=0}^{∞} γ^k R_{t+k+1}
1. G_t = 1 / (1 − γ)
2. G_t = γ / (1 − γ)
3. Infinity.
4. G_t = 1 ∗ γ^k
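With R = +1 always, the return is the geometric series ∑_{k≥0} γ^k = 1/(1 − γ). A quick numeric check of the closed form:

```python
GAMMA = 0.9
closed_form = 1 / (1 - GAMMA)                    # = 10 for gamma = 0.9
partial = sum(GAMMA ** k for k in range(1000))   # truncated series
assert abs(partial - closed_form) < 1e-6
```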
QUESTION 44 [1405] RLI Value Functions and Bellman Equations ?Multiple-Choice[1] ► Grade: 20
Select the equation that correctly relates v∗ to q∗. Assume π is the uniform random policy.
1. v∗(s) = max_a q∗(s, a)
2. v∗(s) = ∑_{a,r,s′} π(a|s)p(s′, r|s, a)[r + q∗(s′)]
3. v∗(s) = ∑_{a,r,s′} π(a|s)p(s′, r|s, a)[r + γ q∗(s′)]
4. v∗(s) = ∑_{a,r,s′} π(a|s)p(s′, r|s, a) q∗(s′)
© 2024 - José Manuel Rey