
1CM240

AI in Logistics - 2023/2024
Assignment 1 (50% of course grade)

Preliminary information
• This assignment should be completed in the groups with which you registered in Canvas.

• Before each question, the number of points is indicated. In total, there are 100 points to
be earned in this assignment. Your final score is the number of points divided by 10.

• This is a challenge-based course, so the assignment balances semi-open questions, which
require you to think proactively and make well-motivated choices when formulating Markov
decision processes, with programming exercises.

• Mathematical notation is extremely important. Without proper notation, no points will be
awarded for a question.

• Some questions require programming in Python. Please hand in the Python code. You need to
explain the code in enough detail, either via comments within the code or by decomposing it
into cells as in a Jupyter notebook.

• Please do not forget to fill in the mandatory peer-review forms. Without the peer-review forms,
no grade can be awarded to this course.

• All group members should show understanding of the answers they provided and should be able
to replicate them to a reasonable extent in a fully offline fashion. This means that suspicion
of excessive ChatGPT use will lead to additional questions and an in-person meeting in which
the group members are questioned about their approach. The teachers reserve the right to fail
a group on the assignment if it appears that the group members are not sufficiently familiar
with the provided answers.

• We will give you feedback via the SpeedGrader in Canvas. Please upload the pdf as a separate
file (i.e. do not include it in a zip). This file should include all peer reviews. We may subtract
points if these requirements are not met.

The assignment
1. [10 points] Picking is typically one of the most expensive operations in a warehouse. Therefore,
in many warehouses, pickers are now supported by robots. In our setting, robots follow the
picker to the picking locations, the picker puts the parts on the robot, and the robot brings
the parts to the packing locations. The next picking location is determined by an RL algorithm,
which selects it from 5 potential picking orders. There are two pickers in the warehouse. An
example of the process can be seen here:

https://www.youtube.com/watch?v=JoVL0pbabqg.

Write down the MDP that underlies the RL algorithm. Make assumptions about the warehouse
layout. Also briefly describe at least one good alternative state space and discuss the
advantages of using one or the other.

2. [15 points] You are the marketing manager of a huge online retail store, dol.com. Since sales
dropped during the recession, you plan a sales campaign to target your customers. During the
campaign, every customer who visits the page gets a personalized voucher (assuming they are
logged in to their customer account; we ignore visitors who are not logged in). For privacy
reasons, all you know about your customers is their age class (young or old) and salary class
(low or high). You have three choices: offer no voucher, 20 Euro off a sale above 100 Euro, or
5 Euro off any sale. You can observe whether the customer made a purchase or did not purchase
anything.
(a) Write down the MDP that describes the problem above.
(b) Use the provided code to learn a contextual three-armed bandit that decides which offer
to present to whom. Show the learned actions dependent on the age and income class and
visualize learning over time. (A minimal bandit sketch is given after this question.)
(c) Compute the expected reward of the different actions for each context analytically and
discuss whether your bandit learned the optimal policy. Implement one other sampling
strategy and compare them in terms of regret.
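
For reference, a minimal epsilon-greedy contextual bandit in Python. This is a sketch only:
the purchase probabilities and per-sale rewards below are illustrative assumptions, not part
of the exercise data.

import numpy as np

rng = np.random.default_rng(0)

contexts = [("young", "low"), ("young", "high"), ("old", "low"), ("old", "high")]
n_arms = 3  # 0: no voucher, 1: 20 Euro off above 100 Euro, 2: 5 Euro off any sale

# Assumed purchase probabilities per context and arm (illustrative only).
p_purchase = {
    ("young", "low"):  [0.05, 0.10, 0.25],
    ("young", "high"): [0.15, 0.30, 0.20],
    ("old", "low"):    [0.10, 0.05, 0.20],
    ("old", "high"):   [0.20, 0.35, 0.15],
}
# Assumed net reward of a purchase per arm (margin minus voucher cost).
reward_per_sale = [10.0, 8.0, 7.0]

counts = {c: np.zeros(n_arms) for c in contexts}
values = {c: np.zeros(n_arms) for c in contexts}
eps = 0.1  # exploration rate

for t in range(20000):
    c = contexts[rng.integers(len(contexts))]   # a logged-in customer arrives
    if rng.random() < eps:                      # explore
        a = int(rng.integers(n_arms))
    else:                                       # exploit the current estimates
        a = int(np.argmax(values[c]))
    purchased = rng.random() < p_purchase[c][a]
    r = reward_per_sale[a] if purchased else 0.0
    counts[c][a] += 1                           # incremental sample-mean update
    values[c][a] += (r - values[c][a]) / counts[c][a]

for c in contexts:
    print(c, "-> learned best arm:", int(np.argmax(values[c])))

For part (c), Thompson sampling and UCB are natural alternative sampling strategies to
implement and compare in terms of regret.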
3. [15 points] Hello Fish provides meal boxes on demand. You can order meals of different
sizes, which get delivered to your house within an hour. Aiming to become more sustainable,
Hello Fish is interested in replacing the current one-time-use carton boxes in which each
meal is delivered with plastic boxes that can be reused multiple times. There is a known and
given customer base that orders frequently (but not always). In this exercise, we assume that
we can directly fulfill an order after it is placed.
However, the plastic boxes still need to be produced, and Hello Fish currently has no clue about
how many boxes they need to produce and how to manage the return of these boxes. Therefore,
they asked us to provide a model of the operational processes of the return of boxes. In turn,
they want to use this model to evaluate how many boxes of what sort they need to order.
They asked us to take the following considerations into account.

• Assume that boxes are stored at a single warehouse. The warehouse contains a cleaning
facility through which the boxes must pass after return and before their next use. In the
model, you may assume cleaning takes a negligible amount of time.
• We can assume each box has a lifetime distribution that depends on the number of usages.
The expected lifetime is 100 usages, irrespective of the box size. If a box breaks, a cost
is paid to repair it, which you may also assume happens instantaneously.
• The Hello Fish warehouse operates 24/7, but boxes are only distributed to customers between
8 am and 10 pm.
• A customer’s meal demand is stochastic, but it is always an integer expressing the number
of meals ordered for that particular week. We can assume that the box size is one-to-one
related to the number of ordered meals.
• We can assume that the number of box sizes corresponds to the possible order sizes and is
given.
• If a customer orders, there is a probability that she is not at home. In that case, the
boxes she currently possesses are not directly returned. If the customer is home, she will
return all the boxes that were previously delivered to her.
• The main decision Hello Fish wants you to focus on is determining, in real time at each
customer order, which box to use. Some boxes might be too big, some might be too small, and
some boxes might be out of inventory.
• You might assume that you have a fixed customer base.
• The shipping cost per box relates to the box size. So, if we ship a box that is too large for
the ordered meals, we pay a higher cost than needed. In this model, you may assume that
you incur a penalty cost if you decide not to serve a customer.

• You are interested in minimizing the future discounted cost, for a given number of box sizes
and meals.

Model the decision on which box to assign to which customer as a discrete-time Markov decision
process. The model should be generic, meaning that it can comply with any number of boxes,
box sizes, customers, etc. Consult the slides of Module 1 - Markov decision processes for a full
description of what a Markov decision process model requires you to define. As a hint, place
special emphasis on the information needed for each customer, how to track the number of times
a box is used, and how to model the return of boxes compared to the assignment of boxes to
customers. As a second hint, the resulting MDP will probably be too large to solve to optimality.
4. [10 points - coding exercise] In the previous question, you created an MDP. In this question,
we want you to create a simulation that acts as the environment/system of the previously
modeled MDP. We want you to evaluate two policies in this system. The first policy assigns the
box with the closest fit to the order. The second policy uses a random feasible box (i.e. the
required box size or larger). To get this to work, you at least need:
• A procedure that generates new demand. You can assume a demand distribution.
• A data structure that stores the information at each time epoch.
• A policy that makes a decision.
In your report, provide the relevant parts of the code and comment on them. (A minimal
simulation sketch is given below.)
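
A minimal sketch of such a simulation. Everything below is assumed for illustration: the box
sizes, initial fleet, uniform demand distribution, cost parameters, and a simplified return
process in which returns are aggregated rather than tracked per customer.

import random

random.seed(1)

BOX_SIZES = [1, 2, 3, 4]   # assumed: one box size per possible order size
INITIAL_FLEET = 10         # assumed number of boxes per size

def sample_demand():
    # Assumed demand distribution: order size uniform over the box sizes.
    return random.choice(BOX_SIZES)

def closest_fit(inventory, order):
    # Smallest available box that fits the order, or None if infeasible.
    feasible = [s for s in BOX_SIZES if s >= order and inventory[s] > 0]
    return min(feasible) if feasible else None

def random_feasible(inventory, order):
    # A uniformly random feasible box, or None if infeasible.
    feasible = [s for s in BOX_SIZES if s >= order and inventory[s] > 0]
    return random.choice(feasible) if feasible else None

def simulate(policy, horizon=1000, p_home=0.8, penalty=50.0, ship_cost=2.0):
    inventory = {s: INITIAL_FLEET for s in BOX_SIZES}
    outstanding = []   # boxes currently at customers
    history = []       # data structure storing the state at each time epoch
    total_cost = 0.0
    for t in range(horizon):
        order = sample_demand()
        box = policy(inventory, order)
        if box is None:
            total_cost += penalty          # customer not served
        else:
            inventory[box] -= 1
            total_cost += ship_cost * box  # shipping cost scales with box size
            outstanding.append(box)
        if outstanding and random.random() < p_home:
            inventory[outstanding.pop(0)] += 1  # box returned, cleaned, reusable
        history.append((t, dict(inventory), len(outstanding)))
    return total_cost

print("closest fit    :", simulate(closest_fit))
print("random feasible:", simulate(random_feasible))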
5. [20 points] Hello Fish also considers buying the boxes externally, using a single-sourcing
system. They decide to buy one type of box. Hello Fish can order 25,000 boxes at a cost of 1
per box, or it can order 10,000 boxes at a cost of 1.20 per box. The capacity of its inventory
is 50,000. If Hello Fish places an order, it must wait two weeks before the order arrives. The
holding costs per box per week equal 1, and the lost sales costs equal 20 per box. The demand
distribution per week is:

Demand        10,000   15,000
Probability     0.35     0.65

The sequence of events is as follows. At the start of a period (a week), a new order can be
placed. Afterward, the demand for the week is observed and served. Then, an order arrives:
due to the two-week lead time, this is the order placed at the start of the previous week.
Thus, the state at the start of a period is the tuple St = (It, Ot), where It is the current
inventory level and Ot is the order placed at the beginning of the previous period.
(a) Define this stochastic sequential decision problem as a Markov decision process using the
concept of pre- and post-decision states.
Hello Fish uses a simple policy: it orders 15,000 units as soon as its inventory position
(It + Ot) equals 0.
(b) The cost of the optimal policy can be obtained via policy iteration. First, write out two
iterations of the policy evaluation algorithm for the proposed policy above. Ensure the logic
is clear. You are allowed to use Python or Excel (or any other means) to speed up the
somewhat tedious calculations, but we require you to write out the two policy evaluation
iterations in a mathematically proper way. (A sketch in code is given after this question.)
(c) Assume that the policy evaluation algorithm converged after the two iterations. Prove that
this policy is not optimal by applying the policy improvement algorithm.
(d) The above two steps can be seen as a form of policy iteration. Explain what the
procedure would have been if value iteration had been chosen as the method to solve this
problem. Write down which equations and calculations should have been made. Again, do this
in a proper mathematical way; this does not require solving the equations.
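
A sketch of the policy evaluation of part (b) in Python. This is not the required mathematical
write-up; it only illustrates the sweeps. The discretization into units of 5,000 boxes, the
discount factor of 0.9, and the 1.20-per-box price for the policy's 15,000-box order (which is
not one of the two listed order quantities) are all assumptions.

# Assumed: work in units of 5,000 boxes to keep the state space small.
U = 5000
CAP = 50000 // U                               # inventory capacity in units
DEMAND = {10000 // U: 0.35, 15000 // U: 0.65}  # weekly demand distribution
HOLD, LOST, GAMMA = 1.0 * U, 20.0 * U, 0.9     # costs per unit of 5,000 boxes

def order_cost(a):
    # 25,000 boxes cost 1 each; smaller orders are priced at 1.20 per box (assumption).
    qty = a * U
    return 0.0 if qty == 0 else (25000.0 if qty == 25000 else 1.20 * qty)

def policy(i, o):
    # The proposed policy: order 15,000 when the inventory position is 0.
    return 15000 // U if i + o == 0 else 0

def step(i, o, a, d):
    served = min(i, d)
    lost = d - served
    i_next = min(i - served + o, CAP)          # last week's order arrives after demand
    cost = order_cost(a) + HOLD * i_next + LOST * lost
    return (i_next, a), cost

states = [(i, o) for i in range(CAP + 1) for o in range(CAP + 1)]
V = {s: 0.0 for s in states}
for sweep in range(2):                          # two policy-evaluation iterations
    V_new = {}
    for (i, o) in states:
        a = policy(i, o)
        v = 0.0
        for d, p in DEMAND.items():
            s_next, c = step(i, o, a, d)
            v += p * (c + GAMMA * V[s_next])
        V_new[(i, o)] = v
    V = V_new
print("V(0, 0) after two sweeps:", round(V[(0, 0)], 1))

Each sweep applies V(s) <- sum_d p(d) [ c(s, pi(s), d) + gamma * V(s') ] synchronously over
all states; the graded answer is the mathematical write-up of these two sweeps.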

6. [12 + 18 points - coding exercise] You are the logistics manager of Blopper, a whatever-you-
need-it-is-probably-there retail store in the Netherlands. As such, you are responsible for the
transport of goods to the stores. You completely control the trucks that deliver from the depot
to the stores. That means you decide when a truck goes to a store and which items and quantities
are on the truck. You have a smart employee, Mrs. Slim, who has already partitioned the complex
network you usually have to handle and optimize. One part of the partitioned network contains
three stores: two in the south of Eindhoven and one close to the university in the center.
The demand (in containers) is distributed according to a Gamma distribution with:

Index   Location   Street          Mean µi   Variance σi²
1       South      Alsterweg          2.9        1.3
2       South      Bosranklaan        4.5        2.1
3       Center     Stationsplein      3.0        2.3

Demand is rounded up to full containers. You need to plan the replenishment of the stores. When
you replenish stores, you have to pay a fixed cost and a route-dependent transportation cost.
The sum of both costs is given in this table (in € per replenishment):

Route   (1)   (2)   (3)   (1,2)   (1,3)   (2,3)   (1,2,3)
Costs    40    40    55     60      75      75      500

If you fail to supply the right quantity to a store and the store stocks out, you have to pay a
backorder cost of 18€ per container. For every container left at a store after the day's sales
have ended, you need to pay a holding cost of 1.20€ per container per day. You assume that
demand happens over the day and that replenishment happens at the start of the day. Demand is
realized in units of containers. You can assume demand is at most ten and that at most ten
containers can be in stock in each store. Upon replenishment, items are immediately put on
shelves and are ready for sale.

(a) Write down this stochastic sequential decision problem as a Markov decision process.
(b) Use the value iteration algorithm to (try to) solve this Markov decision process without
using pre- and post-decision states. It might take a lot of time before the algorithm
converges. To improve the speed, you can use the concept of pre- and post-decision states.
Implement both ways and compare the convergence of the algorithm. (A generic value iteration
skeleton is sketched below.)
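
A generic value iteration skeleton, as a sketch only: the states, actions, transition, and
cost callables must come from your own MDP of part (a); the two-state toy MDP at the bottom
is purely illustrative.

def value_iteration(states, actions, transition, cost, gamma=0.95, tol=1e-6):
    # In-place (Gauss-Seidel) value iteration.
    # actions(s) -> iterable of actions; transition(s, a) -> iterable of
    # (probability, next_state) pairs; cost(s, a) -> expected one-step cost.
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            best = min(
                cost(s, a) + gamma * sum(p * V[s2] for p, s2 in transition(s, a))
                for a in actions(s)
            )
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:
            return V

# Tiny smoke test on a two-state toy MDP (purely illustrative).
V = value_iteration(
    states=[0, 1],
    actions=lambda s: [0, 1],
    transition=lambda s, a: [(1.0, (s + a) % 2)],
    cost=lambda s, a: 1.0 if s == 1 else 0.0,
    gamma=0.9,
)
print(V)

With post-decision states, the expectation over demand is evaluated once per post-decision
state rather than once per state-action pair, which is what speeds up part (b).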

Due to a bug in the ordering system, it might happen, with a known probability, that only a
certain percentage of the ordered quantity arrives at a store (assume a given probability and
percentage). Also, the maximum inventory of each store is increased to 100, and the demand
mean and variance are as in the table below. Negative demands are assumed to be 0.

Index   Location   Street          Mean µi   Variance σi²
1       South      Alsterweg          29          6
2       South      Bosranklaan        45         18
3       Center     Stationsplein      30         20

(c) Train tabular Q-learning and SARSA on this problem. (A single-store Q-learning sketch is
given after part (d).)


(d) Analyze the outcome of your algorithms. Give a brief intuition on the decisions proposed
by the algorithm (or lay out the differences with your intuition).
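
A tabular Q-learning sketch for part (c), under an assumed single-store simplification: one
store with store 1's demand parameters from the table above, state = on-hand stock only, no
routing, and assumed values for the bug probability, arriving percentage, and learning
parameters.

import numpy as np

rng = np.random.default_rng(0)

MAX_STOCK = 100
MU, VAR = 29.0, 6.0
SHAPE, SCALE = MU**2 / VAR, VAR / MU     # Gamma parameters from mean/variance
FIXED, HOLD, BACK = 40.0, 1.20, 18.0     # costs from the question (route (1))
P_BUG, PCT = 0.1, 0.5                    # assumed bug probability / percentage
ALPHA, GAMMA_D, EPS = 0.1, 0.95, 0.1     # assumed learning parameters

Q = np.zeros((MAX_STOCK + 1, MAX_STOCK + 1))  # Q[stock, order quantity]

def sample_demand():
    # Demand rounded up to full containers; negative demand cannot occur.
    return min(int(np.ceil(rng.gamma(SHAPE, SCALE))), MAX_STOCK)

s = 0
for t in range(500_000):
    n_feas = MAX_STOCK - s                       # order cannot exceed capacity
    if rng.random() < EPS:                       # epsilon-greedy exploration
        a = int(rng.integers(n_feas + 1))
    else:
        a = int(np.argmax(Q[s, :n_feas + 1]))
    arrived = int(a * PCT) if rng.random() < P_BUG else a   # ordering-system bug
    stock = s + arrived                          # replenishment at start of day
    d = sample_demand()
    s_next = max(stock - d, 0)
    cost = (FIXED if a > 0 else 0.0) + HOLD * s_next + BACK * max(d - stock, 0)
    best_next = np.max(Q[s_next, :MAX_STOCK - s_next + 1])
    Q[s, a] += ALPHA * (-cost + GAMMA_D * best_next - Q[s, a])
    s = s_next

print("greedy order at stock 0:", int(np.argmax(Q[0])))

For SARSA, replace best_next by Q[s_next, a_next], where a_next is chosen with the same
epsilon-greedy rule and then actually executed in the next step.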
