• the limited and variable downlink bandwidth.

This environment can be partially solved using a rule-based system; however, its randomness and the variability of conditions not directly connected to the mission (e.g., clouds over the ground station) can greatly increase the number of states to envision and to solve.

In this context, a learning system can understand its environment and act accordingly, without an explicit definition of the states to foresee.

This expectancy is also known, in the simplified framework of Artificial Intelligence, as a reward function r.
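For reference, the standard definition of the discounted return is

    G_t = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}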
where G represents the return and γ is a parameter in [0, 1] called the discount factor. The discount factor determines the present value of future rewards: as it approaches 1, the return objective takes future rewards into account more strongly.

As a practical example, an RL-based space lander needs to learn an optimal policy for controlling the engines and the attitude system to avoid crashing on the ground or landing hard. The agent is punished if the action taken leads to destruction and rewarded for a successful landing.
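A minimal sketch of such a reward signal, with illustrative penalties and state variables rather than an actual lander model, could look like this:

    def lander_reward(vertical_speed, tilt, crashed, landed):
        """Toy reward: punish destruction, reward a soft touchdown,
        and lightly shape the agent toward low speed and small tilt."""
        if crashed:
            return -100.0  # action led to destruction
        if landed:
            return 100.0   # successful landing
        return -0.1 * abs(vertical_speed) - 0.05 * abs(tilt)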
The drawback of this class of combinatorial search methods is that generic algorithms are neither guaranteed to find an optimal solution nor to run fast (in polynomial time).

We treated our problem as one of these combinatorial problems, the Knapsack Problem (KP). KP is a combinatorial optimization problem that aims to maximize the value of the items contained in a knapsack subject to a weight constraint. There are a few versions of the problem in the literature. We considered two of them, as they are closely related to our scheduling problem.
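For reference, the 0/1 variant of KP can be solved exactly by dynamic programming in pseudo-polynomial time; a minimal sketch, with illustrative variable names:

    def knapsack_01(values, weights, capacity):
        """0/1 knapsack via dynamic programming.
        Returns the maximum total value achievable within `capacity`.
        Runs in O(n * capacity) time, i.e. pseudo-polynomial."""
        dp = [0] * (capacity + 1)
        for value, weight in zip(values, weights):
            # iterate capacities downward so each item is taken at most once
            for c in range(capacity, weight - 1, -1):
                dp[c] = max(dp[c], dp[c - weight] + value)
        return dp[capacity]

    # e.g. packet priorities 6, 10, 12 with sizes 1, 2, 3 and capacity 5
    assert knapsack_01([6, 10, 12], [1, 2, 3], 5) == 22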
We can think of the problem as an episodic task by making some assumptions. First, there is a collection phase in which the satellite collects all the data during its trajectory before reaching GS visibility. When a GS is reached, the agent starts performing actions, selecting a particular set of packets scheduled for the downlink to optimize the throughput and downlink efficiency. If the agent can no longer select any packet, because the downlink capacity is exhausted or no more packets are available, the episode ends, and the loop repeats.
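Under these assumptions, the episode loop can be sketched as follows; the Packet type and the select_packet policy call are hypothetical names, not the actual interface used in this work:

    from dataclasses import dataclass

    @dataclass
    class Packet:
        size: int
        priority: int

    def run_episode(agent, buffer, downlink_capacity):
        """One ground-station pass: the agent selects packets until the
        downlink budget is exhausted or no remaining packet fits."""
        remaining = downlink_capacity
        scheduled = []
        while buffer and any(p.size <= remaining for p in buffer):
            packet = agent.select_packet(buffer, remaining)
            if packet.size > remaining:
                break  # infeasible action: end the episode
            buffer.remove(packet)
            remaining -= packet.size
            scheduled.append(packet)
        return scheduled  # the loop repeats at the next collection phase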
Figure 5. Average of results after 50 runs with 100 episodes each in the offline case. The upper bound represents the optimal solution found by the dynamic programming approach.

Figure 6. PPO-Baseline comparison, online case.
We used an optimal solution based on dynamic programming to have a measure of the maximum value of priority that can be obtained within a specific sequence of packets. The comparison was made by running the different agents/algorithms on the same data buffer in each episode. Figure 4 shows the results after running the DQN agent and the baseline for ten episodes. We can see that both exploit almost all of the downlink capacity, but in terms of total downloaded priority, the DQN agent performs considerably better.
Figure 5 instead shows the average results over 50 runs with 100 episodes each. We can notice how the DQN agent approaches the optimal value found by the dynamic programming solution, bearing in mind that its recursive structure took considerable computational time in some episodes to find a solution.
In the online version of the problem, we did not have a greedy solution to compare with, so we used one of the simplest scheduling algorithms, a First Come First Served approach. Figure 6 shows the results after running the PPO agent and the baseline for ten episodes. We can notice how the PPO agent beats the baseline algorithm in each episode. Again, we averaged the results over 50 runs with 100 episodes each to obtain a more robust benchmark (Figure 6).
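Such a baseline amounts to a single greedy pass over the buffer in arrival order; an illustrative sketch:

    def fcfs_schedule(packets, downlink_capacity):
        """First Come First Served baseline: walk the buffer in arrival
        order and take every packet that still fits the remaining budget."""
        remaining = downlink_capacity
        scheduled = []
        for packet in packets:  # assumed sorted by arrival time
            if packet.size <= remaining:
                scheduled.append(packet)
                remaining -= packet.size
        return scheduled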
4.3. Deployment on embedded devices

The deployment process requires adapting the original agent networks to the target embedded device. This is commonly accomplished using a neural network optimization framework such as TensorFlow Lite or Intel OpenVINO.

The commonly used pipeline involves several steps. During the training process, the TensorFlow APIs generate a model (in the form of one or multiple files) composed of an architecture specification file and an associated weights file. This model file is then optimized using the TFLite converter, which reduces the model memory footprint and inference time (by performing network quantization [5] and connection pruning).
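As an illustration, a typical conversion step with the TFLite converter looks like the following (paths are placeholders, not the configuration used in this work):

    import tensorflow as tf

    # Load the trained agent network from a SavedModel directory
    converter = tf.lite.TFLiteConverter.from_saved_model("path/to/saved_model")

    # Default optimizations enable post-training quantization
    converter.optimizations = [tf.lite.Optimize.DEFAULT]

    tflite_model = converter.convert()
    with open("agent.tflite", "wb") as f:
        f.write(tflite_model)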
REFERENCES