
Energy

A Comparative Study of 13 Deep Reinforcement Learning Based Energy Management Methods for a Hybrid Electric Vehicle
--Manuscript Draft--

Manuscript Number: EGY-D-22-07001R1

Article Type: Full length article

Keywords: hybrid electric vehicle; energy management strategy; deep reinforcement learning

Abstract: Energy management strategy (EMS) has a huge impact on the energy efficiency of hybrid electric vehicles (HEVs). Recently, a fast-growing number of studies have applied different deep reinforcement learning (DRL) based EMSs to HEVs. However, a unified performance review benchmark is lacking for the most popular DRL algorithms. In this study, 13 popular DRL algorithms are applied as HEV EMSs. The reward performance, computation cost, and learning convergence of the different DRL algorithms are discussed. In addition, the HEV environments are modified to fit both discrete and continuous action spaces. The results show that agent learning in the continuous action space is more stable than in the discrete action space. In the continuous action space, SAC has the highest reward, and PPO has the lowest time cost. In the discrete action space, DQN has the lowest time cost, and FQF has the highest reward. The comparison among SAC, FQF, rule-based, and equivalent consumption minimization strategies (ECMS) shows that the DRL EMSs run the engine more efficiently, thus reducing fuel consumption. The fuel consumption of FQF is 10.26% and 5.34% less than that of the rule-based strategy and ECMS, respectively. The contribution of this paper will speed up the application of DRL algorithms to HEV energy management.

Point-to-Point Response to Reviewers

Response to the editor and reviewers


Henrik Lund
Editor-in-Chief, Energy

Title: A Comparative Study of 13 Deep Reinforcement Learning Based Energy Management Methods for a Hybrid Electric Vehicle

Ms. Ref. No.: EGY-D-22-07001

Dear Prof. Lund:

The above referenced manuscript has been revised as per reviewer suggestions and is being
resubmitted. We greatly appreciate the constructive and insightful comments from the reviewers.
Having read carefully the reviews from our original manuscript, we feel that we can satisfactorily
address their comments and questions. Our responses to each of their remarks are included below.

Changes in the revised manuscript are highlighted in red.

We organized our responses in a “threaded” format, where reviewer and associate editor comments
are included as normal typeface and our responses are denoted by italic text. We hope you find the
manuscript much improved in its clarity, content and scope.

Thank you and the reviewers for your efforts in the constructive review of this paper.

Sincerely,

Bin Xu (binxu@ou.edu) – corresponding author


Reviewer #1
The curves in the figure 6&7 can be distinguished by lines. Check the table 3, please.

Response/ Action: We thank the reviewer for this comment. Grids have been added to Figures 6 and 7 to make the lines distinguishable. The mistake in Table 3 has been corrected in the revised manuscript (unnecessary lines removed).
The modified Figures 6 and 7 and Table 3 are shown below:

Fig. 7 Comparison of DRL algorithms in continuous action space.


Fig. 8 Comparison of DQN in different discrete action spaces
Table 3 Parameters of DQN algorithm in different discrete action spaces.
Action Space | Computation Time (s) | Convergence Reward | Convergence Episode | Convergence Time (s) | UDDS Test Reward | UDDS Test MPGE | WLTP Test Reward | WLTP Test MPGE | HWFET Test Reward | HWFET Test MPGE
50 | 1045 | -1408 | 6 | 30 | -599.2 | 42.09 | -1404.9 | 34.58 | -933.8 | 37.04
200 | 1041 | -1412 | 28 | 146 | -600.0 | 42.03 | -1403.9 | 34.63 | -932.9 | 37.09
2000 | 1067 | -1402 | 31 | 155 | -572.5 | 44.14 | -1389.3 | 35.01 | -932.5 | 37.11
5000 | 1074 | -1415 | 77 | 528 | -546.9 | 46.24 | -1390.6 | 35.01 | -932.6 | 37.10

vehicle speed unit need to be revised in fig 9(a).

Response/ Action: We thank the reviewer for this comment. The vehicle speed unit has been replaced with the correct one in the revised manuscript (mps to mph).
The modified figure is shown below:
Fig. 10 Comparison of SAC, FQF, rule-based, and ECMS EMSs in UDDS Test
A combined driving cycle. should be used, for example, NEDC,WLTP ...
Response/ Action: We thank the reviewer for this comment. Besides the UDDS cycle used in the original work, the WLTP and HWFET cycles are included in the revised manuscript. WLTP contains the most comprehensive set of situations, so we utilize the WLTP driving cycle to train the agents. In the testing process, we utilize all three driving cycles (UDDS, WLTP, HWFET).
The training and testing results are as follows:
Table 2 Parameters of DRL algorithms in continuous action space.
Agent | Computation Time (s) | Convergence Reward | Convergence Episode | Convergence Time (s) | UDDS Test Reward | UDDS Test MPGE | WLTP Test Reward | WLTP Test MPGE | HWFET Test Reward | HWFET Test MPGE
PPO | 866 | -1412.2 | 11 | 39 | -774.1 | 32.57 | -1573.3 | 30.93 | -932.2 | 37.11
TRPO | 1406 | -1406.1 | 21 | 144 | -782.2 | 32.20 | -1662.0 | 29.23 | -933.6 | 37.01
TD3 | 1321 | -1406.4 | 31 | 186 | -602.7 | 41.84 | -1406.8 | 34.56 | -931.2 | 37.16
SAC | 2042 | -1405.2 | 74 | 797 | -601.5 | 41.94 | -1401.2 | 34.68 | -932.6 | 37.10
DDPG | 1201 | -1405.7 | 75 | 446 | -605.8 | 41.62 | -1436.8 | 33.81 | -934.2 | 37.03

Table 3 Parameters of DQN algorithm in different discrete action spaces.


Action Space | Computation Time (s) | Convergence Reward | Convergence Episode | Convergence Time (s) | UDDS Test Reward | UDDS Test MPGE | WLTP Test Reward | WLTP Test MPGE | HWFET Test Reward | HWFET Test MPGE
50 | 1045 | -1408 | 6 | 30 | -599.2 | 42.09 | -1404.9 | 34.58 | -933.8 | 37.04
200 | 1041 | -1412 | 28 | 146 | -600.0 | 42.03 | -1403.9 | 34.63 | -932.9 | 37.09
2000 | 1067 | -1402 | 31 | 155 | -572.5 | 44.14 | -1389.3 | 35.01 | -932.5 | 37.11
5000 | 1074 | -1415 | 77 | 528 | -546.9 | 46.24 | -1390.6 | 35.01 | -932.6 | 37.10

Table 4 Parameters of DRL algorithm in discrete action space.


Agent | Computation Time (s) | Convergence Reward | Convergence Episode | Convergence Time (s) | UDDS Test Reward | UDDS Test MPGE | WLTP Test Reward | WLTP Test MPGE | HWFET Test Reward | HWFET Test MPGE
IQN | 1167 | -1409 | 35 | 216 | -586.8 | 43.00 | -1402.5 | 34.65 | -931.2 | 37.16
SAC | 1414 | -1413 | 48 | 317 | -574.7 | 43.94 | -1386.7 | 35.06 | -932.9 | 37.08
DQN | 1027 | -1423 | 31 | 179 | -572.7 | 44.14 | -1389.3 | 35.01 | -932.6 | 37.10
D2QN | 1063 | -1407 | 35 | 185 | -575.0 | 43.92 | -1396.3 | 34.82 | -931.6 | 37.13
D3QN | 1542 | -1406 | 46 | 302 | -602.8 | 41.84 | -1406.1 | 34.56 | -930.6 | 37.18
FQF | 1813 | -1404 | 56 | 523 | -572.6 | 44.16 | -1371.6 | 35.49 | -934.4 | 37.03
C51 | 1241 | -1408 | 63 | 399 | -578.9 | 43.61 | -1378.7 | 35.27 | -933.7 | 37.05
QR-DQN | 2113 | -1406 | 32 | 328 | -583.9 | 43.27 | -1372.2 | 35.45 | -933.9 | 37.05
Rainbow | 2378 | -1405 | 60 | 549 | -602.1 | 41.90 | -1398.8 | 34.77 | -932.4 | 37.11

the authors adopt the 2016 Toyota Highlander Hybrid Vehicle, the struture of the vehicle
should be given

Response/ Action: We thank the reviewer for this comment. The vehicle propulsion system structure has been added to the first paragraph of Section 2 in the revised manuscript as follows:
The vehicle propulsion system architecture is given in Fig. 1. As shown in the figure, both the
engine and electric motor (EM) supply power to the front wheel.
Fig. 1. Vehicle propulsion system architecture.

The RPM of the engine should be given in the simulation results

Response: We thank the reviewer for this comment. It is correct that most papers in the EMS field provide engine RPM in the results analysis. However, our case is slightly different because of the FASTsim model we utilized. Due to the large number of simulations required by the reinforcement learning algorithms, a computationally efficient model has to be utilized in this paper. In order to speed up the simulation, the FASTsim model only considers the power values of the engine, and the details of speed/torque are not considered. Therefore, RPM information is not available in our paper.
Action: We added the following description in the first paragraph of section 2.3 to clarify this point in the revised manuscript:
Efficiency is usually modeled as a function of speed and torque. However, it is modeled here as a function of power to reduce the computation cost of the model, as described by the FASTsim development team [27], which means the speed and torque details are not provided by the model and only power values are provided for the ICE and EM.

Reviewer #2:
Not very clearly. I think a big limitation is the fact that the algorithm is trained with fixed
driving cycles.

Response/ Action: We thank the reviewer for this comment. In order to address this limitation, we use three different driving cycles (WLTP, UDDS, HWFET). We use the WLTP driving cycle to train the agent based on the comprehensive information it provides (different speeds (low, medium, high and extra high) and a variety of driving conditions (different driving phases, stops, acceleration and braking)). The agent trained in the WLTP environment will provide flexibility in dealing with familiar scenarios. To validate the performance, the test process is set on the UDDS urban driving cycle and the HWFET highway driving cycle. The results show that the agent trained on WLTP can handle the UDDS and HWFET driving cycles with good performance in the revised manuscript.
The test results and description are as follows:
“The comparison of the result shows that PPO, TRPO and TD3 have higher learning speeds at
the beginning of the training. The reward of SAC changes slowly compared with the rest of the
algorithms. As we can see from Table 2, SAC has the highest convergence reward. However, the
convergence speed of SAC is much slower than other DRL algorithms. PPO has the fastest
convergence speed compared to other algorithms. PPO requires 72.9%, 79.0%, 95.1% and 91.3%
less time consumption than TRPO, TD3, SAC and DDPG, respectively, to reach convergence.
SAC’s convergence reward is 0.50%, 0.06%, 0.09%, 0.04% higher than PPO, TRPO, TD3, DDPG
respectively. We also can notice that TD3’s learning curves are more stable than SAC, DDPG,
PPO and TRPO after achieving convergence according to its flatness. For training the same total
episode, PPO saves 38.4%, 34.4%, 57.6% and 27.9% time than TRPO, TD3, SAC and DDPG,
respectively. Also, from the table record, SAC has the highest test reward and MPGE value in
WLTP and UDDS. SAC’s test reward is 10.9%, 15.7%, 0.40%, 2.48% higher than PPO, TRPO,
TD3, DDPG respectively.”
Table 2 Parameters of DRL algorithms in continuous action space.
Agent | Computation Time (s) | Convergence Reward | Convergence Episode | Convergence Time (s) | UDDS Test Reward | UDDS Test MPGE | WLTP Test Reward | WLTP Test MPGE | HWFET Test Reward | HWFET Test MPGE
PPO | 866 | -1412.2 | 11 | 39 | -774.1 | 32.57 | -1573.3 | 30.93 | -932.2 | 37.11
TRPO | 1406 | -1406.1 | 21 | 144 | -782.2 | 32.20 | -1662.0 | 29.23 | -933.6 | 37.01
TD3 | 1321 | -1406.4 | 31 | 186 | -602.7 | 41.84 | -1406.8 | 34.56 | -931.2 | 37.16
SAC | 2042 | -1405.2 | 74 | 797 | -601.5 | 41.94 | -1401.2 | 34.68 | -932.6 | 37.10
DDPG | 1201 | -1405.7 | 75 | 446 | -605.8 | 41.62 | -1436.8 | 33.81 | -934.2 | 37.03

“As shown in Table 3, the computation time does not have too much difference among action
spaces 50, 200, 2000 and 5000. However, the time cost to achieve convergence in action spaces
50 is much less than in action spaces 200, 2000 and 5000. The final test reward of DQN in action
space 2000 is the highest, which is 1.11%, 1.04%, and 0.09% higher than action spaces 50, 200,
and 5000, respectively in WLTP. DQN applied in action space 2000 has the best fuel economy,
which is 25.01 MPGE. In the following analysis, 2000 is selected as the action space discretization.”
Table 3 Parameters of DQN algorithm in different discrete action spaces.
Action Space | Computation Time (s) | Convergence Reward | Convergence Episode | Convergence Time (s) | UDDS Test Reward | UDDS Test MPGE | WLTP Test Reward | WLTP Test MPGE | HWFET Test Reward | HWFET Test MPGE
50 | 1045 | -1408 | 6 | 30 | -599.2 | 42.09 | -1404.9 | 34.58 | -933.8 | 37.04
200 | 1041 | -1412 | 28 | 146 | -600.0 | 42.03 | -1403.9 | 34.63 | -932.9 | 37.09
2000 | 1067 | -1402 | 31 | 155 | -572.5 | 44.14 | -1389.3 | 35.01 | -932.5 | 37.11
5000 | 1074 | -1415 | 77 | 528 | -546.9 | 46.24 | -1390.6 | 35.01 | -932.6 | 37.10

“DQN saves 3.2%, 40.7%, 43.5%, 17.1%, 55.1%, 45.4%, 65.8%, 67.4% computation time than
D2QN, D3QN, SAC, IQN, C51, QR-DQN, FQF, and Rainbow, respectively, to reach convergence.
FQF has the best convergence reward, which is 0.14%, 0.28%, 0.07% higher than QR-DQN, C51
and Rainbow, 1.34%, 0.21%, 0.14% higher than DQN, D2QN and D3QN respectively, 0.35%,
0.64% higher than IQN and SAC. To train over the same total episode, DQN saves 12.0%, 27.4%,
3.4%, 33.4%, 43.4%, 17.2%, 51.4%, 56.8% time than IQN, SAC, D2QN, D3QN, FQF, C51, QR-
DQN, Rainbow respectively. And FQF has the best MPGE performance (35.49) compared to other
algorithms in WLTP”
Table 4 Parameters of DRL algorithm in discrete action space.
Agent | Computation Time (s) | Convergence Reward | Convergence Episode | Convergence Time (s) | UDDS Test Reward | UDDS Test MPGE | WLTP Test Reward | WLTP Test MPGE | HWFET Test Reward | HWFET Test MPGE
IQN | 1167 | -1409 | 35 | 216 | -586.8 | 43.00 | -1402.5 | 34.65 | -931.2 | 37.16
SAC | 1414 | -1413 | 48 | 317 | -574.7 | 43.94 | -1386.7 | 35.06 | -932.9 | 37.08
DQN | 1027 | -1423 | 31 | 179 | -572.7 | 44.14 | -1389.3 | 35.01 | -932.6 | 37.10
D2QN | 1063 | -1407 | 35 | 185 | -575.0 | 43.92 | -1396.3 | 34.82 | -931.6 | 37.13
D3QN | 1542 | -1406 | 46 | 302 | -602.8 | 41.84 | -1406.1 | 34.56 | -930.6 | 37.18
FQF | 1813 | -1404 | 56 | 523 | -572.6 | 44.16 | -1371.6 | 35.49 | -934.4 | 37.03
C51 | 1241 | -1408 | 63 | 399 | -578.9 | 43.61 | -1378.7 | 35.27 | -933.7 | 37.05
QR-DQN | 2113 | -1406 | 32 | 328 | -583.9 | 43.27 | -1372.2 | 35.45 | -933.9 | 37.05
Rainbow | 2378 | -1405 | 60 | 549 | -602.1 | 41.90 | -1398.8 | 34.77 | -932.4 | 37.11

Yes, I recommend changing section 2 from "modeling" to "environment" where


"powertrain model", "driving cycles", "state, action and reward" are clearly defined.
Response/ Action: We thank the reviewer for this comment. The name of Section 2 has been changed to "Environment Modeling" in the revised manuscript.
P2L8, " But the fuel-saving performance of this stepwise optimization shows a limitation
in real-time." This is not very clear. Consider "The one step optimization results in an
inferior fuel-saving performance.".
Response/ Action: We thank the reviewer for this comment. The sentence has been replaced with "The one step optimization results in an inferior fuel-saving performance." in the revised manuscript.
"The greedy algorithm is one of methods to balance the exploitation and exploration.",
this is very confusing. Greedy only exploits.
Response/ Action: We thank the reviewer for this comment. The "greedy algorithm" has been replaced with the "ε-greedy algorithm" for better expression in the revised manuscript. ε-greedy is a method that balances exploration and exploitation by choosing between them randomly. The ε refers to the probability of choosing to explore rather than exploit. Most of the time the chance of exploring is small, and the agent chooses the greedy action to get the most reward by exploiting.
The revised one is as follows:
“The ε-greedy algorithm is one of methods to balance the exploitation and exploration. It chooses
exploration and exploitation randomly with different probability ε. In most time, the chance of
exploration is small and it chooses the greedy action to get the highest reward by exploiting.”
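As an illustration of this balance, a minimal Python sketch of ε-greedy action selection is given below; the function name, the fixed ε value, and the toy Q-values are ours and are not taken from the manuscript's implementation.

```python
import random

def epsilon_greedy_action(q_values, epsilon=0.1):
    """Pick a discrete action index from a list of Q-value estimates.

    With probability epsilon the agent explores (random action);
    otherwise it exploits current knowledge (greedy action).
    """
    if random.random() < epsilon:
        # Explore: any action index is equally likely.
        return random.randrange(len(q_values))
    # Exploit: index of the largest Q-value estimate.
    return max(range(len(q_values)), key=lambda a: q_values[a])

# Example: four hypothetical discretized power-split actions.
print(epsilon_greedy_action([0.2, -1.3, 0.9, 0.1], epsilon=0.1))
```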
"And it was showed that temporal difference learning had a relatively better performance
in convergence part and can be applied in non-Markovian environment." Language can
be improved. In addition, does the cited work use POMDP assumption? Otherwise, I don't
think RL method can get away with MDP assumption.
Response/ Action: We thank the reviewer for this comment. To eliminate the confusion, the sentence has been modified to "The results comparison showed that the temporal difference learning algorithm had a good fuel-saving performance in both real-world and testing cycles. The research also showed that this power management policy does not need complete information of the driving cycle in advance.". The phrase 'non-Markovian environment' was not a clear expression. The cited work uses the MDP assumption.
I recommend adding "Zhu, Zhaoxuan, Yuxing Liu, and Marcello Canova. "Energy
management of hybrid electric vehicles via deep Q-networks." 2020 American Control
Conference (ACC). IEEE, 2020." to the literature review in the first paragraph in P3, where
it has a comparison among ECMS, DP and DQN on the fuel-saving performance.
Response/ Action: We thank the reviewer for this comment. The suggested paper and the related description were added to the first paragraph on P3 in the revised manuscript. (In [21], a model-free deep reinforcement learning method (DQN) was developed for a mild-HEV EMS with a traffic model built in SUMO. The results were compared with solutions from DP and A-ECMS. It showed that deep reinforcement learning provides a more general framework when extra information and situations are considered.)
"Unlike conventional vehicle models built in Matlab/Simulink, an OPEN-AI Gym like HEV
propulsion model is built in Python." I don't think this is a academic contribution. In "Zhu,
Zhaoxuan, et al. "A deep reinforcement learning framework for eco-driving in connected
and automated hybrid electric vehicles." arXiv preprint arXiv:2101.05372 (2021).", the
environment is developed in Python and connected to traffic simulator SUMO for more
realistic scenario". I do very much acknowledge the contribution that you can open source
the environment for better comparison.
Response/ Action: We thank the reviewer for this valuable comment and for acknowledging our open-source decision. Therefore, we integrate this OPEN-AI Gym-like environment into the open-source contribution and reduce the contributions from four to three as follows:

• Unlike conventional vehicle models built in Matlab/Simulink, an OPEN-AI Gym like HEV propulsion model is built in Python. Following the format of OPEN-AI Gym, this vehicle environment can be directly connected to all popular DRL frameworks. The entire code sets utilized in this benchmark generation are made available on GitHub (https://github.com/LittleWebCat/DRL-Base-EMS) so that this benchmark can be utilized in newly developed DRL algorithm evaluation in the HEV EMS field.
• 13 popular DRL algorithms are introduced with architecture diagrams, and both discrete and continuous action space algorithms are considered.
• Key measures, including cumulative reward, convergence reward, convergence fuel economy, convergence episode, and training time, are compared.

"The entire code sets utilized in this benchmark generation are made available on GitHub
so that this benchmark can be utilized in newly developed DRL algorithm evaluation in
HEV EMS field. " Please include the link here too to make it more obvious, I do consider
this as a major contribution.
Response/ Action: We thank the reviewer for this comment. The link has been included in "The entire
code sets utilized in this benchmark generation are made available on GitHub so that this
benchmark can be utilized in newly developed DRL algorithm evaluation in HEV EMS field.
(https://github.com/LittleWebCat/DRL-Base-EMS) " in the revised manuscript.
"Learning is conducted on a UDDS driving cycle." This is not ideal. The advantage of the
RL algorithm over the classic ECMS is its ability to generalize without engineering tuning.
The environment and the benchmark framework is should include more scenarios for
proper training and validation separation.
Response/ Action: We thank the reviewer for this comment. The learning process is now conducted on the WLTP driving cycle. The WLTP cycle is divided into four parts covering different speeds (low, medium, high and extra high), and each part contains a variety of driving conditions (different driving phases, stops, acceleration and braking). Therefore, the WLTP driving cycle contains the most comprehensive set of situations compared to UDDS and HWFET. To train a flexible agent, a complex and varied training environment is necessary in the training process, so using WLTP as the training environment is reasonable in our research. To test the adaptability of the agent trained on WLTP in different scenarios, the tests are applied on the UDDS and HWFET driving cycles, which contain urban driving conditions and highway conditions.
How does the agent handle scenarios such as "battery SoC cannot go below certain level"?
It is not reflected in rewards in Equation (15)?
Response/ Action: We thank the reviewer for pointing out this key issue. We solve this SOC depletion issue by adding a constraint in the environment model rather than considering the SoC in the reward. In Eq. (11), we reduce the battery power output when the SOC level is below the reference and only allow the battery to power the auxiliary load:
$$P_{EM,dmd} = \begin{cases} P_{dmd}\,\eta_{trans}, & P_{dmd} < 0 \\ P_{EM,max}(2a-1), & P_{dmd} > 0,\ SOC > SOC_{ref} \\ P_{aux}, & P_{dmd} > 0,\ SOC \le SOC_{ref} \end{cases} \qquad (11)$$

Fig 6 missing SAC? Actor-critic can all handle continuous space.


Response/ Action: We thank the reviewer for this comment. SAC has been added to Fig. 6, and the related analysis is explained in the revised manuscript.
The modified figure and explanation are shown as follows:
“As we can see from Table 2, SAC has the highest convergence reward. However, the convergence
speed of SAC is much slower than other DRL algorithms. PPO has the fastest convergence speed
compared to other algorithms. PPO requires 72.9%, 79.0%, 95.1% and 91.3% less time
consumption than TRPO, TD3, SAC and DDPG, respectively, to reach convergence. SAC’s
convergence reward is 0.50%, 0.06%, 0.09%, 0.04% higher than PPO, TRPO, TD3, DDPG
respectively. We also can notice that TD3’s learning curves are more stable than SAC, DDPG,
PPO and TRPO after achieving convergence according to its flatness. For training the same total
episode, PPO saves 38.4%, 34.4%, 57.6% and 27.9% time than TRPO, TD3, SAC and DDPG,
respectively. Also, from the table record, SAC has the highest test reward and MPGE value in
WLTP and UDDS. SAC’s test reward is 10.9%, 15.7%, 0.40%, 2.48% higher than PPO, TRPO,
TD3, DDPG respectively.”
Fig. 7 Comparison of DRL algorithms in continuous action space.
How is training and testing sets separated?
Response/ Action: We thank the reviewer for this comment. Among the three driving cycle types, UDDS contains 1369 steps, WLTP contains 1800 steps, and HWFET contains 765 steps. The WLTP driving cycle therefore contains the most comprehensive set of situations; it covers more situations and is more complex than UDDS and HWFET. In reinforcement learning, an agent trained in a complex environment will be able to handle more situations than one trained in a less complex environment. So, in our research, the agent is trained on the WLTP driving cycle and the best-performing model is saved. In order to test the agent performance under different driving cycle conditions, the tests are run on the WLTP, UDDS and HWFET driving cycles using the best-performing model from WLTP. The results show that the agent trained on WLTP can handle the UDDS and HWFET driving cycles with good performance.
In general, please highlight the best algorithms in bold for each column in tables.
Response/ Action: We thank the reviewer for this comment. The best algorithms' results are highlighted in bold in the tables in the revised manuscript.
The modified tables are shown below:
Table 2 Parameters of DRL algorithms in continuous action space.
Agent | Computation Time (s) | Convergence Reward | Convergence Episode | Convergence Time (s) | UDDS Test Reward | UDDS Test MPGE | WLTP Test Reward | WLTP Test MPGE | HWFET Test Reward | HWFET Test MPGE
PPO | 866 | -1412.2 | 11 | 39 | -774.1 | 32.57 | -1573.3 | 30.93 | -932.2 | 37.11
TRPO | 1406 | -1406.1 | 21 | 144 | -782.2 | 32.20 | -1662.0 | 29.23 | -933.6 | 37.01
TD3 | 1321 | -1406.4 | 31 | 186 | -602.7 | 41.84 | -1406.8 | 34.56 | -931.2 | 37.16
SAC | 2042 | -1405.2 | 74 | 797 | -601.5 | 41.94 | -1401.2 | 34.68 | -932.6 | 37.10
DDPG | 1201 | -1405.7 | 75 | 446 | -605.8 | 41.62 | -1436.8 | 33.81 | -934.2 | 37.03
Table 3 Parameters of DQN algorithm in different discrete action spaces.
Action Space | Computation Time (s) | Convergence Reward | Convergence Episode | Convergence Time (s) | UDDS Test Reward | UDDS Test MPGE | WLTP Test Reward | WLTP Test MPGE | HWFET Test Reward | HWFET Test MPGE
50 | 1045 | -1408 | 6 | 30 | -599.2 | 42.09 | -1404.9 | 34.58 | -933.8 | 37.04
200 | 1041 | -1412 | 28 | 146 | -600.0 | 42.03 | -1403.9 | 34.63 | -932.9 | 37.09
2000 | 1067 | -1402 | 31 | 155 | -572.5 | 44.14 | -1389.3 | 35.01 | -932.5 | 37.11
5000 | 1074 | -1415 | 77 | 528 | -546.9 | 46.24 | -1390.6 | 35.01 | -932.6 | 37.10

Table 4 Parameters of DRL algorithm in discrete action space.


Agent | Computation Time (s) | Convergence Reward | Convergence Episode | Convergence Time (s) | UDDS Test Reward | UDDS Test MPGE | WLTP Test Reward | WLTP Test MPGE | HWFET Test Reward | HWFET Test MPGE
IQN | 1167 | -1409 | 35 | 216 | -586.8 | 43.00 | -1402.5 | 34.65 | -931.2 | 37.16
SAC | 1414 | -1413 | 48 | 317 | -574.7 | 43.94 | -1386.7 | 35.06 | -932.9 | 37.08
DQN | 1027 | -1423 | 31 | 179 | -572.7 | 44.14 | -1389.3 | 35.01 | -932.6 | 37.10
D2QN | 1063 | -1407 | 35 | 185 | -575.0 | 43.92 | -1396.3 | 34.82 | -931.6 | 37.13
D3QN | 1542 | -1406 | 46 | 302 | -602.8 | 41.84 | -1406.1 | 34.56 | -930.6 | 37.18
FQF | 1813 | -1404 | 56 | 523 | -572.6 | 44.16 | -1371.6 | 35.49 | -934.4 | 37.03
C51 | 1241 | -1408 | 63 | 399 | -578.9 | 43.61 | -1378.7 | 35.27 | -933.7 | 37.05
QR-DQN | 2113 | -1406 | 32 | 328 | -583.9 | 43.27 | -1372.2 | 35.45 | -933.9 | 37.05
Rainbow | 2378 | -1405 | 60 | 549 | -602.1 | 41.90 | -1398.8 | 34.77 | -932.4 | 37.11

Is there a terminal SoC constraint to the problem? It is critical.


Response: We thank the reviewer for noticing this issue. We agree that most literature in this area applies this constraint. However, we did not constrain the terminal SoC to be the same as the initial SoC, as we think it is not practical under the actual varying driving conditions after vehicles are sold to customers. As the actual driving distance changes, the final/terminal SoC will not be the same for most real-world driving. Therefore, we utilized equivalent MPG (MPGE) to evaluate the fuel economy performance using the 33.7 kWh per gallon of gasoline conversion provided by the EPA, which is also widely used by the EPA for new car labeling.
Action: In the first paragraph of section 4.1, the following sentence is added:
The final test reward and final test miles per gallon equivalent (MPGE) are the test results using the best-performing agent saved during the learning process. MPGE accounts for the difference between the initial and final SOC of the battery by converting the SOC difference to gasoline using the 33.7 kWh per gallon ratio provided by the EPA [49].
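As a worked illustration of this bookkeeping, the short sketch below shows one way the SOC difference could be folded into an equivalent-fuel figure using the 33.7 kWh-per-gallon factor; the function name, variable names, and example numbers are hypothetical, not values from the paper.

```python
KWH_PER_GALLON = 33.7  # EPA energy content of one gallon of gasoline

def mpge(distance_mi, fuel_gal, soc_initial, soc_final, battery_kwh):
    """Equivalent MPG: gasoline burned plus the net battery energy
    (SOC difference) converted to gallons with the EPA factor."""
    battery_gal = (soc_initial - soc_final) * battery_kwh / KWH_PER_GALLON
    return distance_mi / (fuel_gal + battery_gal)

# Hypothetical test: 7.45 mi driven, 0.16 gal of fuel burned,
# SOC drops from 0.65 to 0.60 on a 2 kWh pack.
print(round(mpge(7.45, 0.16, 0.65, 0.60, 2.0), 2))
```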
Reviewer #3:
It is not clear what's the gap.

Response/ Action: We thank the reviewer for this comment. The gap is summarized in the third-to-last paragraph of the introduction section as follows:
Based on the literature review, a research gap is identified: a unified benchmark of different DRL-based EMSs for electrified powertrains is lacking. Most existing studies contain only two to three DRL algorithms and report the final reward without computation cost information. As different DRL algorithms are studied in different EMS literature, it is crucial to have a unified benchmark of various popular DRL algorithms to make a fair comparison of their performance.

Please re-summarize the main contributions of this paper. I don't think using python to
recreate a model built in Matlab/Simulink can be a major contribution.
Response/ Action: We thank the reviewer for pointing out this issue. We revised the contributions in the second-to-last paragraph of the introduction section as follows:

• Unlike conventional vehicle models built in Matlab/Simulink, an OPEN-AI Gym like HEV propulsion model is built in Python. Following the format of OPEN-AI Gym, this vehicle environment can be directly connected to all popular DRL frameworks. The entire code sets utilized in this benchmark generation are made available on GitHub (https://github.com/LittleWebCat/DRL-Base-EMS) so that this benchmark can be utilized in newly developed DRL algorithm evaluation in the HEV EMS field.
• 13 popular DRL algorithms are introduced with architecture diagrams, and both discrete and continuous action space algorithms are considered.
• Key measures, including cumulative reward, convergence reward, convergence fuel economy, convergence episode, and training time, are compared.

Nothing new in the detection section. According to my understanding, the 13 deep


reinforcement learning algorithms proposed in this manuscript are primarily basic
models. Therefore, the performance comparison of the models cannot be regarded as
the main innovation. Please provide something new and rewrite this section.
Response/ Action: We thank the reviewer for this comment. The main contribution of this work is the benchmark and the comparative analysis. This study provides the first open-source benchmark in the area of DRL-based EMS for HEVs, which will keep researchers in this area from re-inventing the wheel when they evaluate their new algorithms. A benchmark is a standard tool in the DRL community to ensure a fair comparison when new algorithms are developed. Currently, due to the lack of a unified benchmark in the area of DRL-based EMS, much repeated effort is spent by different researchers.
In terms of revision, all the DRL algorithms are trained on the WLTP combined driving cycle instead of the UDDS urban driving cycle in the revised manuscript to consider a wider range of operating conditions. In addition, validation/testing on the UDDS urban driving cycle and the HWFET highway driving cycle has been added as new results to verify the trained agents under unseen operating conditions. Finally, SAC with a continuous action space has been added to the training and results analysis.
Revised Manuscript with No Changes Marked

A Comparative Study of 13 Deep Reinforcement Learning Based Energy Management Methods for a Hybrid Electric Vehicle

Hanchen Wang1, Yiming Ye2, Jiangfeng Zhang2, Bin Xu1,*
1: The University of Oklahoma, School of Aerospace and Mechanical Engineering, 865 Asp Ave, Norman, OK, 73019, USA.
2: Clemson University, Department of Automotive Engineering, 4 Research Dr., Greenville, SC, 29607, USA.
* Corresponding author: Bin Xu, The University of Oklahoma, School of Aerospace and Mechanical Engineering, 865 Asp Ave, Norman, OK, 73019, USA. (binxu@ou.edu)
Abstract – Energy management strategy (EMS) has a huge impact on the energy efficiency of hybrid electric vehicles (HEVs). Recently, a fast-growing number of studies have applied different deep reinforcement learning (DRL) based EMSs to HEVs. However, a unified performance review benchmark is lacking for the most popular DRL algorithms. In this study, 13 popular DRL algorithms are applied as HEV EMSs. The reward performance, computation cost, and learning convergence of the different DRL algorithms are discussed. In addition, the HEV environments are modified to fit both discrete and continuous action spaces. The results show that agent learning in the continuous action space is more stable than in the discrete action space. In the continuous action space, SAC has the highest reward, and PPO has the lowest time cost. In the discrete action space, DQN has the lowest time cost, and FQF has the highest reward. The comparison among SAC, FQF, rule-based, and equivalent consumption minimization strategies (ECMS) shows that the DRL EMSs run the engine more efficiently, thus reducing fuel consumption. The fuel consumption of FQF is 10.26% and 5.34% less than that of the rule-based strategy and ECMS, respectively. The contribution of this paper will speed up the application of DRL algorithms to HEV energy management.

Keywords: Hybrid electric vehicle, Energy management strategy, Deep reinforcement learning.
1. Introduction

Driven by the emerging demand for low emissions and less dependence on fossil fuel energy, vehicle original equipment manufacturers (OEMs) and researchers work on many potential solutions. Hybrid electric vehicles (HEVs) are investigated for their high energy efficiency and low emissions. An HEV combines an internal combustion engine (ICE) and one or more electric motors for traction, which is fuel efficient and environmentally friendly [1].

When a vehicle contains multiple power sources, the energy management strategy (EMS) is critical. By coordinating the multiple power sources, the vehicle fulfills the power demand during driving, with electric-only or power-split modes [2]. The aim of the EMS is to maximize powertrain system efficiency and achieve low fuel consumption [3]. In the area of HEV EMS, there are several popular methods, including the rule-based method [4], the Equivalent Consumption Minimization Strategy (ECMS) [5], Model Predictive Control (MPC) [6], and Dynamic Programming (DP) [7]. The rule-based method uses expert knowledge and has the least computation cost among all the EMSs. However, its dependence on expert knowledge leads to inconsistency and lack of optimality. ECMS optimizes the equivalent fuel consumption at each time step and can be easily implemented in real time. The one step optimization results in an inferior fuel-saving performance. Compared to ECMS, MPC considers multiple time steps and optimizes over a moving horizon [8]. MPC has better optimization performance but costs more computation resources due to its multiple-time-step optimization. Also, MPC generates locally optimal results and does not guarantee a global optimum. DP can guarantee the globally optimal solution, but it has a high computation cost and is thus usually implemented offline. Some researchers combined machine learning with DP-derived rules and tested them in real time. However, during the implementation, it was found that some key data information could be missed and affect the final result. The advantage of using Reinforcement Learning (RL) supervisory control is that it costs less computation resource than DP and can be applied in real time.

In RL, the goal of the agent is to maximize the final result, which is the cumulative reward. The agent needs to distinguish which actions have a positive effect on the cumulative reward through an error-comparison process. Each action can affect current and future rewards. The main steps of RL are that the agent receives the states from the environment, takes specific actions, receives rewards, and analyzes the effect of the actions to update the model. One important factor during learning is how the agent balances exploitation and exploration. Exploration is used to collect more information from the environment, and exploitation is used to output an action based on current knowledge. The ε-greedy algorithm is one method to balance exploitation and exploration. It chooses exploration or exploitation randomly with probability ε. Most of the time the chance of exploring is small, and the agent chooses the greedy action to get the highest reward by exploiting. Another important factor affecting RL performance is the environment. Throughout the interactions with the environment, the agent stores the states and generates suitable actions to improve the cumulative results. The environment utilized in RL is usually assumed to be a Markov Decision Process, which means the conditional probability distribution of the environment's future states depends only on the current state instead of previous states. In the basic situation, an agent interacts with the environment to receive states and rewards and outputs the action. In the HEV energy-management problem, the environment model can be regarded as the driving conditions, powertrain dynamics, etc. [9]. The agents are power-split controllers, which can be driven by different RL algorithms. The objective of this controller is to search for a sequence of actions so that vehicle fuel economy is optimal.
Recently, many RL-based HEV EMS studies have been conducted. In [10], the results comparison showed that the temporal difference learning algorithm had a good fuel-saving performance in both real-world and testing cycles. The research also showed that this power management policy does not need complete information of the driving cycle in advance. In [11], an EMS was implemented on a plug-in HEV, which discussed options to charge along the way. In [12], RL was utilized to minimize the total consumption of fuel and electricity in a plug-in HEV. In [13], RL was used to implement an EMS on a parallel HEV with speed prediction. Based on the driving cycle data, the vehicle speed was predicted by a nearest-neighbors predictor and a fuzzy predictor, and a Q-learning algorithm was used for energy management. In [14] and [15], RL was utilized for control optimization based on a Transition Probability Matrix (TPM) updated by Kullback–Leibler divergence. In [16], in order to improve the convergence performance, a fast Q-learning algorithm was implemented, and cloud computation was proposed to solve the problem of computational burden in the learning process. In addition, [17] came up with the Fuzzy Q-learning algorithm, where fuzzy parameters are optimized by the Q-value function and a neural network is used for estimation. In [18], a fuzzy logic controller was added on top of RL algorithms to improve the accuracy of online energy management prediction. In [19], a new function formed by weighted fuel and battery consumption was added to the Q-learning algorithm and applied on a 48 V HEV. In [20], an actor-critic framework was introduced to optimize parameters in the engine control logic. In [21], a model-free deep reinforcement learning method (DQN) was developed for a mild-HEV EMS with a traffic model built in SUMO. The results were compared with solutions from DP and A-ECMS. It showed that deep reinforcement learning provides a more general framework when extra information and situations are considered.
Besides the study of single agent RL algorithms, some research also applied multiple agent RL algorithms. Qi et al. [22] compared natural DQN with dueling DQN in energy management of a plug-in HEV, and dueling DQN showed a faster convergence speed than natural DQN. Wu et al. [23] developed a DRL-based EMS for a parallel plug-in hybrid electric bus in continuous space using deep deterministic policy gradient (DDPG). Inuzuka et al. [24] used proximal policy optimization (PPO) to solve the continuous-space problem, and PPO learned robustly during the learning loop. In [25], Wu et al. utilized the soft actor-critic (SAC) DRL algorithm to allot the energy, and SAC showed relatively better performance in convergence and optimization. Zhou et al. [26] developed a novel EMS using the Twin Delayed DDPG (TD3) DRL algorithm and compared TD3 with DDPG and DQN to show its advantage. Based on the literature review, a research gap is identified: a unified benchmark of different DRL-based EMSs for electrified powertrains is lacking. Most existing studies contain only two to three DRL algorithms and report the final reward without computation cost information. As different DRL algorithms are studied in different EMS literature, it is crucial to have a unified benchmark of various popular DRL algorithms to make a fair comparison of their performance.

This study introduces an OPEN-AI Gym-like HEV model programmed in Python rather than Matlab/Simulink for better DRL-based EMS evaluation. Both discrete and continuous action spaces are formulated to control the power split between the engine and the electric motor. Then 13 popular DRL algorithms are introduced for comparison purposes. To facilitate the comparison, hyperparameters of the different DRL algorithms are unified, such as neural network architecture, batch size, etc. Learning is conducted on the WLTP highway-urban combined driving cycle, and testing is conducted on the UDDS urban driving cycle and the HWFET highway driving cycle. Fuel economy, computation cost, and convergence are discussed for continuous and discrete action space algorithms. The best DRL algorithms of each action space type are then compared with baseline EMSs. The contribution of this work is summarized as follows:
• Unlike conventional vehicle models built in Matlab/Simulink, an OPEN-AI Gym like HEV propulsion model is built in Python. Following the format of OPEN-AI Gym, this vehicle environment can be directly connected to all popular DRL frameworks. The entire code sets utilized in this benchmark generation are made available on GitHub (https://github.com/LittleWebCat/DRL-Base-EMS) so that this benchmark can be utilized in newly developed DRL algorithm evaluation in the HEV EMS field.
• 13 popular DRL algorithms are introduced with architecture diagrams, and both discrete and continuous action space algorithms are considered.
• Key measures, including cumulative reward, convergence reward, convergence fuel economy, convergence episode, and training time, are compared.
The remainder of this paper is organized as follows. Section 2 presents the model of the vehicle propulsion system. Each DRL algorithm is introduced in Section 3 with clear background information and equations. Section 4 is divided into two parts: one is the simulation setup and the other is a comprehensive analysis of the results of the learning process. Subsection 4.1 shows how the simulation is carried out. Subsection 4.2 analyzes the results of different DRL algorithms under continuous and discrete action spaces, along with comparisons between different DRL algorithms. Some key parameter metrics are compared as well. Section 5 discusses the results and observations from the whole experiment. Finally, Section 6 presents the conclusion and future work.
2. Environment Modeling

In this modeling section, the vehicle propulsion system model, based on a Toyota Highlander parallel hybrid vehicle, is introduced. The model is from the FASTsim software tool developed by the National Renewable Energy Laboratory (NREL) in the Python language [27]. The vehicle propulsion system architecture is given in Fig. 1. As shown in the figure, both the engine and the electric motor (EM) supply power to the front wheel. The key specifications of the vehicle are given in Table 1. The power is mainly provided by a 3.6 L engine, whereas a small battery pack is utilized in regenerative braking and power assisting. Vehicle dynamics, battery, engine, and electric motor models are given in this section. Model validation can be found in the literature [28].

Fig. 1. Vehicle propulsion system architecture.
Table 1 2016 Toyota Highlander Hybrid Vehicle specification.
Parameters | Value
Vehicle overall weight | 2403 kg
Internal combustion engine max power | 172 kW
Electric motor max power | 123 kW
Battery capacity | 2 kWh
Aerodynamic drag coefficient | 0.39
Vehicle front projection area | 3.33 m²
Tire radius | 0.336 m
Tire rolling resistance coefficient | 0.7
Wheel inertia | 0.815 kg·m²
2.1 Vehicle Dynamics

The vehicle dynamics model aims to integrate all the power sources and convert them to vehicle acceleration. As shown in Eq. (1), the vehicle overall power demand is calculated based on all the power applied to the vehicle, including the aerodynamic drag power $P_{aero}$, the wheel rolling resistance power $P_{roll}$, the ascending altitude power $P_{ascen}$, the vehicle inertia power $P_{inert}$, and the vehicle acceleration power $P_{acc}$. In Eq. (2), $\rho_{air}$ is the air density, $C_d$ is the air drag coefficient, $A_{front}$ is the vehicle frontal projection area, and $v$ is the vehicle velocity. In Eq. (3), $C_r$ is the rolling resistance coefficient, $m$ is the vehicle overall weight, $g$ is the gravity constant, and $\theta$ is the road slope. In Eq. (5), $\Delta t$ is the simulation time step, $I_{wheel}$ is the one-wheel inertia, $n_{wheel}$ is the number of wheels, $r_{wheel}$ is the wheel radius, $v_{i+1}$ is the vehicle velocity at the next time step, and $v_i$ is the vehicle velocity at the current time step.
$$P_{dmd} = P_{aero} + P_{roll} + P_{ascen} + P_{inert} + P_{acc} \qquad (1)$$

$$P_{aero} = \frac{1}{2}\rho_{air} C_d A_{front} v^3 \qquad (2)$$

$$P_{roll} = C_r v m g \cos\theta \qquad (3)$$

$$P_{ascen} = v m g \sin\theta \qquad (4)$$

$$P_{inert} = \frac{I_{wheel} n_{wheel}}{2\Delta t}\left(\frac{v_{i+1}}{r_{wheel}}\right)^2 - \frac{I_{wheel} n_{wheel}}{2\Delta t}\left(\frac{v_i}{r_{wheel}}\right)^2 \qquad (5)$$

$$P_{acc} = m\dot{v} \qquad (6)$$
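A minimal Python sketch of Eqs. (1)-(6) follows, to make the road-load bookkeeping concrete; the default parameter values are placeholders (not the FASTsim calibration), and the acceleration term is evaluated as m·v·dv/dt here so that all terms carry power units.

```python
import math

def power_demand(v_i, v_next, dt=1.0, theta=0.0,
                 m=2403.0, g=9.81, rho_air=1.2, c_d=0.39, a_front=3.33,
                 c_r=0.007, i_wheel=0.815, n_wheel=4, r_wheel=0.336):
    """Vehicle overall power demand, Eq. (1), as the sum of Eqs. (2)-(6) in W."""
    v = v_i
    p_aero = 0.5 * rho_air * c_d * a_front * v ** 3           # Eq. (2)
    p_roll = c_r * v * m * g * math.cos(theta)                 # Eq. (3)
    p_ascen = v * m * g * math.sin(theta)                      # Eq. (4)
    p_inert = (i_wheel * n_wheel / (2 * dt)) * (               # Eq. (5)
        (v_next / r_wheel) ** 2 - (v / r_wheel) ** 2)
    p_acc = m * v * (v_next - v) / dt                          # Eq. (6), as m*v*dv/dt
    return p_aero + p_roll + p_ascen + p_inert + p_acc

# One second of a driving cycle: accelerate from 10 m/s to 11 m/s on a flat road.
print(power_demand(10.0, 11.0))
```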
2.2 Battery

The battery power demand is the sum of the EM power and the vehicle auxiliary power, as shown in Eq. (7). The actual battery power $P_{bat}$ is different from the battery power demand due to the energy loss in the internal resistance. The energy loss is integrated into the battery efficiency $\eta_{bat}$. As shown in Eq. (8), the battery efficiency is applied differently for battery charging and discharging situations. The actual battery power is connected to the battery SOC via the battery remaining capacity $C_{remain}$, as shown in Eqs. (9)-(10). $C_{norm}$ is the battery nominal capacity.

$$P_{bat,dmd} = P_{EM,ele} + P_{aux} \qquad (7)$$

$$P_{bat} = \begin{cases} -P_{bat,dmd}\,\eta_{bat}, & P_{bat} < 0 \\ \dfrac{P_{bat,dmd}}{\eta_{bat}}, & P_{bat} \ge 0 \end{cases} \qquad (8)$$

$$\dot{C}_{remain} = P_{bat} \qquad (9)$$

$$SOC_{bat} = \frac{C_{remain}}{C_{norm}} \qquad (10)$$
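The following sketch steps Eqs. (7)-(10) forward by one time step; it is our illustration with a simplified sign convention (positive battery power discharges the pack), not code from FASTsim or the released benchmark.

```python
def battery_step(p_em_ele, p_aux, soc, dt=1.0, eta_bat=0.95, cap_kwh=2.0):
    """Advance the battery SOC by one time step following Eqs. (7)-(10).

    p_em_ele and p_aux are in kW, dt in seconds, cap_kwh is the nominal capacity.
    """
    p_bat_dmd = p_em_ele + p_aux                 # Eq. (7)
    if p_bat_dmd < 0:
        # Charging: losses scale the power reaching the pack
        # (sign handling simplified relative to Eq. (8)).
        p_bat = p_bat_dmd * eta_bat
    else:
        # Discharging: the pack must supply more than the demand.
        p_bat = p_bat_dmd / eta_bat
    cap_norm_kws = cap_kwh * 3600.0              # kWh -> kW*s
    soc_next = soc - p_bat * dt / cap_norm_kws   # Eqs. (9)-(10), assumed sign
    return p_bat, soc_next

print(battery_step(p_em_ele=20.0, p_aux=0.5, soc=0.65))
```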
2.3 Internal Combustion Engine and Electric Motor

The EM power demand $P_{EM,dmd}$ is calculated based on the sign of the vehicle power demand and the battery SOC level, as shown in Eq. (11). When the vehicle power demand is negative (i.e., braking), all the power goes to the battery via the EM regenerative braking function. The vehicle power is discounted by the transmission efficiency $\eta_{trans}$. When the battery SOC is below the reference SOC, the EM power demand is reduced to only match the auxiliary power to reduce the SOC drop. In the remaining situation, the EM power demand is determined by the DRL action $a$. After the calculation of the EM power demand, the ICE power demand compensates the difference between the vehicle overall power and the EM power demand, as shown in Eq. (12). The actual EM and ICE power can be calculated using the power demands and efficiencies, as shown in Eqs. (13)-(14). The ICE efficiency is a function of power, as shown in Fig. 2(a), and it reaches a peak efficiency of 36% at the 35 kW power level. Like the ICE, the EM efficiency is a function of the EM power. As shown in Fig. 2(b), the EM efficiency peaks at 95% at a power level of 50 kW. Efficiency is usually modeled as a function of speed and torque. However, it is modeled here as a function of power to reduce the computation cost of the model, as described by the FASTsim development team [27], which means the speed and torque details are not provided by the model and only power values are provided for the ICE and EM.

$$P_{EM,dmd} = \begin{cases} P_{dmd}\,\eta_{trans}, & P_{dmd} < 0 \\ P_{EM,max}(2a-1), & P_{dmd} > 0,\ SOC > SOC_{ref} \\ P_{aux}, & P_{dmd} > 0,\ SOC \le SOC_{ref} \end{cases} \qquad (11)$$

$$P_{ICE,dmd} = \begin{cases} 0, & P_{EM,dmd} < 0 \\ \dfrac{P_{dmd}}{\eta_{trans}} - P_{EM,dmd}, & P_{EM,dmd} \ge 0 \end{cases} \qquad (12)$$

$$P_{EM,ele} = \begin{cases} P_{EM,dmd}\,\eta_{EM,chg}, & P_{EM,dmd} < 0 \\ \dfrac{P_{EM,dmd}}{\eta_{EM,dischg}}, & P_{EM,dmd} \ge 0 \end{cases} \qquad (13)$$

$$P_{ICE} = \frac{P_{ICE,dmd}}{\eta_{ICE}} \qquad (14)$$
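To make the control logic of Eqs. (11)-(12) explicit, a small sketch is given below; the function signature and the default efficiency, auxiliary-load, and SOC-reference values are illustrative assumptions.

```python
def split_power(p_dmd, action, soc, soc_ref=0.6, p_aux=0.5,
                p_em_max=123.0, eta_trans=0.97):
    """Return (EM power demand, ICE power demand) in kW from the DRL action.

    action lies in [0, 1]; Eq. (11) maps it to EM power, and Eq. (12) gives the
    ICE the remainder of the vehicle power demand.
    """
    if p_dmd < 0:                        # braking: all power goes to regeneration
        p_em_dmd = p_dmd * eta_trans
    elif soc > soc_ref:                  # normal operation: action sets EM power
        p_em_dmd = p_em_max * (2.0 * action - 1.0)
    else:                                # low SOC: EM only covers the auxiliary load
        p_em_dmd = p_aux
    if p_em_dmd < 0:
        p_ice_dmd = 0.0                  # Eq. (12): no engine power while braking
    else:
        p_ice_dmd = p_dmd / eta_trans - p_em_dmd
    return p_em_dmd, p_ice_dmd

print(split_power(p_dmd=40.0, action=0.55, soc=0.65))
```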
[Fig. 2(a): internal combustion engine efficiency [%] versus engine power [kW]; Fig. 2(b): electric motor efficiency [%] versus electric motor power [kW].]
Fig. 2 Internal combustion engine and electric motor efficiency curve.
2.4 Energy management

The reward of the DRL algorithms is defined in Eq. (15). It accounts for the energy consumption of the battery and the ICE. A negative sign is added in Eq. (15) to convert the minimization problem to a maximization problem, as we call the term a reward rather than a loss or cost function. The energy consumption is calculated based on power and the simulation time step, as shown in Eqs. (15)-(17).

$$r = -\frac{E_{bat} + E_{ICE}}{E_{norm}} \qquad (15)$$

$$E_{bat} = P_{bat}\,\Delta t \qquad (16)$$

$$E_{ICE} = P_{ICE}\,\Delta t \qquad (17)$$

Action $a$ is applied in Eq. (11), and it is in the range of [0, 1]. It determines the EM power demand, and thus the ICE power demand is the difference between the overall power demand and the EM power demand. For continuous DRL algorithms, the action can be any value in that range, whereas it can only take a specific number of values in that range for discrete DRL algorithms, depending on the action discretization level.

The state vector of the DRL algorithms includes the vehicle power demand $P_{req}$, the vehicle speed $v$, and the battery SOC. The ranges of the three states are [-30 kW, 30 kW], [0 m/s, 30 m/s], and [40%, 90%], respectively. The vehicle power demand provides information on the vehicle overall power level. The vehicle speed is important as it provides hidden information inside the vehicle power demand: given a fixed vehicle power demand, it could be the result of low speed and high torque or high speed and low torque. Therefore, the vehicle speed provides important information about the vehicle operating status. In addition, the battery SOC is the most important variable indicating the battery remaining energy status.
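To show how the state, action, and reward defined in this section fit the OPEN-AI Gym reset/step convention, a stripped-down environment skeleton is sketched below. The class body, the placeholder road-load model, and all constants are our simplification for illustration only; they are not the released DRL-Base-EMS code.

```python
import numpy as np

class HevEnvSketch:
    """Gym-style HEV energy-management environment (reset/step convention).

    State: [power demand (kW), vehicle speed (m/s), battery SOC].
    Action: scalar in [0, 1] controlling the EM/ICE power split (Eq. (11)).
    Reward: negative normalized energy use of battery plus engine (Eq. (15)).
    """

    def __init__(self, speed_trace, dt=1.0, e_norm=1000.0):
        self.speed = np.asarray(speed_trace, dtype=float)  # driving cycle [m/s]
        self.dt = dt
        self.e_norm = e_norm          # normalization constant in Eq. (15)
        self.t = 0
        self.soc = 0.65

    def reset(self):
        self.t = 0
        self.soc = 0.65
        return self._state()

    def step(self, action):
        p_dmd = self._power_demand()               # road-load placeholder
        p_em, p_ice = self._split(p_dmd, action)   # Eqs. (11)-(12), simplified
        e_bat = p_em * self.dt                     # Eq. (16)
        e_ice = p_ice * self.dt                    # Eq. (17)
        self._update_soc(e_bat)
        reward = -(e_bat + e_ice) / self.e_norm    # Eq. (15)
        self.t += 1
        done = self.t >= len(self.speed) - 1
        return self._state(), reward, done, {}

    def _state(self):
        return np.array([self._power_demand(), self.speed[self.t], self.soc])

    def _power_demand(self):
        # Placeholder road-load model: quadratic drag plus an inertia term.
        v = self.speed[self.t]
        v_next = self.speed[min(self.t + 1, len(self.speed) - 1)]
        return 0.05 * v ** 2 + 2.4 * v * (v_next - v) / self.dt

    def _split(self, p_dmd, action):
        if p_dmd < 0:
            return p_dmd, 0.0
        p_em = 123.0 * (2.0 * float(action) - 1.0) if self.soc > 0.4 else 0.5
        return p_em, max(p_dmd - p_em, 0.0)

    def _update_soc(self, e_bat_kws):
        self.soc = float(np.clip(self.soc - e_bat_kws / (2.0 * 3600.0), 0.0, 1.0))

env = HevEnvSketch(speed_trace=np.linspace(0.0, 15.0, 60))
obs = env.reset()
obs, r, done, info = env.step(0.55)
```

Because the environment follows the reset/step calling convention, any DRL library that consumes Gym-style environments can interact with it without further adaptation.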
3. Deep Reinforcement Learning algorithms

In this section, 13 DRL algorithms are introduced for the HEV EMS application. Only key equations are presented due to the limited space in this paper, and more details can be found in the respective references cited in each subsection.

3.1 Deep Q-networks (DQN)

A deep Q network (DQN) is a neural network with multiple layers which takes a given state $s$ as input and outputs a vector of action values $Q(s, a; \theta)$, where $\theta$ stands for the network's parameters. The DQN-based flow chart (Fig. 3) shows the basic structure of the DQN-based DRL algorithms, which contains a replay buffer and two networks: the evaluate network and the target network. The agent interacts with the HEV environment and stores the transitions $T_t = (s_t, a_t, r_t, s_{t+1})$ in the replay buffer $B_t = \{T_1, T_2, \dots, T_t\}$. The replay buffer is sampled in random mini-batches: the evaluate network grabs a mini-batch from the replay buffer to calculate the state-action value $Q(s, a; \theta_i)$, and the target network uses the mini-batch data to generate the target Q value $y_i^{DQN}$. Given the two outputs from the neural networks, the loss function is designed to update the neural network. The loss function calculated and optimized at each iteration $i$ is given by

$$L_i(\theta_i) = \mathbb{E}_{(s,a,r,s')\sim \mathcal{U}(D)}\left[\left(y_i^{DQN} - Q(s,a;\theta_i)\right)^2\right], \qquad (18)$$

with

$$y_i^{DQN} = r + \gamma \max_{a'} Q(s', a'; \theta^-), \qquad (19)$$

where $\theta^-$ represents the parameters of a fixed and separate target network [29]. The target network's parameters are held fixed for several iterations and are then updated using the evaluate network's parameters [30].

Experience replay improves the data efficiency by reusing the samples in multiple updates. In addition, it reduces variance, as uniform sampling from the replay buffer reduces the correlation among the samples used in the update. Experience replay also has many updated versions, including Prioritized Experience Replay [31], as shown in Fig. 3, and Hindsight Experience Replay [32]. These updated versions are applied in different situations and perform better than the original experience replay.

Fig. 3 Architecture of DQN-based Algorithm.
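A compact PyTorch sketch of Eqs. (18)-(19) is given below to show how the target network enters the loss; the network size, batch contents, and hyperparameters are arbitrary and do not reproduce the training setup of Section 4.

```python
import torch
import torch.nn as nn

def mlp(in_dim, n_actions):
    # Small fully connected Q-network: state in, one Q-value per discrete action out.
    return nn.Sequential(nn.Linear(in_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))

q_net, target_net = mlp(3, 5), mlp(3, 5)
target_net.load_state_dict(q_net.state_dict())   # theta_minus starts as a copy

def dqn_loss(batch, gamma=0.99):
    s, a, r, s_next = batch                       # tensors sampled from the replay buffer
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)           # Q(s, a; theta)
    with torch.no_grad():                                          # Eq. (19)
        y = r + gamma * target_net(s_next).max(dim=1).values
    return nn.functional.mse_loss(q_sa, y)                         # Eq. (18)

# Toy batch of two transitions (state, action index, reward, next state).
batch = (torch.randn(2, 3), torch.tensor([1, 4]), torch.tensor([0.1, -0.2]),
         torch.randn(2, 3))
loss = dqn_loss(batch)
loss.backward()
```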
3.2 Double DQN (D2QN)

In order to solve the problem in DQN that using the same values both to select and to evaluate an action generates an overestimate [33], Double Deep Q-networks (DDQN) uses the action chosen by the evaluate network as the input of the target network; the target Q value is given by

$$y_i^{DDQN} = r + \gamma\, Q\!\left(s', \arg\max_{a'} Q(s', a'; \theta_i); \theta^-\right). \qquad (20)$$

The difference between DDQN and DQN is the target value, $y_i^{DDQN}$ versus $y_i^{DQN}$ [30]. As shown in Fig. 3 and Fig. 4, the different targets lead to different loss functions.
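Relative to the DQN target above, only the action-selection step changes; a hedged sketch (reusing the hypothetical q_net and target_net of the previous snippet as arguments) is:

```python
import torch

def ddqn_target(r, s_next, q_net, target_net, gamma=0.99):
    """Eq. (20): the online network picks the action, the target network scores it."""
    with torch.no_grad():
        a_star = q_net(s_next).argmax(dim=1, keepdim=True)        # argmax_a' Q(s', a'; theta_i)
        q_eval = target_net(s_next).gather(1, a_star).squeeze(1)  # Q(s', a*; theta_minus)
    return r + gamma * q_eval
```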
3.3 Dueling Double DQN (D3QN)

The dueling network has a different structure from a standard Q-network: it contains a state-value stream and an advantage stream that are then merged into the output [34]. The state value stands for the value of the state regardless of the action, and the advantage value shows the advantage of a specific action over the other actions given a specific state. The action value is given by

$$Q(s, a; \varphi) = V(s; \sigma) + \left(A(s, a; \beta) - \frac{1}{N}\sum_{a'} A(s, a'; \beta)\right), \qquad (21)$$

where $\sigma$ and $\beta$ stand for the neural network parameters, $\varphi$ stands for $\{\sigma, \beta\}$, and $N$ is the number of actions [34].
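A minimal dueling head corresponding to Eq. (21) could look as follows; the layer widths and dimensions are arbitrary choices for illustration.

```python
import torch
import torch.nn as nn

class DuelingQNet(nn.Module):
    """Eq. (21): Q(s,a) = V(s) + A(s,a) - mean_a' A(s,a')."""

    def __init__(self, state_dim=3, n_actions=5, hidden=64):
        super().__init__()
        self.feature = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU())
        self.value = nn.Linear(hidden, 1)              # V(s; sigma)
        self.advantage = nn.Linear(hidden, n_actions)  # A(s, a; beta)

    def forward(self, s):
        h = self.feature(s)
        v, adv = self.value(h), self.advantage(h)
        return v + adv - adv.mean(dim=1, keepdim=True)

q = DuelingQNet()(torch.randn(2, 3))   # shape (2, 5): one Q-value per action
```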
3.4 Distributional RL (C51)

Distributional RL is based on DQN but uses a value distribution instead of the value function. The agent interacts with the HEV environment and stores trajectories into the experience replay buffer, and trajectories $(s, a, r, s')$ are then sampled from the experience replay buffer. The Bellman equation in C51 is expressed by

$$\mathcal{T}^{\pi} Z(s, a) := R(s, a) + \gamma P^{\pi} Z(s, a), \qquad (22)$$

where $Z$ is the value distribution, $\mathcal{T}$ is the Bellman optimality operator, and $P^{\pi}$ is an operator from $Z$ to $Z$, $P^{\pi} Z(s, a) := Z(s', a')$. Based on the Bellman equation, the Q value of C51 is expressed by

$$Q(s', a) := \sum_i z_i\, p_i(s', a \mid \theta), \qquad (23)$$

where $p_i(s', a)$ is the probability mass on each atom and $z$ is a vector with $N$ atoms, given as

$$z_i = V_{min} + i\,\frac{V_{max} - V_{min}}{N - 1} \quad \text{for } i \in \{0, \dots, N-1\}, \qquad (24)$$

where $V_{max}$ and $V_{min}$ are fixed numbers based on the real task. The aim is to update $\theta$ to bring the distribution close to the real return's distribution. Using KL divergence, the loss is minimized by gradient descent on the cross-entropy term of

$$D_{KL}\left(\Phi_z Z(s, a^*)\,\|\, Z(s, a^*)\right), \qquad (25)$$

where $\Phi_z$ is the projection of the target distribution onto $z$ and $a^* \leftarrow \arg\max_a Q(s', a)$ [35].
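The fixed support of Eq. (24) and the Q-value of Eq. (23) reduce to a few lines; the atom count and value bounds below are illustrative, not the settings used in the benchmark.

```python
import torch

N_ATOMS, V_MIN, V_MAX = 51, -10.0, 10.0
# Eq. (24): fixed atom locations z_i spanning [V_min, V_max].
z = V_MIN + torch.arange(N_ATOMS) * (V_MAX - V_MIN) / (N_ATOMS - 1)

def c51_q_values(probs):
    """Eq. (23): Q(s,a) = sum_i z_i * p_i(s,a) for each action.

    probs: tensor of shape (batch, n_actions, N_ATOMS), softmax over atoms.
    """
    return (probs * z).sum(dim=-1)

probs = torch.softmax(torch.randn(2, 5, N_ATOMS), dim=-1)
print(c51_q_values(probs).shape)   # (2, 5)
```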
3.5 Rainbow DQN

Rainbow is based on DQN and has six extensions to address its limitations and improve the overall performance [36]. The six extensions are double Q-learning, prioritized replay, multi-step learning, the dueling network, distributional RL, and the noisy net. The flow chart is shown in Fig. 4. As shown in the figure, the Rainbow agent uses a prioritized replay buffer to store the transition experience from the HEV environment instead of the basic experience replay buffer and samples the mini-batch for the evaluate neural network. Rainbow uses a dueling network as the basic architecture and uses the double-Q network structure and multi-step targets to calculate the multi-step target Q value $y_i^{DDQN}$. Of the six extensions, double Q-learning is described in section 3.2, distributional RL in section 3.4, and the dueling network in section 3.3. The remaining three extensions are explained as follows:

Prioritized Replay: One reason why DDQN has better performance than DQN is prioritized experience replay (PER) [31]. The main contribution of PER is increasing the sampling probability of high-expectation experiences as weighted by the TD-error. This contribution can reduce training time and improve the accuracy of the final results.

Multi-step learning: As shown in Fig. 4, Rainbow uses multi-step targets instead of the original single-step target. Instead of just using one step to accumulate the reward and the next step to bootstrap, multi-step learning uses the next n steps as targets [37]. The n-step return is given by

$$R_t^{(n)} \equiv \sum_{k=0}^{n-1} \gamma_t^{(k)} R_{t+k+1}. \qquad (26)$$

Using the multi-step target, the loss function of Rainbow is given by

$$\left(R_t^{(n)} + \gamma_t^{(n)} Q_{\theta'}\!\left(S_{t+n}, \arg\max_{a'} Q_{\theta}(S_{t+n}, a')\right) - Q_{\theta}(S_t, A_t)\right)^2. \qquad (27)$$

The learning speed can be regulated by tuning the parameter n in multi-step learning [38].

Noisy Net: Due to the limitation of using $\epsilon$-greedy in game learning, the noisy net was developed by combining different types of noisy streams in a linear layer [39], given as

$$y = (a + bx) + \left(a_{noisy} \odot \epsilon^a + (b_{noisy} \odot \epsilon^b)\, x\right), \qquad (28)$$

where $\epsilon^a$ and $\epsilon^b$ are random variables, and $\odot$ denotes the element-wise product.

Fig. 4 Architecture of Rainbow Algorithm.
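The n-step return of Eq. (26) is straightforward to compute from a slice of the replay buffer; a small sketch follows, with the discount factor and the example rewards chosen arbitrarily.

```python
def n_step_return(rewards, gamma=0.99):
    """Eq. (26): discounted sum of the next n rewards, R_t^(n)."""
    total = 0.0
    for k, r in enumerate(rewards):     # rewards = [R_{t+1}, ..., R_{t+n}]
        total += (gamma ** k) * r
    return total

print(n_step_return([-0.4, -0.3, -0.5]))   # 3-step return
```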
40 3.6 Quantile Regression for Distribution RL (QR-DQN)
41
42
C51 uses N fixed atom locations and estimates their probabilities; QR-DQN does the opposite, fixing the probabilities and estimating the locations. As a result, QR-DQN does not restrict or bound the value range and does not need the projection operation [40]. QR-DQN also minimizes the Wasserstein metric, which gives a more precise loss. In QR-DQN, the quantile distribution is expressed as

Z_\theta(x, a) := \frac{1}{N} \sum_{i=1}^{N} \delta_{\theta_i(x, a)},  (29)

where Z_\theta is the quantile distribution, \theta_i are the locations of the uniform-probability supports, and \delta is the Dirac delta. The Q value is obtained similarly to C51,

Q(s', a) := \sum_j q_j \, \theta_j(s', a),  (30)

where q_j are uniform weights with q_j = 1/N. QR-DQN uses the quantile Huber loss

\frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{N} \Big[\rho_{\hat{\tau}_i}^{\kappa}\big(r + \gamma \theta_j(s', a^*) - \theta_i(s, a)\big)\Big],  (31)

where \rho is the Huber loss and a^* \leftarrow \arg\max_a Q(s', a).
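As a sketch of Eq. (31), the quantile Huber loss for a single (s, a) pair can be written as below; the quantile midpoints and the threshold kappa = 1 follow the common convention in [40] and are assumptions rather than the exact settings of this study.

import torch
import torch.nn.functional as F

def quantile_huber_loss(pred: torch.Tensor, target: torch.Tensor, kappa: float = 1.0) -> torch.Tensor:
    """Eq. (31) for one (s, a) pair.

    pred:   (N,) quantile estimates theta_i(s, a)
    target: (N,) Bellman targets r + gamma * theta_j(s', a*), treated as constants
    """
    n = pred.shape[0]
    tau = (torch.arange(n, dtype=pred.dtype) + 0.5) / n           # quantile midpoints
    u = target.unsqueeze(0) - pred.unsqueeze(1)                   # pairwise TD errors, shape (N, N)
    huber = F.huber_loss(pred.unsqueeze(1).expand_as(u), target.unsqueeze(0).expand_as(u),
                         reduction="none", delta=kappa)
    weight = (tau.unsqueeze(1) - (u.detach() < 0).float()).abs()  # |tau_i - 1{u < 0}|
    return (weight * huber).mean(dim=1).sum()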
16
17
18 3.7 Implicit quantile value network (IQN)
19
20
The difference between QR-DQN and IQN is that IQN adopts a distortion risk measure and takes a sample \tau \in [0,1] together with the state s as input of the neural network, producing a return distribution as output [41]. The Q value is then given by

Q_\beta(s, a) := \frac{1}{N} \sum_{n=1}^{N} Z_{\tau_n}(s, a),  (32)

where \beta: [0,1] \rightarrow [0,1] is the distortion risk measure and Z_\tau is the quantile value, Z_\tau := F_Z^{-1}(\tau), with F_Z^{-1} the quantile function of the return Z evaluated at the base-distribution sample \tau \in [0,1]. To calculate the loss, the network takes two sets of base-distribution samples \tau_i, \tau_j \sim U([0,1]) as input and generates the distributions Z_{\tau_i}(s, a) and r + \gamma Z_{\tau_j}(s', a^*). The temporal difference is given as

\delta_t^{\tau_i, \tau_j} = r + \gamma Z_{\tau_j}(s', a^*) - Z_{\tau_i}(s, a),  (33)

where a^* is the greedy action, a^* \leftarrow \arg\max_{a'} \frac{1}{N} \sum_{n} Z_{\tau_n}(s', a'). The loss function is given as

\frac{1}{N'} \sum_{i=1}^{N} \sum_{j=1}^{N'} \rho_{\tau_i}^{\kappa}\big(\delta_t^{\tau_i, \tau_j}\big),  (34)

where N and N' are the numbers of samples.
51
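A minimal sketch of the Q-value estimate in Eq. (32) is shown below; the network z_net, its interface, and the number of sampled fractions are assumed placeholders (the cosine embedding of tau used in [41] is not shown).

import torch

def iqn_q_values(z_net, state: torch.Tensor, n_tau: int = 32) -> torch.Tensor:
    """Eq. (32): approximate Q(s, a) by averaging quantile values over sampled tau ~ U[0, 1].

    z_net(state, tau) is assumed to return a tensor of shape (n_tau, n_actions).
    """
    tau = torch.rand(n_tau)            # base-distribution samples
    z_values = z_net(state, tau)       # Z_tau(s, a) for every sampled tau
    return z_values.mean(dim=0)        # average over the tau dimension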
52
53
54 3.8 Fully Parameterized Quantile Function (FQF)
55
56
In IQN the quantile fractions are sampled and in QR-DQN they are fixed, which limits these algorithms in practice. In FQF, a fraction proposal network generates quantile fractions for each (s, a), and a quantile value network maps each fraction to its quantile value [42]. This self-adjusting scheme approximates the true distribution better than IQN and QR-DQN. The quantile expression is given as

Z(s, a) := \sum_{i=0}^{N-1} (\tau_{i+1} - \tau_i) \, \delta_{\theta_i(s, a)},  (35)

where \delta is the Dirac delta and \tau_1, \dots, \tau_{N-1} are N-1 adjustable fractions with \tau_0 = 0 and \tau_N = 1. The quantile function F_Z^{-1} is the inverse of the cumulative distribution function F_Z, given as F_Z^{-1}(p) := \inf\{z \in \mathbb{R}: p \le F_Z(z)\}. The 1-Wasserstein distortion between the approximated and the true quantile function is given as

W_1(Z, \tau, \theta) = \sum_{i=0}^{N-1} \int_{\tau_i}^{\tau_{i+1}} \big|F_{Z,\omega_1}^{-1}(\varphi) - \theta_i\big| \, d\varphi,  (36)

where \omega_1 is the parameter of the fraction proposal network. For a given \tau, the distortion is minimized by \theta_i = F_Z^{-1}\big((\tau_i + \tau_{i+1})/2\big); substituting this back, the 1-Wasserstein loss becomes

W_1(Z, \tau) = \sum_{i=0}^{N-1} \int_{\tau_i}^{\tau_{i+1}} \Big|F_{Z,\omega_1}^{-1}(\varphi) - F_{Z,\omega_1}^{-1}\Big(\frac{\tau_i + \tau_{i+1}}{2}\Big)\Big| \, d\varphi,  (37)

which is minimized by gradient descent to obtain the optimal \tau. The Q value of FQF is given as

Q(s, a) := \sum_{i=0}^{N-1} (\tau_{i+1} - \tau_i) \, F_{Z,\omega_2}^{-1}\Big(\frac{\tau_i + \tau_{i+1}}{2}\Big),  (38)

where \omega_2 is the parameter of the quantile value network. The TD error between two quantile fractions is given as

\delta_t^{\tau_i, \tau_j} = r + \gamma F_{Z',\omega_2}^{-1}\Big(\frac{\tau_j + \tau_{j+1}}{2}\Big) - F_{Z,\omega_2}^{-1}\Big(\frac{\tau_i + \tau_{i+1}}{2}\Big).  (39)

The action is selected by the greedy rule a^* \leftarrow \arg\max_{a'} Q(s', a'), and the loss function is given as

\frac{1}{N'} \sum_{i=1}^{N} \sum_{j=1}^{N'} \rho_{\tau_i}^{\kappa}\big(\delta_t^{\tau_i, \tau_j}\big),  (40)

where \rho is the Huber loss and N and N' are the numbers of samples.
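For concreteness, the sketch below evaluates Eq. (38) for a given set of fractions; the quantile function passed in stands in for the learned quantile value network, and the toy example is not drawn from this study.

import numpy as np

def fqf_q_value(taus: np.ndarray, quantile_fn) -> float:
    """Eq. (38): Q = sum_i (tau_{i+1} - tau_i) * F_Z^-1((tau_i + tau_{i+1}) / 2).

    taus are increasing fractions with taus[0] = 0 and taus[-1] = 1 (the output of the
    fraction proposal network); quantile_fn stands in for the quantile value network.
    """
    midpoints = (taus[:-1] + taus[1:]) / 2.0
    return float(np.sum((taus[1:] - taus[:-1]) * quantile_fn(midpoints)))

# Toy check: for a return uniformly distributed on [0, 10], F_Z^-1(t) = 10 t, so Q ~ 5.
taus = np.linspace(0.0, 1.0, 9)
print(fqf_q_value(taus, lambda t: 10.0 * t))  # 5.0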
58
59
60
61
62 14
63
64
65
1
2 3.9 Deep Deterministic Policy Gradient (DDPG)
3
4
As shown in Fig. 5, DDPG contains two kinds of neural networks and an experience replay buffer. The two networks are the Actor network and the Critic network, and each contains two sub-networks, an online network and a target network [43]. The Actor network interacts with the HEV environment and stores the transitions (s_t, a_t, r_t, s_{t+1}) in the experience replay buffer. A mini batch (s_i, a_i, r_i, s_{i+1}) is randomly sampled from the buffer and fed into the Actor and Critic networks. The Critic target network calculates the expected target return y_i using the action \mu'(s_{i+1}) given by the Actor target network,

y_i = r_i + \gamma Q'\big(s_{i+1}, \mu'(s_{i+1})\big),  (41)

where \gamma is the discount factor, Q' is the Critic target network, and \mu' is the Actor target network. With the target Q value, the Critic loss can be expressed as

L = \frac{1}{N} \sum_i \big(y_i - Q(s_i, a_i)\big)^2,  (42)

where N is the mini-batch size. With the help of the Critic network, the Actor policy is updated by the sampled policy gradient,

\nabla_{\theta^\mu} J = \frac{1}{N} \sum_i \big[\nabla_a Q(s_i, \mu(s_i)) \, \nabla_{\theta^\mu} \mu(s_i|\theta^\mu)\big],  (43)

where \theta^\mu denotes the parameters of the online Actor network. To improve learning stability, the target networks are updated softly and slowly with a small hyperparameter \varepsilon,

\theta' \leftarrow \varepsilon\theta + (1-\varepsilon)\theta',
\theta^{\mu'} \leftarrow \varepsilon\theta^\mu + (1-\varepsilon)\theta^{\mu'},  (44)

where \theta' and \theta are the parameters of the Critic target and online networks, and \theta^{\mu'} is the parameter set of the Actor target network. As an off-policy algorithm that combines target networks, an experience replay buffer, and the actor-critic structure, DDPG improves sample efficiency; however, some limitations remain to be addressed.
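The update loop described by Eqs. (41)-(44) can be sketched in PyTorch as follows; the network modules, optimizer objects, and batch format are assumed placeholders rather than the exact implementation used in this paper.

import torch
import torch.nn.functional as F

def ddpg_update(batch, actor, critic, actor_targ, critic_targ,
                actor_opt, critic_opt, gamma=0.99, eps=0.005):
    """One DDPG step following Eqs. (41)-(44)."""
    s, a, r, s2 = batch
    with torch.no_grad():
        y = r + gamma * critic_targ(s2, actor_targ(s2))           # target return, Eq. (41)
    critic_loss = F.mse_loss(critic(s, a), y)                     # Critic loss, Eq. (42)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    actor_loss = -critic(s, actor(s)).mean()                      # ascend Q along the policy, Eq. (43)
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()

    for net, targ in ((critic, critic_targ), (actor, actor_targ)):
        for p, p_targ in zip(net.parameters(), targ.parameters()):
            p_targ.data.mul_(1 - eps).add_(eps * p.data)          # soft update, Eq. (44)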
Fig. 5 Architecture of DDPG.
29
30
31 3.10 Twin-Delayed DDPG (TD3)
32
33
TD3 is built on DDPG to address its approximation error and improve stability [44]. It combines double Q-learning in continuous action spaces, the policy gradient, and the actor-critic structure. The structural difference between DDPG and TD3 is that TD3 has two Critic networks, i.e., two online networks (Q_1, Q_2) and two target networks (Q'_1, Q'_2). The two target estimates are

y_1 = r + \gamma Q'_1(s', \bar{a}),
y_2 = r + \gamma Q'_2(s', \bar{a}),  (45)

where \bar{a} is the smoothed target action defined in Eq. (47) and \mu' is the Actor target network. To mitigate overestimation, TD3 takes the minimum of the two estimates as the expected target value,

y = r + \gamma \min_{i=1,2} Q'_i(s', \bar{a}).  (46)

Another key improvement in TD3 is target policy smoothing, which acts as a regularizer against overfitting in the Q value computation by adding clipped noise \varepsilon to the target action,

\bar{a} = \mu'(s') + \varepsilon, \quad \varepsilon \sim \mathrm{clip}(\mathcal{N}(0, \sigma), -c, c).  (47)

Given the target value, the loss functions of the two Critic networks are

L_1 = \frac{1}{N} \sum_i \big(y_i - Q_1(s_i, a_i)\big)^2,
L_2 = \frac{1}{N} \sum_i \big(y_i - Q_2(s_i, a_i)\big)^2,  (48)

where L_1 and L_2 are the losses of Critic networks 1 and 2. After backpropagating the losses and updating the two Critic networks, the Actor network is updated by gradient ascent using the output of the first Critic network,

\nabla_{\theta^\mu} J = \frac{1}{N} \sum_i \big[\nabla_a Q_1(s_i, \mu(s_i)) \, \nabla_{\theta^\mu} \mu(s_i|\theta^\mu)\big],  (49)

where \theta^\mu denotes the parameters of the online Actor network. TD3 also uses the soft update

\theta'_i \leftarrow \varepsilon\theta_i + (1-\varepsilon)\theta'_i,
\theta^{\mu'} \leftarrow \varepsilon\theta^\mu + (1-\varepsilon)\theta^{\mu'},  (50)

where \theta'_i and \theta_i are the parameters of the Critic target and online networks, and \theta^{\mu'} is the parameter set of the Actor target network.
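A sketch of the clipped double-Q target with target-policy smoothing, Eqs. (45)-(47), is given below; the network modules, noise scale, and action bound are placeholder assumptions.

import torch

def td3_target(r, s2, actor_targ, q1_targ, q2_targ,
               gamma=0.99, sigma=0.2, noise_clip=0.5, act_limit=1.0):
    """Compute y = r + gamma * min(Q'_1, Q'_2)(s', a_bar) with smoothed target action."""
    with torch.no_grad():
        a_targ = actor_targ(s2)
        noise = (torch.randn_like(a_targ) * sigma).clamp(-noise_clip, noise_clip)
        a_bar = (a_targ + noise).clamp(-act_limit, act_limit)     # Eq. (47)
        q_min = torch.min(q1_targ(s2, a_bar), q2_targ(s2, a_bar)) # Eq. (46)
        return r + gamma * q_min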
34
35
36 3.11 Trust Region Policy Optimization (TRPO)
37
38
The main idea of TRPO is to update the policy by taking the largest possible step while constraining the new policy to stay within an allowed distance from the old policy [45]. This constraint is expressed by the KL-divergence. The agent interacts with the HEV environment to collect trajectories D = \{s_0, a_0, \dots, a_{T-1}, s_T\} using the policy network. Given the trajectories, the Critic network outputs the advantage value,

\hat{A}_t = -V(s_t) + r_t + \gamma r_{t+1} + \dots + \gamma^{T-t+1} r_{T-1} + \gamma^{T-t} V(s_T),  (51)

where t is the time index in [0, T] and V is the current value function. Based on the advantage value, the policy is updated by

\theta_k = \arg\max_\theta \hat{E}_t\Big[\frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{old}}(a_t|s_t)} \hat{A}_t\Big],  (52)

s.t. \hat{E}_t\big[D_{KL}\big(\pi_\theta(\cdot|s_t)\,\|\,\pi_{\theta_{old}}(\cdot|s_t)\big)\big] \le \delta,  (53)

where \pi_{\theta_{old}} is the old policy before the update and D_{KL} is the KL-divergence. Due to the complexity of the theoretical TRPO, approximations based on a second-order Taylor expansion are made for faster learning, where the loss and KL-divergence are given as

L(\theta) \approx g^T(\theta - \theta_k),
\bar{D}_{KL}(\theta\,\|\,\theta_k) \approx \frac{1}{2}(\theta - \theta_k)^T H (\theta - \theta_k), \quad \bar{D}_{KL} \le \delta,  (54)

where g is the policy gradient and H is the Hessian of the sample-average KL-divergence with respect to the parameters \theta. This quadratic problem can be solved by

\theta_{k+1} = \theta_k + \alpha^j \sqrt{\frac{2\delta}{g^T H^{-1} g}} \, H^{-1} g,  (55)

where \alpha is the backtracking coefficient and j is the smallest nonnegative integer such that the policy \pi_{\theta_{k+1}} satisfies the KL-divergence constraint. Because computing and storing H^{-1} is difficult, the conjugate gradient algorithm is used to solve x = H^{-1} g, and \theta_{k+1} is then expressed as

\theta_{k+1} = \theta_k + \alpha^j \sqrt{\frac{2\delta}{x_k^T H x_k}} \, x_k.  (56)

The Critic network is updated by minimizing the mean squared error,

\phi_{k+1} = \arg\min_\phi E\big[(V(s_t|\phi) - R_t)^2\big],  (57)

where \phi_{k+1} is the parameter set of the Critic network.
44
45
46
3.12 Proximal Policy Optimization (PPO)

PPO has the same aim as TRPO: to take the largest possible policy improvement step without going so far that performance collapses [46]. As shown in Fig. 6, the agent interacts with the HEV environment to collect trajectories D = \{s_0, a_0, \dots, a_{T-1}, s_T\} using the policy network. Given the trajectories, the Critic network outputs the advantage value,

\hat{A}_t = -V(s_t) + r_t + \gamma r_{t+1} + \dots + \gamma^{T-t+1} r_{T-1} + \gamma^{T-t} V(s_T),  (58)

where t is the time index in [0, T] and V is the current value function. Based on the advantage value, the policy loss is given as

L(\theta) = \hat{E}_t\Big[\min\Big(\frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{old}}(a_t|s_t)} \hat{A}_t,\; \mathrm{clip}\Big(\frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{old}}(a_t|s_t)}, 1-\epsilon, 1+\epsilon\Big)\hat{A}_t\Big)\Big],  (59)

where \pi_\theta and \pi_{\theta_{old}} are the new and old policies, and \epsilon is a hyperparameter that keeps the probability ratio within the interval [1-\epsilon, 1+\epsilon] to control how far the new policy can move from the old one. This provides a first-order alternative to trust-region optimization: the agent neither becomes too greedy toward positive-advantage actions nor discards negative-advantage actions too quickly. The policy is then updated by stochastic gradient ascent,

\theta_{k+1} = \arg\max_\theta E[L(\theta)].  (60)

The Critic network is updated by minimizing the mean squared error,

\phi_{k+1} = \arg\min_\phi E\big[(V(s_t|\phi) - R_t)^2\big],  (61)

where \phi_{k+1} is the parameter set of the Critic network.
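The clipped surrogate objective of Eq. (59) can be sketched as the following loss (negated for minimization); the clip range value is an assumed typical choice, not necessarily the one used in this study.

import torch

def ppo_policy_loss(logp_new, logp_old, adv, eps=0.2):
    """Negative of the clipped surrogate in Eq. (59), for per-sample log-probs and advantages."""
    ratio = torch.exp(logp_new - logp_old)                 # pi_theta / pi_theta_old
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps)
    return -torch.min(ratio * adv, clipped * adv).mean()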
Fig. 6 Architecture of PPO.
56
57
58
59
60
61
62 19
63
64
65
1
3.13 Soft Actor-Critic (SAC)

SAC is an off-policy algorithm that combines the actor-critic structure, the clipped double-Q method, and entropy regularization [47]. The agent interacts with the HEV environment to generate experience and stores it in a replay buffer, from which mini batches D = \{(s_i, a_i, r_i, s_{i+1})\} are randomly sampled. Entropy regularization maximizes a weighted sum of the expected return and the policy entropy, which measures the policy's randomness; the entropy term grows when more exploration is needed and prevents the policy from collapsing to a premature optimum at the beginning of learning. Based on the Bellman equation, the Q-function target is given as

y(s_i, a_i) = r_i + \gamma\Big(\min_{j=1,2} Q_{\mu,j}\big(s_{i+1}, \tilde{a}_{i+1}\big) - \alpha \log \pi(\tilde{a}_{i+1}|s_{i+1})\Big), \quad \tilde{a}_{i+1} \sim \pi(\cdot|s_{i+1}),  (62)

where \mu denotes the Q target network parameters and \pi is the policy. The Q functions are updated by gradient descent on

\nabla_{\phi_i} \frac{1}{|D|} \sum_{(s_i, a_i, r_i, s_{i+1}) \in D} \big(Q_{\phi_i}(s_i, a_i) - y(s_i, a_i)\big)^2, \quad i = 1, 2,  (63)

where \phi denotes the Q network parameters. The policy is updated by gradient ascent on

\nabla_\theta \frac{1}{|D|} \sum_{s_i \in D}\Big(\min_{j=1,2} Q_{\phi_j}\big(s_i, \tilde{a}_i\big) - \alpha \log \pi(\tilde{a}_i|s_i)\Big), \quad \tilde{a}_i \sim \pi(\cdot|s_i),  (64)

where \theta denotes the policy network parameters. The target networks are updated using the soft update

Q_{\mu,i} \leftarrow (1-\tau) Q_{\mu,i} + \tau Q_{\phi,i}.  (65)
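A sketch of the soft Bellman target in Eq. (62) is shown below; the policy interface and the temperature value are placeholder assumptions.

import torch

def sac_target(r, s2, policy, q1_targ, q2_targ, gamma=0.99, alpha=0.2):
    """Soft target of Eq. (62). policy(s) is assumed to return a sampled action and its log-probability."""
    with torch.no_grad():
        a2, logp_a2 = policy(s2)
        q_min = torch.min(q1_targ(s2, a2), q2_targ(s2, a2))
        return r + gamma * (q_min - alpha * logp_a2)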
41
42
43 4. Simulation and results
44
45
This section first presents the simulation setup, including the environment information, neural network settings, and learning parameters of the algorithms. The result comparison is then divided into three parts: the first compares different DRL algorithms in the continuous action space, the second compares DQN under different numbers of discrete actions, and the third compares different DRL algorithms in the discrete action space. In each part, the algorithms are trained in the same environment for the same number of episodes, and the training curves are shown. Key data recorded during training are listed in the tables, and the algorithms are compared in terms of computation time, reward, and fuel economy. Finally, the best DRL algorithm of each action space type is compared with the baseline EMSs.
60
61
62 20
63
64
65
1
2 4.1 Simulation setup
3
4
The vehicle environment for the simulation is an OpenAI Gym-like environment and shares its function structure with Gym environments. There are two versions of the vehicle environment that differ only in the action space: one is discrete and the other is continuous. Tianshou [48] is utilized for the DRL algorithms. To achieve a fair comparison, the same hyperparameters (buffer size 20000, hidden layers 128x128x128x128, batch size 64, learning rate 0.001, discount factor 0.9, etc.) are used for the neural network design. The computer used in this simulation has an NVIDIA GeForce RTX 2070 Max-Q GPU, an Intel Core i7-9750H CPU @ 2.60 GHz, and 16.0 GB RAM. The code of the paper is open-sourced at https://github.com/LittleWebCat/DRL-Base-EMS.
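The environment follows the standard Gym interaction pattern, which the minimal smoke-test loop below illustrates. The class name, import path, and constructor argument are hypothetical placeholders for the open-sourced environment, and the random policy merely stands in for the Tianshou agents used in the study.

import numpy as np

from drl_base_ems import HevEnv   # hypothetical import path for the open-sourced environment

env = HevEnv(action_space="continuous")   # assumed constructor argument
state = env.reset()
episode_reward = 0.0
done = False
while not done:
    action = np.random.uniform(0.0, 1.0)          # power-split action in [0, 1], cf. Eq. (11)
    state, reward, done, info = env.step(action)  # Gym-style step
    episode_reward += reward                      # negative normalized energy use, Eq. (15)
print(f"episode reward: {episode_reward:.1f}")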
19
20
21 4.2 DRL results comparison
22
23
Fig. 7 depicts the reward trajectories along the training process of the five DRL algorithms with continuous action space. Each curve is smoothed with a moving average of 10. The computation time is the time consumed during the entire learning process. The convergence reward is the mean reward after convergence, the convergence episode is the episode at which the reward reaches the convergence reward, and the convergence time is the time at which the agent reaches the convergence episode. The final test reward and final test miles per gallon equivalent (MPGE) are the test results on three different driving cycles (UDDS, WLTP, HWFET) using the best-performing agent saved during the learning process on WLTP. MPGE accounts for the difference between the initial and final SOC of the battery by converting the SOC difference to gasoline using the 33.7 kWh-per-gallon ratio provided by the EPA [49]. These results reveal that all five continuous-action DRL-based EMSs converge within 100 episodes. The starting points of the training curves differ because the initial network parameters are stochastic. The comparison shows that PPO, TRPO, and TD3 learn faster at the beginning of training, whereas the reward of SAC changes slowly compared with the other algorithms. As shown in Table 2, SAC has the highest convergence reward, but its convergence is much slower than that of the other DRL algorithms. PPO has the fastest convergence: it requires 72.9%, 79.0%, 95.1%, and 91.3% less time than TRPO, TD3, SAC, and DDPG, respectively, to reach convergence. SAC's convergence reward is 0.50%, 0.06%, 0.09%, and 0.04% higher than PPO, TRPO, TD3, and DDPG, respectively. TD3's learning curve is also more stable (flatter) than those of SAC, DDPG, PPO, and TRPO after convergence. For the same total number of training episodes, PPO saves 38.4%, 34.4%, 57.6%, and 27.9% of the time compared with TRPO, TD3, SAC, and DDPG, respectively. In addition, SAC has the highest test reward and MPGE in WLTP and UDDS; its test reward is 10.9%, 15.7%, 0.40%, and 2.48% higher than PPO, TRPO, TD3, and DDPG, respectively, in WLTP.
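For concreteness, the SOC-corrected MPGE described above can be computed as in the short sketch below; the input numbers are illustrative only, the 33.7 kWh-per-gallon conversion is the cited EPA figure [49], and the 2 kWh capacity follows Table 1.

def mpge_with_soc_correction(fuel_gal: float, miles: float,
                             soc_start: float, soc_end: float,
                             battery_kwh: float = 2.0) -> float:
    """Charge-corrected miles per gallon equivalent.

    Net battery energy drawn over the cycle is converted to an equivalent gasoline
    volume at 33.7 kWh per gallon.
    """
    battery_gal = (soc_start - soc_end) * battery_kwh / 33.7
    return miles / (fuel_gal + battery_gal)

# Illustrative numbers only (one UDDS cycle is roughly 7.45 miles).
print(mpge_with_soc_correction(fuel_gal=0.16, miles=7.45, soc_start=0.60, soc_end=0.58))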
Fig. 7 Comparison of DRL algorithms in continuous action space.
33
34
Table 2 Parameters of the DRL algorithms in continuous action space.

AGENT | COMPUTATION TIME (S) | CONVERGENCE REWARD | CONVERGENCE EPISODE | CONVERGENCE TIME (S) | UDDS TEST REWARD | UDDS TEST MPGE | WLTP TEST REWARD | WLTP TEST MPGE | HWFET TEST REWARD | HWFET TEST MPGE
PPO | 866 | -1412.2 | 11 | 39 | -774.1 | 32.57 | -1573.3 | 30.93 | -932.2 | 37.11
TRPO | 1406 | -1406.1 | 21 | 144 | -782.2 | 32.20 | -1662.0 | 29.23 | -933.6 | 37.01
TD3 | 1321 | -1406.4 | 31 | 186 | -602.7 | 41.84 | -1406.8 | 34.56 | -931.2 | 37.16
SAC | 2042 | -1405.2 | 74 | 797 | -601.5 | 41.94 | -1401.2 | 34.68 | -932.6 | 37.10
DDPG | 1201 | -1405.7 | 75 | 446 | -605.8 | 41.62 | -1436.8 | 33.81 | -934.2 | 37.03
47
48
For DRL with a discrete action space, the action discretization matters and is therefore worth investigating. Fig. 8 shows the learning curves of DQN under discrete action spaces containing 50, 200, 2000, and 5000 actions, respectively. Each curve is smoothed with a moving average of 10. According to the figure, DQN needs the fewest episodes to converge with an action space of 50 and the most with 5000. As shown in Table 3, the computation time differs little among the action spaces of 50, 200, 2000, and 5000; however, the time to reach convergence with 50 actions is much shorter than with 200, 2000, or 5000. The final test reward of DQN with 2000 actions is the highest, 1.11%, 1.04%, and 0.09% higher than with 50, 200, and 5000 actions, respectively, in WLTP. DQN with 2000 actions also has the best fuel economy of 35.01 MPGE. In the following analysis, 2000 is therefore selected as the action space discretization.
Fig. 8 Comparison of DQN in different discrete action spaces.
38
39
Table 3 Parameters of the DQN algorithm in different discrete action spaces.

ACTION SPACE | COMPUTATION TIME (S) | CONVERGENCE REWARD | CONVERGENCE EPISODE | CONVERGENCE TIME (S) | UDDS TEST REWARD | UDDS TEST MPGE | WLTP TEST REWARD | WLTP TEST MPGE | HWFET TEST REWARD | HWFET TEST MPGE
50 | 1045 | -1408 | 6 | 30 | -599.2 | 42.09 | -1404.9 | 34.58 | -933.8 | 37.04
200 | 1041 | -1412 | 28 | 146 | -600.0 | 42.03 | -1403.9 | 34.63 | -932.9 | 37.09
2000 | 1067 | -1402 | 31 | 155 | -572.5 | 44.14 | -1389.3 | 35.01 | -932.5 | 37.11
5000 | 1074 | -1415 | 77 | 528 | -546.9 | 46.24 | -1390.6 | 35.01 | -932.6 | 37.10
50
51
52
Fig. 9 depicts the reward trajectories along the training process of the DRL algorithms with discrete action space, including DQN, categorical DQN (C51), QR-DQN, IQN, FQF, Rainbow DQN, SAC, double DQN (D2QN), and dueling double DQN (D3QN). Each curve is smoothed with a moving average of 10. As shown in Table 4, DQN, D2QN, and IQN have similar computation times; D3QN, SAC, and C51 have similar computation times; and QR-DQN and Rainbow have similar computation times for the same total number of episodes. FQF has the best convergence reward, but its convergence is much slower than that of DQN, D2QN, D3QN, IQN, SAC, and C51. The convergence speed differs among the algorithms: DQN reaches convergence fastest, using 179 s, similar to D2QN and IQN, slightly quicker than SAC, D3QN, C51, and QR-DQN, and remarkably quicker than FQF and Rainbow. Although IQN, SAC, DQN, D2QN, and D3QN converge much faster, their convergence rewards are lower than those of FQF and Rainbow. D3QN is much more stable than DQN and D2QN during learning; its training curve settles around an average of -1406.1, the most stable learning curve among all the algorithms. DQN saves 3.2%, 40.7%, 43.5%, 17.1%, 55.1%, 45.4%, 65.8%, and 67.4% of the computation time needed to reach convergence compared with D2QN, D3QN, SAC, IQN, C51, QR-DQN, FQF, and Rainbow, respectively. FQF has the best convergence reward, which is 0.14%, 0.28%, and 0.07% higher than QR-DQN, C51, and Rainbow; 1.34%, 0.21%, and 0.14% higher than DQN, D2QN, and D3QN; and 0.35% and 0.64% higher than IQN and SAC, respectively. For the same total number of training episodes, DQN saves 12.0%, 27.4%, 3.4%, 33.4%, 43.4%, 17.2%, 51.4%, and 56.8% of the time compared with IQN, SAC, D2QN, D3QN, FQF, C51, QR-DQN, and Rainbow, respectively. FQF also has the best MPGE (35.49) among all algorithms in WLTP.
Fig. 9 Comparison of DRL algorithms in discrete action space.
56
57
Table 4 Parameters of the DRL algorithms in discrete action space.

AGENT | COMPUTATION TIME (S) | CONVERGENCE REWARD | CONVERGENCE EPISODE | CONVERGENCE TIME (S) | UDDS TEST REWARD | UDDS TEST MPGE | WLTP TEST REWARD | WLTP TEST MPGE | HWFET TEST REWARD | HWFET TEST MPGE
IQN | 1167 | -1409 | 35 | 216 | -586.8 | 43.00 | -1402.5 | 34.65 | -931.2 | 37.16
SAC | 1414 | -1413 | 48 | 317 | -574.7 | 43.94 | -1386.7 | 35.06 | -932.9 | 37.08
DQN | 1027 | -1423 | 31 | 179 | -572.7 | 44.14 | -1389.3 | 35.01 | -932.6 | 37.10
D2QN | 1063 | -1407 | 35 | 185 | -575.0 | 43.92 | -1396.3 | 34.82 | -931.6 | 37.13
D3QN | 1542 | -1406 | 46 | 302 | -602.8 | 41.84 | -1406.1 | 34.56 | -930.6 | 37.18
FQF | 1813 | -1404 | 56 | 523 | -572.6 | 44.16 | -1371.6 | 35.49 | -934.4 | 37.03
C51 | 1241 | -1408 | 63 | 399 | -578.9 | 43.61 | -1378.7 | 35.27 | -933.7 | 37.05
QR-DQN | 2113 | -1406 | 32 | 328 | -583.9 | 43.27 | -1372.2 | 35.45 | -933.9 | 37.05
RAINBOW | 2378 | -1405 | 60 | 549 | -602.1 | 41.90 | -1398.8 | 34.77 | -932.4 | 37.11
18
19
20
21
4.3 DRL and baseline methods comparison
22
23
24
Besides the comparison among DRL algorithms, a comparison against conventional EMSs is also needed to show the difference between DRL and conventional approaches. In this section, two popular conventional EMSs, a rule-based strategy and ECMS, are applied. The detailed comparison is shown in Fig. 10. Fig. 10(a) shows the driving speed during one complete UDDS driving cycle. Comparing Fig. 10(a), Fig. 10(b), and Fig. 10(c), the vehicle speed is determined by the engine and battery power: the speed increases as the engine and battery output increase, whereas during braking the engine power decreases and the battery charges using the energy recovered from braking. Comparing Fig. 10(c) and Fig. 10(d), the engine power has the same shape as the engine fuel rate. Comparing Fig. 10(c) and Fig. 10(e), the engine power and the negative part of the battery power have opposite shapes, which means the battery also charges while the engine outputs power. Fig. 10(f) shows the cumulative fuel consumption over one complete driving cycle. The rule-based agent consumes 521.6 g, the ECMS agent 494.5 g, the SAC agent 478.5 g, and the FQF agent 468.1 g of fuel; the corresponding fuel economies are 38.2, 40.3, 41.9, and 44.2 MPGE, respectively. The FQF agent has the lowest cumulative fuel consumption. SAC's fuel consumption is 8.26% and 3.24% less than the rule-based strategy and ECMS, respectively, while FQF's fuel consumption is 10.26% and 5.34% less than the rule-based strategy and ECMS, respectively.
50
51
52
53
54
55
56
57
58
59
60
61
62 25
63
64
65
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32 Fig. 10 Comparison of SAC, FQF, rule-based, and ECMS EMSs in UDDS Test.
33
34
35 5. Discussion
36
37
Based on the fuel economy comparison, all 13 DRL-based EMSs show higher MPGE than the rule-based and ECMS methods. This observation is encouraging and indicates better optimality of the DRL methods compared with the conventional methods. One interesting observation is that seven of the nine DRL algorithms with discrete action space exhibit higher test rewards than all five DRL algorithms with continuous action space. Even though DRL algorithms are general, their performance is application dependent: the HEV EMS application favors DRL algorithms with discrete action space, even though some other applications favor continuous action space. In addition, algorithms that perform well in the OpenAI Gym environments do not guarantee good performance in the HEV EMS environment, such as Rainbow and C51. Based on the analysis of the results, FQF is recommended for future research on the HEV EMS application for its high reward. If computation cost is also considered, DQN is the first option, as it saves 43.4% of the computation time compared with FQF.

Even though DRL algorithms show promising reward optimization performance, some issues remain and require researchers' attention. For example, the learning curve fluctuates considerably, which leads to uncertainty in the final model obtained after the entire learning process. It is usually suggested to run the learning process multiple times and average the reward across the runs when plotting the learning curve; however, this only smooths the curve and makes it look better, while the uncertainty of the final model is not addressed. One way to reduce this uncertainty, adopted in this paper, is to replace the final model with the best model saved during learning, i.e., the one that generates the highest reward in the learning process. However, even though that model generates the best reward, it may simply have benefited from lucky exploration and thus lack generality when the environment conditions shift slightly (i.e., it lacks robustness and adaptiveness to environment variation).
11
12
13
6. Conclusion
14
15
16
This paper introduces and benchmarks 13 popular deep reinforcement learning (DRL) algorithms in the HEV EMS application. DRL algorithms with both discrete and continuous action spaces are considered and compared. In the continuous action space, SAC has the highest MPGE and the highest test reward, while PPO takes the least time to finish training. According to the training curves, SAC's test reward is 10.9%, 15.7%, 0.40%, and 2.48% higher than PPO, TRPO, TD3, and DDPG, respectively, in the WLTP test, and PPO saves 38.4%, 34.4%, 57.6%, and 27.9% of the training time compared with TRPO, TD3, SAC, and DDPG, respectively. In the discrete action space, DQN has the lowest time cost and FQF the best test reward. DQN saves 12.0%, 27.4%, 3.4%, 33.4%, 43.4%, 17.2%, 51.4%, and 56.8% of the time compared with IQN, SAC, D2QN, D3QN, FQF, C51, QR-DQN, and Rainbow, respectively. FQF has the best convergence reward, which is 0.14%, 0.28%, and 0.07% higher than QR-DQN, C51, and Rainbow; 1.34%, 0.21%, and 0.14% higher than DQN, D2QN, and D3QN; and 0.35% and 0.64% higher than IQN and SAC, respectively. Comparing SAC, FQF, the rule-based strategy, and ECMS, the fuel consumption of SAC is 8.26% and 3.24% less than the rule-based strategy and ECMS, respectively, while the fuel consumption of FQF is 10.26% and 5.34% less than the rule-based strategy and ECMS, respectively.

Future work will focus on improving learning efficiency and robustness. The discretized actions in the discrete action space may cause oscillations after the system reaches a steady state; optimizing the neural networks of the DRL algorithms is therefore needed to maintain stable performance. A test on a real vehicle will also be implemented in the future.
46
47
48 Reference
49
50 [1] B. M. Al-Alawi and T. H. Bradley, “Review of hybrid, plug-in hybrid, and electric vehicle market modeling Studies,”
51 Renewable and Sustainable Energy Reviews, vol. 21, pp. 190–203, May 2013, doi: 10.1016/j.rser.2012.12.048.
52
53 [2] G. Jinquan, H. Hongwen, P. Jiankun, and Z. Nana, “A novel MPC-based adaptive energy management strategy in plug-
54 in hybrid electric vehicles,” Energy, vol. 175, pp. 378–392, May 2019, doi: 10.1016/j.energy.2019.03.083.
55
56 [3] J. Guo, H. He, and C. Sun, “ARIMA-Based Road Gradient and Vehicle Velocity Prediction for Hybrid Electric Vehicle
57 Energy Management,” IEEE Transactions on Vehicular Technology, vol. 68, no. 6, pp. 5309–5320, Jun. 2019, doi:
58 10.1109/TVT.2019.2912893.
59
60 [4] F. Malmir, B. Xu, and Z. Filipi, “A Heuristic Supervisory Controller for a 48V Hybrid Electric Vehicle Considering
61 Fuel Economy and Battery Aging,” SAE Technical Papers, Dec. 2018, doi: 10.4271/2019-01-0079.
62 27
63
64
65
1
2 [5] P. Pisu and G. Rizzoni, “A Comparative Study Of Supervisory Control Strategies for Hybrid Electric Vehicles,” IEEE
3 Transactions on Control Systems Technology, vol. 15, no. 3, pp. 506–518, 2007, doi: 10.1109/TCST.2007.894649.
4
5 [6] H. A. Borhan, A. Vahidi, A. M. Phillips, M. L. Kuang, and I. V. Kolmanovsky, “Predictive energy management of a
6 power-split hybrid electric vehicle,” in 2009 American Control Conference, Jun. 2009, pp. 3970–3976. doi:
7 10.1109/ACC.2009.5160451.
8
9 [7] L. V. Pérez, G. R. Bossio, D. Moitre, and G. O. García, “Optimization of power management in an hybrid electric
10 vehicle using dynamic programming,” Mathematics and Computers in Simulation, vol. 73, no. 1–4, pp. 244–254, Nov. 2006,
11 doi: 10.1016/j.matcom.2006.06.016.
12
13 [8] F. Allgöwer and A. Zheng, Nonlinear Model Predictive Control. Birkhäuser, 2012.
14
15 [9] B. Xu et al., “Parametric study on reinforcement learning optimized energy management strategy for a hybrid electric
16 vehicle,” Applied Energy, vol. 259, p. 114200, Nov. 2019, doi: 10.1016/j.apenergy.2019.114200.
17
18 [10] X. Lin, Y. Wang, P. Bogdan, N. Chang, and M. Pedram, “Reinforcement learning based power management for hybrid
19 electric vehicles,” in 2014 IEEE/ACM International Conference on Computer-Aided Design (ICCAD), 2014, pp. 33–38. doi:
20 10.1109/ICCAD.2014.7001326.
21
22 [11] X. Qi, G. Wu, K. Boriboonsomsin, M. J. Barth, and J. Gonder, “Data-Driven Reinforcement Learning–Based Real-
23 Time Energy Management System for Plug-In Hybrid Electric Vehicles,” Transportation Research Record, vol. 2572, no. 1, pp.
24 1–8, Jan. 2016, doi: 10.3141/2572-01.
25
26 [12] C. Liu and Y. L. Murphey, “Power management for Plug-in Hybrid Electric Vehicles using Reinforcement Learning
27 with trip information,” in 2014 IEEE Transportation Electrification Conference and Expo (ITEC), Jun. 2014, pp. 1–6. doi:
28 10.1109/ITEC.2014.6861862.
29
30 [13] T. Liu, X. Hu, S. E. Li, and D. Cao, “Reinforcement Learning Optimized Look-Ahead Energy Management of a Parallel
31 Hybrid Electric Vehicle,” IEEE/ASME Transactions on Mechatronics, vol. 22, no. 4, pp. 1497–1507, 2017, doi:
32 10.1109/TMECH.2017.2707338.
33
34 [14] Y. Zou, T. Liu, D. Liu, and F. Sun, “Reinforcement learning-based real-time energy management for a hybrid tracked
35 vehicle,” Applied Energy, vol. 171, pp. 372–382, Jun. 2016, doi: 10.1016/j.apenergy.2016.03.082.
36
[15] R. Xiong, J. Cao, and Q. Yu, “Reinforcement learning-based real-time power management for hybrid energy storage
37
system in the plug-in hybrid electric vehicle,” Applied Energy, vol. 211, pp. 538–548, Feb. 2018, doi:
38
39 10.1016/j.apenergy.2017.11.072.
40 [16] G. Du, Y. Zou, X. Zhang, Z. Kong, J. Wu, and D. He, “Intelligent energy management for hybrid electric tracked
41
vehicles using online reinforcement learning,” Applied Energy, vol. 251, p. 113388, 2019, doi: 10.1016/j.apenergy.2019.113388.
42
43 [17] Y. Hu, W. Li, H. Xu, and G. Xu, “An Online Learning Control Strategy for Hybrid Electric Vehicle Based on Fuzzy Q-
44 Learning,” Energies, vol. 8, no. 10, Art. no. 10, Oct. 2015, doi: 10.3390/en81011167.
45
46 [18] J. Wu, Y. Zou, X. Zhang, T. Liu, Z. Kong, and D. He, “An Online Correction Predictive EMS for a Hybrid Electric
47 Tracked Vehicle Based on Dynamic Programming and Reinforcement Learning,” IEEE Access, vol. 7, pp. 98252–98266, 2019,
48 doi: 10.1109/ACCESS.2019.2926203.
49
50 [19] B. Xu, F. Malmir, and Z. Filipi, “Real-Time Reinforcement Learning Optimized Energy Management for a 48V Mild
51 Hybrid Electric Vehicle,” SAE Technical Papers, 2019, doi: 10.4271/2019-01-1208.
52
53 [20] P. Wang, Y. Li, S. Shekhar, and W. F. Northrop, “Actor-Critic based Deep Reinforcement Learning Framework for
54 Energy Management of Extended Range Electric Delivery Vehicles,” in 2019 IEEE/ASME International Conference on
55 Advanced Intelligent Mechatronics (AIM), Jul. 2019, pp. 1379–1384. doi: 10.1109/AIM.2019.8868667.
56
57 [21] Z. Zhu, Y. Liu, and M. Canova, “Energy Management of Hybrid Electric Vehicles via Deep Q-Networks,” in 2020
58 American Control Conference (ACC), Jul. 2020, pp. 3077–3082. doi: 10.23919/ACC45564.2020.9147479.
59
60
61
62 28
63
64
65
1
2 [22] X. Qi, Y. Luo, G. Wu, K. Boriboonsomsin, and M. Barth, “Deep reinforcement learning enabled self-learning control
3 for energy efficient driving,” Transportation Research Part C: Emerging Technologies, vol. 99, pp. 67–81, Feb. 2019, doi:
4
10.1016/j.trc.2018.12.018.
5
6 [23] Y. Wu, H. Tan, J. Peng, H. Zhang, and H. He, “Deep reinforcement learning of energy management with continuous
7 control strategy and traffic information for a series-parallel plug-in hybrid electric bus,” Applied Energy, vol. 247, pp. 454–466,
8 Aug. 2019, doi: 10.1016/j.apenergy.2019.04.021.
9
10 [24] S. Inuzuka, F. Xu, B. Zhang, and T. Shen, “Reinforcement Learning Based on Energy Management Strategy for
11 HEVs,” in 2019 IEEE Vehicle Power and Propulsion Conference (VPPC), 2019, pp. 1–6. doi:
12 10.1109/VPPC46532.2019.8952511.
13
14 [25] J. Wu, Z. Wei, W. Li, Y. Wang, Y. Li, and D. U. Sauer, “Battery Thermal- and Health-Constrained Energy
15 Management for Hybrid Electric Bus Based on Soft Actor-Critic DRL Algorithm,” IEEE Transactions on Industrial Informatics,
16 vol. 17, no. 6, pp. 3751–3761, Jun. 2021, doi: 10.1109/TII.2020.3014599.
17
18 [26] J. Zhou, S. Xue, Y. Xue, Y. Liao, J. Liu, and W. Zhao, “A novel energy management strategy of hybrid electric vehicle
19 via an improved TD3 deep reinforcement learning,” Energy, vol. 224, p. 120118, Jun. 2021, doi: 10.1016/j.energy.2021.120118.
20
21 [27] A. Brooker, J. Gonder, L. Wang, E. Wood, S. Lopp, and L. Ramroth, “FASTSim: A model to estimate vehicle
22 efficiency, cost and performance,” National Renewable Energy Lab.(NREL), Golden, CO (United States), 2015.
23
24 [28] C. Baker, M. Moniot, A. Brooker, L. Wang, E. Wood, and J. Gonder, “Future Automotive Systems Technology
25 Simulator (FASTSim) Validation Report – 2021,” Renewable Energy, p. 101, 2021.
26
27 [29] V. Mnih et al., “Playing Atari with Deep Reinforcement Learning,” p. 9.
28
[30] V. Mnih et al., “Human-level control through deep reinforcement learning,” Nature, vol. 518, no. 7540, pp. 529–533,
29
30 Feb. 2015, doi: 10.1038/nature14236.
31 [31] T. Schaul, J. Quan, I. Antonoglou, and D. Silver, “Prioritized Experience Replay,” arXiv:1511.05952 [cs], Feb. 2016,
32
Accessed: Apr. 30, 2022. [Online]. Available: http://arxiv.org/abs/1511.05952
33
34 [32] M. Andrychowicz et al., “Hindsight Experience Replay,” in Advances in Neural Information Processing Systems, 2017,
35 vol. 30. Accessed: May 17, 2022. [Online]. Available:
36
https://proceedings.neurips.cc/paper/2017/hash/453fadbd8a1a3af50a9df4df899537b5-Abstract.html
37
38 [33] H. Hasselt, “Double Q-learning,” in Advances in Neural Information Processing Systems, 2010, vol. 23. Accessed: Apr.
39 30, 2022. [Online]. Available: https://proceedings.neurips.cc/paper/2010/hash/091d584fced301b442654dd8c23b3fc9-
40 Abstract.html
41
42 [34] Z. Wang, T. Schaul, M. Hessel, H. van Hasselt, M. Lanctot, and N. de Freitas, “Dueling Network Architectures for
43 Deep Reinforcement Learning,” arXiv:1511.06581 [cs], Apr. 2016, Accessed: Jan. 28, 2022. [Online]. Available:
44 http://arxiv.org/abs/1511.06581
45
46 [35] M. G. Bellemare, W. Dabney, and R. Munos, “A Distributional Perspective on Reinforcement Learning.” arXiv, Jul. 21,
47 2017. Accessed: Jul. 17, 2022. [Online]. Available: http://arxiv.org/abs/1707.06887
48
49 [36] M. Hessel et al., “Rainbow: Combining Improvements in Deep Reinforcement Learning,” arXiv:1710.02298 [cs], Oct.
50 2017, Accessed: Jan. 28, 2022. [Online]. Available: http://arxiv.org/abs/1710.02298
51
52 [37] S. Sutton, “Predicting and Explaining Intentions and Behavior: How Well Are We Doing?,” Journal of Applied Social
53 Psychology, vol. 28, no. 15, pp. 1317–1338, 1998, doi: 10.1111/j.1559-1816.1998.tb01679.x.
54
55 [38] R. S. Sutton and A. G. Barto, Reinforcement Learning, second edition: An Introduction. MIT Press, 2018.
56
57 [39] M. Fortunato et al., “Noisy Networks for Exploration,” arXiv:1706.10295 [cs, stat], Jul. 2019, Accessed: Apr. 30, 2022.
58 [Online]. Available: http://arxiv.org/abs/1706.10295
59
60
61
62 29
63
64
65
1
2 [40] W. Dabney, M. Rowland, M. G. Bellemare, and R. Munos, “Distributional Reinforcement Learning with Quantile
3 Regression,” arXiv:1710.10044 [cs, stat], Oct. 2017, Accessed: Feb. 01, 2022. [Online]. Available:
4
http://arxiv.org/abs/1710.10044
5
6 [41] W. Dabney, G. Ostrovski, D. Silver, and R. Munos, “Implicit Quantile Networks for Distributional Reinforcement
7 Learning,” in Proceedings of the 35th International Conference on Machine Learning, Jul. 2018, pp. 1096–1105. Accessed: Feb.
8 05, 2022. [Online]. Available: https://proceedings.mlr.press/v80/dabney18a.html
9
10 [42] D. Yang, L. Zhao, Z. Lin, T. Qin, J. Bian, and T.-Y. Liu, “Fully Parameterized Quantile Function for Distributional
11 Reinforcement Learning,” in Advances in Neural Information Processing Systems, 2019, vol. 32. Accessed: Apr. 29, 2022.
12 [Online]. Available: https://proceedings.neurips.cc/paper/2019/hash/f471223d1a1614b58a7dc45c9d01df19-Abstract.html
13
14 [43] T. P. Lillicrap et al., “Continuous control with deep reinforcement learning.” arXiv, Jul. 05, 2019. Accessed: Jul. 17,
15 2022. [Online]. Available: http://arxiv.org/abs/1509.02971
16
17 [44] S. Fujimoto, H. van Hoof, and D. Meger, “Addressing Function Approximation Error in Actor-Critic Methods,”
18 arXiv:1802.09477 [cs, stat], Oct. 2018, Accessed: Apr. 29, 2022. [Online]. Available: http://arxiv.org/abs/1802.09477
19
20 [45] J. Schulman, S. Levine, P. Moritz, M. I. Jordan, and P. Abbeel, “Trust Region Policy Optimization.” arXiv, Apr. 20,
21 2017. Accessed: Jul. 17, 2022. [Online]. Available: http://arxiv.org/abs/1502.05477
22
23 [46] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, “Proximal Policy Optimization Algorithms.” arXiv,
24 Aug. 28, 2017. Accessed: Jul. 17, 2022. [Online]. Available: http://arxiv.org/abs/1707.06347
25
26 [47] T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine, “Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement
27 Learning with a Stochastic Actor.” arXiv, Aug. 08, 2018. Accessed: Jul. 17, 2022. [Online]. Available:
28 http://arxiv.org/abs/1801.01290
29
30 [48] J. Weng et al., “Tianshou: a Highly Modularized Deep Reinforcement Learning Library.” arXiv, Sep. 22, 2021.
31 Accessed: Jul. 26, 2022. [Online]. Available: http://arxiv.org/abs/2107.14171
32
[49] O. US EPA, “Technology | Green Vehicle Guide.” https://www3.epa.gov/otaq/gvg/learn-more-technology.htm
33
34 (accessed Oct. 28, 2022).
Revised Manuscript with changes Marked
1
2
3
4
5
6
7
8
9
10
11 A Comparative Study of 13 Deep Reinforcement
Learning Based Energy Management Methods for a
12
13
14
15 Hybrid Electric Vehicle
16
17 Hanchen Wang1, Yiming Ye2, Jiangfeng Zhang2, Bin Xu1, *
18
19 1: The University of Oklahoma, School of Aerospace and Mechanical Engineering, 865 Asp Ave, Norman, OK, 73019, USA.
2: Clemson University, Department of Automotive Engineering, 4 Research Dr., Greenville, SC, 29607, USA.
20 * Corresponding author: Bin Xu, The University of Oklahoma, School of Aerospace and Mechanical Engineering, 865 Asp Ave, Norman, OK,
21 73019, USA. (binxu@ou.edu)
22
23 Abstract – Energy management strategy (EMS) has a huge impact on the energy efficiency of hybrid electric vehicles (HEVs). Recently,
24 fast-growing number of studies have applied different deep reinforcement learning (DRL) based EMS for HEVs. However, a unified
25
performance review benchmark is lacking for most popular DRL algorithms. In this study, 13 popular DRL algorithms are applied as
26
27 HEV EMSs. The reward performance, computation cost, and learning convergence of different DRL algorithms are discussed. In
28 addition, HEV environments are modified to fit both discrete and continuous action spaces. The results show that the stability of agent Formatted: Font color: Red
29
during the learning process of continuous action space is more stable than discrete action space. In the continuous action space, SACTD3
30
31 has the highest reward, and PPODDPG has the lowest time cost. In discrete action space, DQN has the lowest time cost, and FQFQR-
32 DQN has the highest reward. The comparison among SACTD3, FQFQR-DQN, rule-based, and equivalent consumption minimization
33
strategies (ECMS) shows that DRL EMSs run the engine more efficiently, thus saving fuel consumption. The fuel consumption of
34
35 FQFQR-DQN is 10.264.97% and 5.3410.31% less than Rule-based and ECMS, respectively. The contribution of this paper will speed up
36 the application of DRL algorithms in the HEV EMS application.
37
38 Keywords: Hybrid electric vehicle, Energy management strategy, Deep reinforcement learning.
39
40 1. Introduction
41
42
Driven by the emerging demand for low emission and less dependence on fossil fuel energy, vehicle original equipment
43
44 manufacturers (OEMs) and researchers work on many potential solutions. Hybrid electric vehicles (HEVs) are investigated for
45
their high energy efficiency and low emission. HEV combines an internal combustion engine (ICE) and one or more electric motors
46
47 for traction, which are fuel efficient and environmentally friendly [1].
48
When a vehicle contains multiple power sources, energy management strategy (EMS) is critical. By coordinating multiple
49
50 output power sources, vehicle fulfill the power demand during the driving. It has electric or power split modes [2]. The aim of
51
using EMS is to have a max powertrain system efficiency and have a low fuel consumption [3]. In the area of HEV EMS, there
52
53 are several popular methods, including rule-based method [4], Equivalent Consumption Minimization Strategy (ECMS) [5], Model
54 1
55
56
57
58
59
60
61
62
63
64
65
1
2
3
4
5
6
7
8
9
10 Predictive Control (MPC) [6], and Dynamic Programming (DP) [7]. The rule-based method uses expert knowledge and has the
11
least computation cost among all the EMSs. However, its dependence on expert knowledge leads to inconsistency and lack of
12
13 optimality. ECMS optimizes the equivalent fuel consumption at each time step. It can be easily implemented in real-time. The one Formatted: Font color: Red
14
step optimization results in an inferior fuel-saving performance This one step optimization turns out a week performance in real
15
16 time. Compare to ECMS, MPC consider multiple time steps and optimization in a moving horizon [8]. MPC has better optimization
17
performance but costs more computation resources due to its multiple time step optimization. Also, MPC generates local optimal
18
19 results and does not guarantee a global optimum. DP can guarantee the global optimal solution, but it has a problem of high cost
20
on computation resources and thus, it is usually implemented offline. Some researchers combined Machine Learning with DP’s
21
22 rules and tested in real time. However, during the implementation, it was found that some key data information could be missed
23
and affect the final result. The advantage of using Reinforcement Learning (RL) supervisory control is that it costs lower
24
25 computation resource than DP and can be applied in real-time.
26
In RL, the goal of the agent is to maximize the final result which is also the cumulative reward. The agent needs to distinguish
27
28 which actions have positive effect to the cumulative reward by using an error compare process. Each action has the ability of
29
effecting current and future rewards. The main steps of RL are that the agent receives the states from environment, then taking
30
31 specific actions, and receiving rewards and analyzing the effect of the action to update the model. One important factor during the
32
learning is the agent utilizes exploitation and exploration. Exploration is used to collect more information from the environment
33
34 and Exploitation is used to output an action based on current knowledge. The 𝜀epsilon- greedy algorithm is one of methods to
35
balance the exploitation and exploration. It chooses exploration and exploitation randomly with different probability 𝜀. In most Formatted: Font color: Red
36
37 time, the chance of exploring is small, and it chooses the greedy action to get the highest reward by exploiting. Another important Formatted: Font color: Red
38 Formatted: Font color: Red
factor affecting RL performance is the environment. Throughout the interactions with the environment, the agent stores the states
39 Formatted: Font color: Red
40 and generate suitable actions to improve the accumulative results. The environment utilized in RL is usually assumed to be a Formatted: Font color: Red
41
Markov Decision Process, which means the conditional probability distribution of the future environment’s states only depends on
42
43 the current state instead of previous states. In the basic situation, an agent interacts with environment to receive states and rewards
44
and outputs the action. In the HEV energy-management problem, the environment model can be regarded as the driving conditions,
45
46 powertrain dynamics, etc. [9]. The agents are power-split controllers, which can be driven by different RL algorithms. The objective
47
of this controller is to search for a sequence of actions, so that vehicle fuel economy is optimal.
48
49 Recently, many RL-based HEV EMS studies have been conducted. In [10], the results comparison showed that temporal Formatted: Font color: Red
50
difference learning algorithm had a good fuel saving performance in saving fuel consumption in both real-world and testing cycles. Formatted: Font color: Red
51
52 The research also showed that this power management policy does not need complete information of driving cycle in prior.temporal
53
54 2
55
56
57
58
59
60
61
62
63
64
65
1
2
3
4
5
6
7
8
9
10 difference learning was introduced to control HEV. And it was showed that temporal difference learning had a relatively better
11
performance in convergence part and can be applied in non-Markovian environment. In [11], EMS was implemented on a plug-in
12
13 HEV which discussed options to charge along the way. In [12], RL was utilized to minimize the total consumption of fuel and
14
electricity in a plug-in HEV. In [13], RL was used to implement EMS on a parallel HEV to generate prediction. Based on the
15
16 driving cycle data, the vehicle speed is predicted by Nearest Neighbors predictor and fuzzy predictor. Q-learning algorithm was
17
used for energy management. In [14] and [15], RL was utilized for control optimization based on Transition Probability Matrix
18
19 (TPM) which updated by Kullback–Leibler divergence. In [16], in order to improve the convergence performance, fast Q-learning
20
algorithm was implemented. Cloud computation was proposed to solve the problem of computational burden in learning process.
21
22 In addition, [17] came up with the idea called Fuzzy Q-learning algorithm where Fuzzy parameters are optimized by Q value
23
function and neural network is used for estimation. In [18], a fuzzy logic controller was added based on RL algorithms to improve
24
25 the accuracy of online energy management prediction. In [19], a new function which formed by weighted fuel and battery
26
consumption was added to Q-learning algorithm and applied on a 48v HEV. In [20], actor-critic framework is introduced to
27
28 optimized parameters in engine control logic. In [21], researchers used deep reinforcement learning to train the agent and compared
29
against DP, ECMS and DQN .a model-free deep reinforcement learning method (DQN) was obtained for a mild-HEV in ECMS
30
31 with a traffic model built in SUMO. The results were compared with solutions from DP and A-ECMS. It showed that deep
32
reinforcement learning provides a more general framework when extra information and situations were considered.
33
34 Besides the study of single agent RL algorithms, some research also applied multiple agent RL algorithms. Qi et al. [22]
35
compared natural DQN with dueling DQN in energy management on Plug-in HEV and dueling DQN shows a faster convergence
36
37 speed than nature DQN. Wu et al. [23] developed a DRL-based EMS for parallel plug-in hybrid electric bus in continuous space
38
using deep deterministic policy gradient (DDPG). Inuzuka et al. [24] used the proximal policy optimization (PPO) to solve the
39
40 continuous space problem and PPO learned robustly during the learning loop. In [25], Wu et al. utilized soft actor-critic (SAC)
41
DRL algorithm to allot the energy and SAC showed a relatively better performance in convergence and optimization. Zhou et al.
42
43 [26] developed a novel EMS using Twin Delayed DDPG (TD3) DRL algorithm and compared TD3 with DDPG, DQN to show its
44
advantage. Based on the literature review, a research gap is found, which is a unified benchmark of different DRL based EMSs for
45
46 electrified powertrain is lacking. However, iIn most existing studies, it only contained up to 2-3 DRL algorithms and showed the Formatted: Font color: Red
47
final reward without computation cost information. As different DRL algorithms are studied in different EMS literature, it is crucial
48
49 to have a unified benchmark based on various popular DRL algorithms to make a fair comparison of their performance.
50
Since a unified benchmark of different DRL algorithms is lacking, Tthis study introduces an OPEN-AI Gym-like HEV model Formatted: Font color: Red
51
52 programmed in Python rather than Matlab/Simulink for better DRL-based EMS evaluation. Both discrete and continuous action
53
spaces are formulated to control the power split between engine and electric motor. Then 13 popular DRL algorithms are introduced
54 3
55
56
57
58
59
60
61
62
63
64
65
1
2
3
4
5
6
7
8
9
10 for comparison purposes. To facilitate the comparison, hyperparameters of different DRL algorithms are unified, such as neural
11
12 networks architecture, batch size, etc. Learning is conducted on a WLTP highway-urban conbinedcombined driving cycle and test
13 is conducted on UDDS urban driving cycle and HWFET highway driving cycle. Fuel economy, computation cost, and convergence
14
15 are discussed for continuous and discrete action space algorithms. The best DRL algorithms of each action space type are then
16 compared with baseline EMS. The contribution of this work is summarized as follows:
17
18
 Unlike conventional vehicle models built in Matlab/Simulink, an OPEN-AI Gym like HEV propulsion model is built in
19
20 Python. Following the format of OPEN-AI Gym, this vehicle environment can be directly connected to all popular DRL
21 frameworks. The entire code sets utilized in this benchmark generation are made available on GitHub
22
23 (https://github.com/LittleWebCat/DRL-Base-EMS) so that this benchmark can be utilized in newly developed DRL algorithm
24 evaluation in HEV EMS field.
25
26  Unlike conventional vehicle models built in Matlab/Simulink, an OPEN-AI Gym like HEV propulsion model is built in
27 Python.
28
29  13 popular DRL algorithms are introduced with architecture diagram and both discrete and continuous action space algorithms
30 are considered.
31
32  Key measures, including cumulative reward, convergence reward, convergence fuel economy, convergence episode, and
33 computation training time, are compared.
34
35 Unlike conventional vehicle models built in Matlab/Simulink, an OPEN-AI Gym like HEV propulsion model is built in
36 Python.
37
38  The entire code sets utilized in this benchmark generation are made available on GitHub Formatted: Font: (Default) Times New Roman, 10 pt
39 (https://github.com/LittleWebCat/DRL-Base-EMS) so that this benchmark can be utilized in newly developed DRL algorithm Formatted: Space After: 8 pt
40
41 evaluation in HEV EMS field.
42
43 The remainder of this paper is organized as follows. Section 2 will present the model of the vehicle propulsion system. Each
44 DRL algorithm is introduced in section 3 with clear background information and equations. Section 4 is divided into two parts ;
45
46 one is simulation set up and the other is a comprehensive analysis of the result of the learning process. Subsection 4.1 shows how
47 the simulation is carried out. Subsection 4.2 analyzes the results of different DRL algorithms under continuous and discrete action
48
49 spaces, along with comparisons between different DRL algorithms. Some key parameter metrics are compared too. Section 5
50 discusses the results and ideas during the whole experiment. Finally, Section 6 presents the conclusion and future work.
51
52
53
54 4
55
56
57
58
59
60
61
62
63
64
65
1
2
3
4
5
6
7
8
9
10 2. Environment Modeling
11
12 In this modeling section, the vehicle propulsion system model based on a Toyota High lander Parallel Hybrid vehicle is
13
14 introduced. The model is from the FASTsim software tool developed by National Renewable Energy Laboratory (NREL) in Python
15 language [27]. The vehicle propulsion system architecture is given in Fig. 1, . As shown in the figure, both the engine and electric
16
17 motor (EM) supply power to the front wheel. and tThe key specification of the vehicle is given in Table 1. The power is mainly
18 provided by a 3.6L engine, whereas a small battery pack is utilized in regenerative braking and power assisting. Vehicle dynamics,
19
20 battery, engine, electric motor models are given in this section. Model validation can be found in the literature [28].
21
22
23
24
25
26
27
28
29
30
31 Fig. 1. Vehicle propulsion system architecture.
32 Table 1 2016 Toyota Highlander Hybrid Vehicle specification.
33
34
Parameters Value
35
36 Vehicle overall weight 2403 kg
37
Internal combustion engine max power 172 kW
38
39 Electric motor max power 123 kW
40
Battery capacity 2 kWh
41
42 Aerodynamic drag coefficient 0.39
43
Vehicle front projection area 3.33 𝑚2
44
45 Tire radius 0.336 m
46
Tire rolling resistance coefficient 0.7
47
48 Wheel inertia 0.815 kg𝑚2
49
50 2.1 Vehicle Dynamics
51
52
The vehicle dynamics model aims to integrate all the power sources and convert them to vehicle acceleration. As shown in Eq.
53
54 5
55
56
57
58
59
60
61
62
63
64
65
1
2
3
4
5
6
7
8
9
10 (1), the vehicle overall power demand is calculated based all the power applied to the vehicle, including aerodynamic drag power
11
𝑃𝑎𝑒𝑟𝑜 , vehicle wheel rolling resistance power 𝑃𝑟𝑜𝑙𝑙 , vehicle ascending altitude power 𝑃𝑎𝑠𝑐𝑒𝑛 , vehicle inertia power 𝑃𝑖𝑛𝑒𝑟𝑡 , and
12
13 vehicle acceleration power 𝑃𝑎𝑐𝑐 . In Eq. (2), 𝜌𝑎𝑖𝑟 is the air density, 𝐶𝑑 is the air drag coefficient, 𝐴𝑓𝑟𝑜𝑛𝑡 is vehicle frontal projection
14
15 area, and 𝑣 is vehicle velocity. In Eq. (3), 𝐶𝑟 is rolling resistance coefficient, 𝑚 is vehicle overall weight, 𝑔 is gravity constant, and
16 𝜃 is road slop. In Eq. (5), ∆𝑡 is simulation time step, 𝐼𝑤ℎ𝑒𝑒𝑙 is vehicle one-wheel inertia, 𝑛𝑤ℎ𝑒𝑒𝑙 is the number of wheels, 𝑟𝑤ℎ𝑒𝑒𝑙 is
17
18 wheel radius, 𝑣𝑖+1 is the vehicle velocity at the next time step and 𝑣𝑖 is the vehicle velocity at current time step.
19
P_{dmd} = P_{aero} + P_{roll} + P_{ascen} + P_{inert} + P_{acc}  (1)

P_{aero} = \frac{1}{2} \rho_{air} C_d A_{front} v^3  (2)

P_{roll} = C_r v m g \cos\theta  (3)

P_{ascen} = v m g \sin\theta  (4)

P_{inert} = \frac{I_{wheel} n_{wheel}}{2\Delta t}\left(\frac{v_{i+1}}{r_{wheel}}\right)^2 - \frac{I_{wheel} n_{wheel}}{2\Delta t}\left(\frac{v_i}{r_{wheel}}\right)^2  (5)

P_{acc} = m v \dot{v}.  (6)
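For illustration, the road-load terms of Eqs. (1)-(6) can be evaluated for one time step as in the sketch below. The mass, drag coefficient, frontal area, wheel radius, and wheel inertia default values follow Table 1; the rolling-resistance coefficient, air density, gravity, and wheel count are assumed illustrative values, and Eq. (6) is read here as the acceleration power m v dv/dt.

import math

def power_demand(v, v_next, dt=1.0, theta=0.0,
                 mass=2403.0, cd=0.39, a_front=3.33,
                 c_r=0.007, r_wheel=0.336, i_wheel=0.815, n_wheel=4,
                 rho_air=1.2, g=9.81):
    """Total traction power demand P_dmd in watts, for speeds v and v_next in m/s."""
    p_aero = 0.5 * rho_air * cd * a_front * v ** 3                       # Eq. (2)
    p_roll = c_r * v * mass * g * math.cos(theta)                        # Eq. (3)
    p_ascen = v * mass * g * math.sin(theta)                             # Eq. (4)
    p_inert = i_wheel * n_wheel * ((v_next / r_wheel) ** 2 -
                                   (v / r_wheel) ** 2) / (2 * dt)        # Eq. (5)
    p_acc = mass * v * (v_next - v) / dt                                 # Eq. (6), as m*v*dv/dt
    return p_aero + p_roll + p_ascen + p_inert + p_acc                   # Eq. (1)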
34
35
2.2 Battery

The battery power demand is the sum of the EM electrical power and the vehicle auxiliary power, as shown in Eq. (7). The actual battery power P_bat differs from the battery power demand because of the energy loss in the internal resistance, and this loss is lumped into the battery efficiency η_bat. As shown in Eq. (8), the battery efficiency is applied differently for the charging and discharging cases. The actual battery power is connected to the battery SOC via the battery remaining capacity C_remain, as shown in Eqs. (9)-(10), where C_norm is the battery nominal capacity.

P_bat,dmd = P_EM,ele + P_aux    (7)

P_bat = −P_bat,dmd · η_bat    if P_bat < 0
P_bat = P_bat,dmd / η_bat     if P_bat ≥ 0    (8)

Ċ_remain = P_bat    (9)

SOC_bat = C_remain / C_norm    (10)
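A minimal sketch of the battery update in Eqs. (7)-(10) is given below. It treats a negative demand as charging and a positive demand as discharging, which is a slight simplification of the sign convention written in Eq. (8); the efficiency value and the kWh bookkeeping are illustrative assumptions, not the FASTSim implementation.

```python
def battery_step(p_em_ele, p_aux, c_remain_kwh, c_norm_kwh=2.0,
                 eta_bat=0.95, dt=1.0):
    """One battery update following Eqs. (7)-(10) (powers in kW, capacities in kWh)."""
    p_bat_dmd = p_em_ele + p_aux                 # Eq. (7)
    if p_bat_dmd < 0:                            # regenerative braking: charging
        p_bat = p_bat_dmd * eta_bat              # only a fraction of the power is stored
    else:                                        # propulsion: discharging
        p_bat = p_bat_dmd / eta_bat              # extra power covers the internal losses
    c_remain_kwh -= p_bat * dt / 3600.0          # Eq. (9) integrated over dt seconds
    soc = c_remain_kwh / c_norm_kwh              # Eq. (10)
    return p_bat, c_remain_kwh, soc
```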
2.3 Internal Combustion Engine and Electric Motor

The EM power demand P_EM,dmd is calculated based on the sign of the vehicle power demand and the battery SOC level, as shown in Eq. (11). When the vehicle power demand is negative (i.e., braking), all the power goes to the battery through the EM regenerative braking function, and the vehicle power is discounted by the transmission efficiency η_trans. When the battery SOC is below the reference SOC, the EM power demand is reduced to match only the auxiliary power so as to limit the SOC drop. In the remaining situations, the EM power demand is determined by the DRL action a. After the EM power demand is calculated, the ICE power demand compensates the difference between the overall vehicle power demand and the EM power demand, as shown in Eq. (12). The actual EM and ICE powers are calculated from the power demands and the component efficiencies, as shown in Eqs. (13)-(14). The ICE efficiency is a function of ICE power, as shown in Fig. 2(a), and it reaches its peak efficiency of 36% at a power level of 35 kW. Similarly, the EM efficiency is a function of EM power; as shown in Fig. 2(b), it peaks at 95% at a power level of 50 kW. Efficiency is usually modeled as a function of speed and torque. Here, it is modeled as a function of power to reduce the computation cost of the model, as described by the FASTSim development team [27], which means that speed and torque details are not provided by the model and only power values are available for the ICE and EM.

P_EM,dmd = P_dmd · η_trans       if P_dmd < 0
P_EM,dmd = P_EM,max (2a − 1)     if P_dmd > 0 and SOC > SOC_ref
P_EM,dmd = P_aux                 if P_dmd > 0 and SOC ≤ SOC_ref    (11)

P_ICE,dmd = 0                              if P_EM,dmd < 0
P_ICE,dmd = P_dmd / η_trans − P_EM,dmd     if P_EM,dmd ≥ 0    (12)

P_EM,ele = P_EM,dmd · η_EM,chg      if P_EM,dmd < 0
P_EM,ele = P_EM,dmd / η_EM,dischg   if P_EM,dmd ≥ 0    (13)

P_ICE = P_ICE,dmd / η_ICE    (14)
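The piecewise logic of Eqs. (11)-(12), which maps the DRL action a ∈ [0, 1] to an EM/ICE power split, can be sketched as follows. The reference SOC, auxiliary power, and transmission efficiency used as defaults are illustrative assumptions; the maximum EM power is taken from Table 1.

```python
def power_split(p_dmd, action, soc, soc_ref=0.6,
                p_em_max=123.0, p_aux=0.3, eta_trans=0.95):
    """Map the DRL action to EM and ICE power demands (Eqs. (11)-(12), powers in kW)."""
    if p_dmd < 0:                                   # braking: regenerate through the EM
        p_em_dmd = p_dmd * eta_trans
    elif soc > soc_ref:                             # normal operation: the action decides
        p_em_dmd = p_em_max * (2.0 * action - 1.0)  # a in [0,1] -> EM power in [-Pmax, Pmax]
    else:                                           # low SOC: only cover the auxiliary load
        p_em_dmd = p_aux
    if p_em_dmd < 0:                                # EM absorbs power, engine demand is zero
        p_ice_dmd = 0.0
    else:                                           # engine covers the remaining demand
        p_ice_dmd = p_dmd / eta_trans - p_em_dmd
    return p_em_dmd, p_ice_dmd
```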
Fig. 2 Internal combustion engine and electric motor efficiency curves: (a) engine efficiency versus engine power; (b) electric motor efficiency versus electric motor power.
2.4 Energy management

The reward of the DRL algorithms is defined in Eq. (15), and it accounts for the energy consumption of the battery and the ICE. A negative sign is added in Eq. (15) to convert the minimization problem into a maximization problem, since the term is used as a reward rather than as a loss or cost function; E_norm is a normalization energy constant. The energy consumption terms are calculated from the powers and the simulation time step, as shown in Eqs. (16)-(17).

r = −(E_bat + E_ICE) / E_norm    (15)

E_bat = P_bat Δt    (16)

E_ICE = P_ICE Δt    (17)

The action a is applied in Eq. (11) and lies in the range [0, 1]. It determines the EM power demand, and the ICE power demand is then the difference between the overall power demand and the EM power demand. For continuous DRL algorithms, the action can take any value in that range, whereas for discrete DRL algorithms it can only take a finite number of values, depending on the action discretization level.
The state vector of the DRL algorithms includes the vehicle power demand P_req, the vehicle speed v, and the battery SOC. The ranges of the three states are [−30 kW, 30 kW], [0 m/s, 30 m/s], and [40%, 90%], respectively. The vehicle power demand describes the overall power level, while the vehicle speed reveals information hidden inside the power demand: a given power demand could be the result of low speed with high torque or of high speed with low torque, so the speed is an important indicator of the vehicle operating status. In addition, the battery SOC is the most important variable indicating the remaining battery energy.
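Putting the model, state, and reward together, the environment can be wrapped in an OpenAI-Gym-style interface, which is also how the open-sourced code of this paper is organized. The skeleton below is a simplified illustration that reuses the vehicle_power_demand, power_split, and battery_step helpers sketched above; the class name, the normalization constant, and the omitted EM/ICE efficiencies are placeholders rather than the exact implementation.

```python
import numpy as np

class HEVEnvSketch:
    """Gym-like HEV environment: state = (P_dmd, v, SOC), action a in [0, 1]."""
    def __init__(self, cycle_speed_mps, e_norm=1.0, dt=1.0):
        self.cycle = cycle_speed_mps     # target speed trace of the driving cycle [m/s]
        self.e_norm = e_norm             # normalization constant of Eq. (15)
        self.dt = dt

    def reset(self):
        self.k, self.soc = 0, 0.65
        return np.array([0.0, self.cycle[0], self.soc], dtype=np.float32)

    def step(self, action):
        v, v_next = self.cycle[self.k], self.cycle[self.k + 1]
        p_dmd = vehicle_power_demand(v, v_next, theta=0.0, dt=self.dt) / 1000.0   # kW
        p_em, p_ice = power_split(p_dmd, float(action), self.soc)
        # EM/ICE efficiencies are omitted here for brevity; 2 kWh pack from Table 1
        p_bat, _, self.soc = battery_step(p_em, 0.3, self.soc * 2.0)
        reward = -(p_bat * self.dt + p_ice * self.dt) / self.e_norm               # Eqs. (15)-(17)
        self.k += 1
        done = self.k >= len(self.cycle) - 1
        obs = np.array([p_dmd, v_next, self.soc], dtype=np.float32)
        return obs, reward, done, {}
```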
3. Deep Reinforcement Learning Algorithms

In this section, the 13 DRL algorithms are introduced for the HEV EMS application. Only the key equations are presented due to the limited space, and more details can be found in the references cited in each subsection.

3.1 Deep Q-Networks (DQN)

A deep Q-network (DQN) is a multi-layer neural network that takes a given state s as input and outputs a vector of action values Q(s, a; θ), where θ denotes the network parameters. The DQN-based flow chart in Fig. 3 shows the basic structure shared by the DQN-family algorithms, which contains a replay buffer and two networks: the evaluation network and the target network. The agent interacts with the HEV environment and stores the transitions T_t = (s_t, a_t, r_t, s_{t+1}) in the replay buffer B_t = {T_1, T_2, …, T_t}. Mini-batches are sampled uniformly from the replay buffer; the evaluation network uses a mini-batch to calculate the state-action value Q(s, a; θ_i), while the target network uses the same mini-batch to generate the target Q value y_i^DQN. Given these two outputs, the loss function used to update the evaluation network at iteration i is

L_i(θ_i) = E_{(s,a,r,s′)∼U(D)} [ ( y_i^DQN − Q(s, a; θ_i) )² ],    (18)

with

y_i^DQN = r + γ max_{a′} Q(s′, a′; θ⁻),    (19)

where θ⁻ represents the parameters of a fixed and separate target network [29]. The target network parameters are held fixed for several iterations and then updated by copying the evaluation network parameters [30].
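As an illustration of Eqs. (18)-(19), a minimal PyTorch computation of the DQN loss on a sampled mini-batch is shown below. The network objects and the replay buffer are assumed to exist, and the termination mask is a standard addition; the experiments in this paper rely on the Tianshou implementations rather than this sketch.

```python
import torch
import torch.nn.functional as F

def dqn_loss(q_net, target_net, batch, gamma=0.9):
    """TD loss of Eqs. (18)-(19) for a mini-batch of transitions."""
    s, a, r, s_next, done = batch                                   # tensors
    q_sa = q_net(s).gather(1, a.long().unsqueeze(1)).squeeze(1)     # Q(s, a; theta_i)
    with torch.no_grad():
        # Eq. (19): bootstrap from the fixed target network theta^-
        y = r + gamma * (1.0 - done) * target_net(s_next).max(dim=1).values
    return F.mse_loss(q_sa, y)                                      # Eq. (18)
```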
Experience replay improves data efficiency by reusing samples in multiple updates. In addition, it reduces variance, because uniform sampling from the replay buffer reduces the correlation among the samples used in an update. Experience replay also has several improved variants, including prioritized experience replay [31], which is used in the Rainbow architecture of Fig. 4, and hindsight experience replay [32]; these variants are applied in different situations and can outperform the original experience replay.
Fig. 3 Architecture of the DQN-based algorithms.
3.2 Double DQN (D2QN)

To address the overestimation problem in DQN, which uses the same values both to select and to evaluate an action [33], double deep Q-networks (DDQN, denoted D2QN in this paper) uses the action chosen by the evaluation network as the input of the target network, so the target Q value is given by

y_i^DDQN = r + γ Q(s′, argmax_{a′} Q(s′, a′; θ_i); θ⁻).    (20)

The difference between DDQN and DQN lies in the targets y_i^DDQN and y_i^DQN [30]; as shown in Fig. 3 and Fig. 4, the different targets lead to different loss functions.
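The change from Eq. (19) to Eq. (20) is easy to see in code: the online network selects the greedy action and the target network evaluates it. A short sketch, mirroring the DQN snippet above (and reusing its imports):

```python
def double_dqn_target(q_net, target_net, r, s_next, done, gamma=0.9):
    """Target of Eq. (20): action selected online, evaluated by the target network."""
    with torch.no_grad():
        a_star = q_net(s_next).argmax(dim=1, keepdim=True)           # argmax_a' Q(s', a'; theta_i)
        q_eval = target_net(s_next).gather(1, a_star).squeeze(1)     # Q(s', a*; theta^-)
        return r + gamma * (1.0 - done) * q_eval
```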
3.3 Dueling Double DQN (D3QN)

The dueling network has a different structure from the plain Q-network: it splits into a state-value stream and an advantage stream, which are then merged to produce the output [34]. The state value represents the value of the state regardless of the action, and the advantage value shows the benefit of a specific action over the other actions in that state. The action value is given by

Q(s, a; φ) = V(s; σ) + ( A(s, a; β) − (1/N) Σ_{a′} A(s, a′; β) ),    (21)

where σ and β are the parameters of the two streams, φ stands for {σ, β}, and N is the number of actions [34].
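Equation (21) can be implemented as a small PyTorch module in which the advantage stream is centered by its mean before being added to the state value. The layer sizes below are illustrative; the state dimension and number of actions follow the EMS setup of this paper (three states, 2000 discrete actions).

```python
import torch.nn as nn

class DuelingQNet(nn.Module):
    """Dueling head of Eq. (21): Q(s,a) = V(s) + (A(s,a) - mean_a' A(s,a'))."""
    def __init__(self, state_dim=3, n_actions=2000, hidden=128):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU())
        self.value = nn.Linear(hidden, 1)               # V(s; sigma)
        self.advantage = nn.Linear(hidden, n_actions)   # A(s, a; beta)

    def forward(self, s):
        h = self.body(s)
        v, adv = self.value(h), self.advantage(h)
        return v + adv - adv.mean(dim=1, keepdim=True)  # Eq. (21)
```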
3.4 Distributional RL (C51)

Distributional RL is based on DQN but learns a value distribution instead of a value function. The agent interacts with the HEV environment, stores trajectories in the experience replay buffer, and samples transitions (s, a, r, s′) from it. The distributional Bellman equation in C51 is

T^π Z(s, a) := R(s, a) + γ P^π Z(s, a),    (22)

where Z is the value distribution, T^π is the Bellman operator, and P^π is the transition operator from Z to Z, P^π Z(s, a) := Z(s′, a′). Based on this equation, the Q value of C51 is

Q(s′, a) := Σ_i z_i p_i(s′, a | θ),    (23)

where p_i(s′, a) is the probability mass on each atom and z is a vector with N atoms,

z_i = V_min + i (V_max − V_min)/(N − 1)   for i ∈ {0, …, N − 1},    (24)

where V_max and V_min are fixed bounds chosen for the specific task. The aim is to update θ so that the predicted distribution approaches the true return distribution. Using the KL divergence, the loss is the cross-entropy term

D_KL( Φ_z Z(s, a*) || Z(s, a*) ),    (25)

minimized by gradient descent, where Φ_z is the projection of the target distribution onto the support z and a* ← argmax_a Q(s′, a) [35].
3.5 Rainbow DQN

Rainbow is based on DQN and combines six extensions that address different limitations and improve overall performance [36]: double Q-learning, prioritized replay, multi-step learning, the dueling network, distributional RL, and noisy networks. The flow chart is shown in Fig. 4. As shown in the figure, the Rainbow agent uses a prioritized replay buffer, instead of the basic experience replay buffer, to store the transition experience from the HEV environment and to sample mini-batches for the evaluation network. Rainbow uses the dueling network as the basic architecture and combines the double-Q structure with a multi-step target to calculate the target Q value. Of the six extensions, double Q-learning is described in Section 3.2, the dueling network in Section 3.3, and distributional RL in Section 3.4. The remaining three extensions are explained as follows.

Prioritized replay: One reason why DDQN can outperform DQN is prioritized experience replay (PER) [31]. The main contribution of PER is to increase the sampling probability of high-value experiences, weighted by the TD error, which can reduce the training time and improve the accuracy of the final results.

Multi-step learning: As shown in Fig. 4, Rainbow uses multi-step targets instead of the original single-step target. Instead of accumulating the reward for one step and bootstrapping from the next state, multi-step learning uses the next n steps [37]. The n-step return is given by

R_t^(n) ≡ Σ_{k=0}^{n−1} γ_t^(k) R_{t+k+1}.    (26)

Using the multi-step target, the loss function of Rainbow becomes

( R_t^(n) + γ_t^(n) Q_{θ′}(S_{t+n}, argmax_{a′} Q_θ(S_{t+n}, a′)) − Q_θ(S_t, A_t) )².    (27)

The learning speed can be regulated by tuning the parameter n of multi-step learning [38].

Noisy networks: Because of the limitations of ε-greedy exploration, the noisy network was developed by adding noisy streams to the linear layers [39]:

y = (a + bx) + ( a_noisy ⊙ ε^a + (b_noisy ⊙ ε^b) x ),    (28)

where ε^a and ε^b are random variables and ⊙ denotes the element-wise product.
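A short sketch of the n-step return of Eq. (26), computed from a stored reward sequence, is given below; it assumes a constant discount factor rather than the per-step γ_t of the notation above.

```python
def n_step_return(rewards, gamma=0.9, n=3):
    """Eq. (26): discounted sum of the next n rewards (constant gamma assumed)."""
    return sum(gamma ** k * rewards[k] for k in range(min(n, len(rewards))))

# Example: rewards collected over the next three steps
print(n_step_return([-1.2, -0.8, -1.0], gamma=0.9, n=3))
```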
Fig. 4 Architecture of the Rainbow algorithm.
3.6 Quantile Regression for Distributional RL (QR-DQN)

C51 uses N fixed locations and learns the probability mass on each of them, whereas QR-DQN uses fixed, uniform probabilities and learns the locations. Compared with C51, QR-DQN does not restrict or bound the value range and does not need the projection operation [40]; its loss is related to minimizing the Wasserstein metric, which gives a more precise approximation. In QR-DQN, the quantile distribution is expressed as

Z_θ(x, a) := (1/N) Σ_{i=1}^{N} δ_{θ_i(x,a)},    (29)

where Z_θ is the quantile distribution, θ_i are the support locations of the uniform probability distribution, and δ is the Dirac function. The Q value is obtained similarly to C51 as

Q(s′, a) := Σ_j q_j θ_j(s′, a),    (30)

where q_j are uniform weights with q_j = 1/N. QR-DQN uses the quantile Huber loss

(1/N) Σ_{i=1}^{N} Σ_{j=1}^{N} [ ρ^κ_{τ̂_i}( r + γ θ_j(s′, a*) − θ_i(s, a) ) ],    (31)

where ρ is the Huber loss and a* ← argmax_a Q(s′, a).
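The quantile Huber loss ρ^κ_τ used in Eq. (31), and again by IQN and FQF, penalizes under- and over-estimation asymmetrically according to the quantile fraction τ. A small PyTorch sketch (reusing the torch import from the DQN snippet; tau is assumed to broadcast against the TD-error tensor):

```python
def quantile_huber_loss(td_errors, tau, kappa=1.0):
    """rho^kappa_tau(u) of Eq. (31) for a tensor of TD errors u and fractions tau."""
    abs_u = td_errors.abs()
    huber = torch.where(abs_u <= kappa,
                        0.5 * td_errors ** 2,
                        kappa * (abs_u - 0.5 * kappa))           # standard Huber loss
    weight = (tau - (td_errors.detach() < 0).float()).abs()      # |tau - 1{u < 0}|
    return (weight * huber / kappa).mean()
```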
3.7 Implicit Quantile Value Network (IQN)

The difference between QR-DQN and IQN is that IQN adopts a distortion risk measure and takes a sampled fraction τ ∈ [0, 1], together with the state s, as input of the neural network to generate the output distribution [41]. The Q value is then given by

Q_β(s, a) := (1/N) Σ_{n=1}^{N} Z_{τ_n}(s, a),    (32)

where β: [0, 1] → [0, 1] is the distortion risk measure and Z_τ is the quantile value, Z_τ := F_Z^{-1}(τ), with F_Z^{-1} the quantile function of the return Z at the base fraction τ ∈ [0, 1]. To calculate the loss, the network takes two sets of base fractions τ_i, τ_j ∼ U([0, 1]) as input and generates the two quantile values Z_{τ_i}(s, a) and r + γ Z_{τ_j}(s′, a*). The temporal difference is

δ_t^{τ_i,τ_j} = r + γ Z_{τ_j}(s′, a*) − Z_{τ_i}(s, a),    (33)

where a* is the greedy action a* ← argmax_{a′} (1/N) Σ_{n=1}^{N} Z_{τ_n}(s′, a′). The loss function is

(1/N′) Σ_{i=1}^{N} Σ_{j=1}^{N′} ρ^κ_{τ_i}( δ_t^{τ_i,τ_j} ),    (34)

where N and N′ are the numbers of sampled fractions.
3.8 Fully Parameterized Quantile Function (FQF)

In IQN the quantile fractions are sampled, and in QR-DQN they are fixed, which limits the algorithms in practice. In FQF, a fraction proposal network generates the quantile fractions for each (s, a), and a quantile value network maps each fraction to its quantile value [42]. This self-adjusting structure approximates the true distribution better than IQN and QR-DQN. The approximated distribution is expressed as

Z(s, a) := Σ_{i=0}^{N−1} (τ_{i+1} − τ_i) δ_{θ_i(s,a)},    (35)

where δ is the Dirac function and τ_0, …, τ_N are the adjustable fractions (N−1 free values) with τ_0 = 0 and τ_N = 1. The quantile function F_Z^{-1} is the inverse of the cumulative distribution function F_Z, given by F_Z^{-1}(p) := inf{ z ∈ R : p ≤ F_Z(z) }. The distortion between the approximated and the true quantile function, measured by the 1-Wasserstein distance, is

W_1(Z, τ, θ) = Σ_{i=0}^{N−1} ∫_{τ_i}^{τ_{i+1}} | F_{Z,ω_1}^{-1}(φ) − θ_i | dφ,    (36)

where ω_1 is the parameter of the fraction proposal network. Minimizing the distortion over θ gives θ_i = F_Z^{-1}( (τ_i + τ_{i+1})/2 ), and the optimal τ is then obtained by minimizing the 1-Wasserstein loss

W_1(Z, τ) = Σ_{i=0}^{N−1} ∫_{τ_i}^{τ_{i+1}} | F_{Z,ω_1}^{-1}(φ) − F_{Z,ω_1}^{-1}( (τ_i + τ_{i+1})/2 ) | dφ    (37)

by gradient descent. The Q value of FQF is given by

Q(s, a) := Σ_{i=0}^{N−1} (τ_{i+1} − τ_i) F_{Z,ω_2}^{-1}( (τ_i + τ_{i+1})/2 ),    (38)

where ω_2 is the parameter of the quantile value network. The TD error between two fractions is

δ_t^{τ_i,τ_j} = r + γ F_{Z′,ω_2}^{-1}( (τ_j + τ_{j+1})/2 ) − F_{Z,ω_2}^{-1}( (τ_i + τ_{i+1})/2 ),    (39)

with the action taken from the greedy policy a* ← argmax_{a′} Q(s′, a′), and the loss function is

(1/N′) Σ_{i=1}^{N} Σ_{j=1}^{N′} ρ^κ_{τ_i}( δ_t^{τ_i,τ_j} ),    (40)

where ρ is the Huber loss and N and N′ are the numbers of samples.
3.9 Deep Deterministic Policy Gradient (DDPG)

As shown in Fig. 5, DDPG contains two kinds of neural network, the actor network and the critic network, together with an experience replay buffer. Each of the two networks contains two sub-networks, an online network and a target network [43]. The actor network interacts with the HEV environment and stores the transitions (s_t, a_t, r_t, s_{t+1}) in the experience replay buffer. Mini-batches (s_i, a_i, r_i, s_{i+1}) are sampled randomly from the buffer and fed into the actor and critic networks. The critic target network calculates the expected target return y_i using the action μ′(s_{i+1}) given by the actor target network:

y_i = r_i + γ Q′( s_{i+1}, μ′(s_{i+1}) ),    (41)

where γ is the discount factor, Q′ is the critic target network, and μ′ is the actor target network. With the target Q value, the critic loss is

L = (1/N) Σ_i ( y_i − Q(s_i, a_i) )²,    (42)

where N is the mini-batch size. With the help of the critic network, the actor policy is updated using the sampled policy gradient

∇_{θ^μ} J = (1/N) Σ_i [ ∇_a Q(s_i, μ(s_i)) ∇_{θ^μ} μ(s_i | θ^μ) ],    (43)

where θ^μ denotes the parameters of the online actor network. To improve learning stability, the target networks are updated softly and slowly with a small hyperparameter ε:

θ′ ← ε θ + (1 − ε) θ′,
θ^{μ′} ← ε θ^μ + (1 − ε) θ^{μ′},    (44)

where θ′ and θ are the parameters of the critic target and online networks, and θ^{μ′} is the parameter vector of the actor target network. As an off-policy algorithm that combines the target networks, the experience replay buffer, and the actor-critic structure, DDPG improves sample efficiency; however, some limitations remain, as discussed next.
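The critic target of Eq. (41) and the soft update of Eq. (44) translate directly into a few lines of PyTorch. The snippet below is a hedged sketch using generic actor/critic modules and the torch import from the earlier snippets, not the exact training loop used in the experiments.

```python
def ddpg_critic_target(critic_target, actor_target, r, s_next, done, gamma=0.9):
    """Eq. (41): bootstrap with the target actor's action and the target critic."""
    with torch.no_grad():
        a_next = actor_target(s_next)
        return r + gamma * (1.0 - done) * critic_target(s_next, a_next).squeeze(-1)

def soft_update(target_net, online_net, eps=0.005):
    """Eq. (44): theta' <- eps * theta + (1 - eps) * theta'."""
    for p_t, p in zip(target_net.parameters(), online_net.parameters()):
        p_t.data.mul_(1.0 - eps).add_(eps * p.data)
```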
Fig. 5 Architecture of DDPG.
3.10 Twin-Delayed DDPG (TD3)

TD3 is built on DDPG to reduce function approximation error and improve stability [44]. It combines continuous-action double Q-learning, the policy gradient, and the actor-critic structure. The structural difference between DDPG and TD3 is that TD3 has two critics, containing two online networks (Q_1, Q_2) and two target networks (Q′_1, Q′_2). The expected target values are

y_1 = r + γ Q′_1(s′, ā),
y_2 = r + γ Q′_2(s′, ā),    (45)

where ā is the smoothed target action defined in Eq. (47) below. To resolve overestimation, TD3 takes the minimum of the two estimates as the expected target value:

y = r + γ min_{i=1,2} Q′_i(s′, ā).    (46)

One key improvement in TD3 is target policy smoothing, which acts as a regularizer against overfitting of the Q value by adding clipped noise ε to the target action:

ā = μ′(s′) + ε,  ε ∼ clip( N(0, σ), −c, c ),    (47)

where μ′ is the actor target network. Given the target value, the loss functions of the two critic networks are

L_1 = (1/N) Σ_i ( y_i − Q_1(s_i, a_i) )²,
L_2 = (1/N) Σ_i ( y_i − Q_2(s_i, a_i) )².    (48)

After backpropagating these losses and updating the two critics, the actor network is updated by gradient ascent using the output of the first critic:

∇_{θ^μ} J = (1/N) Σ_i [ ∇_a Q_1(s_i, μ(s_i)) ∇_{θ^μ} μ(s_i | θ^μ) ],    (49)

where θ^μ denotes the parameters of the online actor network. TD3 also uses the soft update

θ′_i ← ε θ_i + (1 − ε) θ′_i,
θ^{μ′} ← ε θ^μ + (1 − ε) θ^{μ′},    (50)

where θ′_i and θ_i are the parameters of the critic target and online networks, and θ^{μ′} is the parameter vector of the actor target network.
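Equations (46)-(47), i.e., target policy smoothing followed by the clipped double-Q minimum, can be sketched as follows; the noise scale and clip range are illustrative values, and the action is clamped to the [0, 1] range used by the EMS action in this paper.

```python
def td3_target(critic1_t, critic2_t, actor_t, r, s_next, done,
               gamma=0.9, sigma=0.2, c=0.5):
    """Eqs. (46)-(47): smoothed target action and clipped double-Q target."""
    with torch.no_grad():
        a0 = actor_t(s_next)
        noise = (torch.randn_like(a0) * sigma).clamp(-c, c)        # Eq. (47)
        a_bar = (a0 + noise).clamp(0.0, 1.0)                       # keep action in [0, 1]
        q_min = torch.min(critic1_t(s_next, a_bar),
                          critic2_t(s_next, a_bar)).squeeze(-1)
        return r + gamma * (1.0 - done) * q_min                    # Eq. (46)
```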
3.11 Trust Region Policy Optimization (TRPO)
The main idea of TRPO is to update the policy by taking the largest possible step while constraining the new policy to stay within an allowed distance from the old policy [45]; this constraint is expressed by the KL divergence. The agent interacts with the HEV environment to collect trajectories D = {s_0, a_0, …, a_{T−1}, s_T} using the policy network. Given the trajectories, the critic network outputs the advantage value

Â_t = −V(s_t) + r_t + γ r_{t+1} + ⋯ + γ^{T−t+1} r_{T−1} + γ^{T−t} V(s_T),    (51)

where t is the time index in [0, T] and V is the current value function. Based on the advantage value, the policy is updated as

θ_{k+1} = argmax_θ Ê_t [ (π_θ(a_t | s_t) / π_{θ_old}(a_t | s_t)) Â_t ],    (52)

s.t. Ê_t [ D_KL( π_θ(· | s_t) || π_{θ_old}(· | s_t) ) ] ≤ δ,    (53)

where π_{θ_old} is the old policy before the update and D_KL is the KL divergence. Because the theoretical TRPO update is complex, approximations based on a second-order Taylor expansion are used for faster learning, in which the loss and the KL divergence become

L(θ) ≈ g^T (θ − θ_k),
D̄_KL(θ || θ_k) ≈ (1/2) (θ − θ_k)^T H (θ − θ_k),  D̄_KL ≤ δ,    (54)

where g is the policy gradient and H measures the sensitivity of the KL divergence to the parameters θ. This quadratic problem is solved by

θ_{k+1} = θ_k + α^j sqrt( 2δ / (g^T H^{-1} g) ) H^{-1} g,    (55)

where α is the backtracking coefficient and j is the smallest non-negative integer that makes the policy π_{θ_{k+1}} satisfy the KL-divergence constraint. Because calculating and storing H^{-1} is difficult, the conjugate gradient algorithm is used to solve x = H^{-1} g, and θ_{k+1} is then expressed as

θ_{k+1} = θ_k + α^j sqrt( 2δ / (x_k^T H x_k) ) x_k.    (56)

The critic network is updated by the mean squared error

φ_{k+1} = argmin_φ E[ ( V(s_t | φ) − R_t )² ],    (57)

where φ_{k+1} is the parameter of the critic network.
3.12 The Proximal Policy Optimization Algorithm (PPO)

PPO has the same aim as TRPO: to take the largest possible policy improvement step without going so far that the policy collapses [46]. As shown in Fig. 6, the agent interacts with the HEV environment to collect trajectories D = {s_0, a_0, …, a_{T−1}, s_T} using the policy network. Given the trajectories, the critic network outputs the advantage value

Â_t = −V(s_t) + r_t + γ r_{t+1} + ⋯ + γ^{T−t+1} r_{T−1} + γ^{T−t} V(s_T),    (58)

where t is the time index in [0, T] and V is the current value function. Based on the advantage value, the clipped policy loss is

L(θ) = Ê_t [ min( (π_θ(a_t | s_t) / π_{θ_old}(a_t | s_t)) Â_t,  clip( π_θ(a_t | s_t) / π_{θ_old}(a_t | s_t), 1 − ε, 1 + ε ) Â_t ) ],    (59)

where π_θ and π_{θ_old} are the new and old policies and ε is a hyperparameter that keeps the probability ratio within the interval [1 − ε, 1 + ε] to control how far the new policy can move from the old one. This provides a first-order alternative to trust-region optimization: it prevents the agent from being too greedy toward actions with positive advantage and from discarding actions with negative advantage too quickly. The policy is then updated by stochastic gradient ascent,

θ_{k+1} = argmax_θ E[ L(θ) ].    (60)

The critic network is updated by the mean squared error

φ_{k+1} = argmin_φ E[ ( V(s_t | φ) − R_t )² ],    (61)

where φ_{k+1} is the parameter of the critic network.
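The clipped surrogate objective of Eq. (59) is shown below as a short PyTorch function operating on log-probabilities and advantages. It is a minimal sketch (the experiments rely on the Tianshou PPO implementation), returned as a loss to be minimized.

```python
def ppo_clip_loss(logp_new, logp_old, advantage, eps=0.2):
    """Clipped surrogate objective of Eq. (59)."""
    ratio = torch.exp(logp_new - logp_old)                    # pi_theta / pi_theta_old
    unclipped = ratio * advantage
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantage
    return -torch.min(unclipped, clipped).mean()              # ascend L(theta) = descend -L
```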
Fig. 6 Architecture of PPO.
3.13 The Soft Actor-Critic (SAC)

SAC is an off-policy algorithm that combines the actor-critic structure, the clipped double-Q method, and entropy regularization [47]. The agent interacts with the HEV environment to generate experience trajectories and stores them in the experience replay buffer, from which mini-batches D = {(s_i, a_i, r_i, s_{i+1})} are sampled randomly. Entropy regularization maximizes a weighted sum of the expected return and the policy entropy, which measures the randomness of the policy; the entropy term encourages exploration when it is needed and prevents the policy from collapsing to a deterministic policy too early. Based on the Bellman equation, the Q-function target is

y(s_i, a_i) = r_i + γ ( min_{j=1,2} Q_{μ,j}(s_{i+1}, ã_{i+1}) − α log π(ã_{i+1} | s_{i+1}) ),  ã_{i+1} ∼ π(· | s_{i+1}),    (62)

where μ denotes the target Q-network parameters and π is the policy. The Q functions are updated by gradient descent on

∇_{φ_i} (1/|D|) Σ_{(s_i,a_i,r_i,s_{i+1})∈D} ( Q_{φ,i}(s_i, a_i) − y(s_i, a_i) )²,  i = 1, 2,    (63)

where φ denotes the online Q-network parameters. The policy is updated by gradient ascent on

∇_θ (1/|D|) Σ_{s_i∈D} ( min_{i=1,2} Q_{φ,i}(s_i, ã_i) − α log π_θ(ã_i | s_i) ),  ã_i ∼ π_θ(· | s_i),    (64)

where θ is the policy network parameter. The target networks are updated with the soft update

Q_{μ,i} ← (1 − τ) Q_{φ,i} + τ Q_{μ,i}.    (65)
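The entropy-regularized target of Eq. (62) can be sketched as below. The `actor.sample` method is assumed to return an action sampled from π(·|s) together with its log-probability, which is how stochastic policies are commonly implemented; it is not necessarily the exact interface of the code used in this paper.

```python
def sac_target(critic1_t, critic2_t, actor, r, s_next, done, gamma=0.9, alpha=0.2):
    """Eq. (62): clipped double-Q bootstrap minus the entropy term."""
    with torch.no_grad():
        a_next, logp_next = actor.sample(s_next)        # a ~ pi(.|s'), log pi(a|s')
        q_min = torch.min(critic1_t(s_next, a_next),
                          critic2_t(s_next, a_next)).squeeze(-1)
        return r + gamma * (1.0 - done) * (q_min - alpha * logp_next)
```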
4. Simulation and Results

This section first presents how the whole simulation is set up, including the environment information, the neural network settings, and the learning parameters of the algorithms. The result comparison is then divided into three parts: the first is based on the continuous action space using different DRL algorithms, the second is based on the discrete action space with different numbers of discrete actions using DQN, and the third is based on the discrete action space using different DRL algorithms. In each part, the algorithms are trained in the same environment for the same number of episodes, the training curves are shown, and the key data recorded during training are listed in tables. The performances of the algorithms are compared with each other in terms of computation time, reward, and fuel economy. Finally, the best DRL algorithm of each action space type is compared with the baseline EMSs.
4.1 Simulation setup

The vehicle environment for the simulation is an OpenAI Gym-like environment and shares its function structure with a Gym environment. There are two versions of the vehicle environment that differ only in the action space: one has a discrete action space and the other a continuous action space. Tianshou [48] is utilized for the DRL algorithms. To achieve a fair comparison, the same hyperparameters (buffer size 20000, hidden layers 128×128×128×128, batch size 64, learning rate 0.001, discount factor 0.9, etc.) are used for the neural network design. The computer used in this simulation has an NVIDIA GeForce RTX 2070 GPU with Max-Q Design, an Intel Core i7-9750H CPU @ 2.60 GHz, and 16.0 GB of RAM. The code of the paper is open-sourced at https://github.com/LittleWebCat/DRL-Base-EMS.

25
26 Fig. 7 depicts the track of the reward along the training process using four DRL algorithms in continuous action space. Each
27
curve is smoothed with a moving average of 10. The computation time is the time consumed during the entire learning process.
28
29 The convergence reward is the mean reward after convergence. The convergence episode is the episode when the reward reaches
30
convergence reward. The convergence time is the time when the agent reaches the convergence episode. The final test reward and
31
32 final test miles per gallon equivalent (MPGE) is the test result of three different driving cycles (UDDS, WLTP, HWFET) using
33 the best performance agent saved during the learning process in WLTP. MPGE considers the difference of initial and final SOC
34
35 of the battery by converting the SOC difference to gasoline using 33.7 kWh per gallon ratio provided by EPA [49]. These results
36 reveal that all fiveour continuous DRL-based EMS converge within 1050 episodes. The starting point for the training is different
37
38 since the initial parameters for the training are stochastic. The comparison of the result shows that PPO, TRPO and TD3 have
39 higher learning speeds at the beginning of the training. The reward of SAC changes slowly compared with the rest of the algorithms.
40
41 As we can see from Table 2, SAC has the highest convergence reward. However, the convergence speed of SAC is much slower
42 than other DRL algorithms. PPO has the fastest convergence speed compared to other algorithms. PPO requires 72.9%, 79.0%,
43
44 95.1% and 91.3% less time consumption than TRPO, TD3, SAC and DDPG, respectively, to reach convergence. SAC’s
45 convergence reward is 0.50%, 0.06%, 0.09%, 0.04% higher than PPO, TRPO, TD3, DDPG respectively. We also can notice that
46
47 TD3’s learning curves are more stable than SAC, DDPG, PPO and TRPO after achieving convergence according to its flatness.
48 For training the same total episode, PPO saves 38.4%, 34.4%, 57.6% and 27.9% time than TRPO, TD3, SAC and DDPG,
49
50 respectively. Also, from the table record, SAC has the highest test reward and MPGE value in WLTP and UDDS. SAC’s test
51 reward is 10.9%, 15.7%, 0.40%, 2.48% higher than PPO, TRPO, TD3, DDPG respectively in WLTP.
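The SOC-corrected MPGE metric described above can be computed as in the sketch below. The 33.7 kWh-per-gallon conversion and the 2 kWh pack follow the text and Table 1; the example numbers (cycle distance, fuel use, SOC drop) are purely illustrative.

```python
def mpge(distance_miles, fuel_gallons, soc_initial, soc_final,
         battery_kwh=2.0, kwh_per_gallon=33.7):
    """Miles per gallon equivalent with the battery SOC difference converted to gasoline."""
    soc_gallons = (soc_initial - soc_final) * battery_kwh / kwh_per_gallon
    return distance_miles / (fuel_gallons + soc_gallons)

# Example (illustrative numbers): 7.45-mile UDDS cycle, 0.16 gal of fuel, 5% SOC drop
print(mpge(distance_miles=7.45, fuel_gallons=0.16, soc_initial=0.65, soc_final=0.60))
```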
Fig. 7 Comparison of DRL algorithms in continuous action space.

Table 2 Parameters of the DRL algorithms in continuous action space.

Agent   Computation time (s)   Convergence reward   Convergence episode   Convergence time (s)   UDDS test reward   UDDS test MPGE   WLTP test reward   WLTP test MPGE   HWFET test reward   HWFET test MPGE
PPO     866    -1412.2   11   39    -774.1   32.57   -1573.3   30.93   -932.2   37.11
TRPO    1406   -1406.1   21   144   -782.2   32.20   -1662.0   29.23   -933.6   37.01
TD3     1321   -1406.4   31   186   -602.7   41.84   -1406.8   34.56   -931.2   37.16
SAC     2042   -1405.2   74   797   -601.5   41.94   -1401.2   34.68   -932.6   37.10
DDPG    1201   -1405.7   75   446   -605.8   41.62   -1436.8   33.81   -934.2   37.03
For DRL with a discrete action space, the action discretization is important and thus worth investigating. Fig. 8 shows the learning curves of DQN under different discrete action spaces containing 50, 200, 2000, and 5000 actions, respectively. Each curve is smoothed with a moving average of 10. According to the figure, DQN needs the fewest episodes to reach convergence with action space 50 and the most with action space 5000. As shown in Table 3, the total computation time does not differ much among action spaces 50, 200, 2000, and 5000; however, the time needed to reach convergence with action space 50 is much shorter than with action spaces 200, 2000, and 5000. The final test reward of DQN with action space 2000 is the highest, 1.11%, 1.04%, and 0.09% higher than with action spaces 50, 200, and 5000, respectively, in WLTP. DQN with action space 2000 also has the best fuel economy of 35.01 MPGE. In the following analysis, 2000 is therefore selected as the action space discretization.
Fig. 8 Comparison of DQN in different discrete action spaces.

Table 3 Parameters of the DQN algorithm in different discrete action spaces.

Action space   Computation time (s)   Convergence reward   Convergence episode   Convergence time (s)   UDDS test reward   UDDS test MPGE   WLTP test reward   WLTP test MPGE   HWFET test reward   HWFET test MPGE
50     1045   -1408   6    30    -599.2   42.09   -1404.9   34.58   -933.8   37.04
200    1041   -1412   28   146   -600.0   42.03   -1403.9   34.63   -932.9   37.09
2000   1067   -1402   31   155   -572.5   44.14   -1389.3   35.01   -932.5   37.11
5000   1074   -1415   77   528   -546.9   46.24   -1390.6   35.01   -932.6   37.10
Fig. 9 depicts the reward along the training process for the DRL algorithms with discrete action space, including DQN, Categorical DQN (C51), QR-DQN, IQN, FQF, Rainbow DQN, SAC, Double DQN (D2QN), and Dueling Double DQN (D3QN). Each curve is smoothed with a moving average of 10. As shown in Table 4, for the same total number of episodes, DQN, D2QN, and IQN have similar total computation times; D3QN, SAC, and C51 have similar computation times; and QR-DQN and Rainbow have similar computation times. FQF has the best convergence reward; however, its convergence is much slower than that of DQN, D2QN, D3QN, IQN, SAC, and C51. The DRL algorithms differ considerably in convergence speed: DQN reaches convergence fastest, in 179 s, similar to D2QN and IQN, slightly quicker than SAC, D3QN, C51, and QR-DQN, and remarkably quicker than FQF and Rainbow. Although IQN, SAC, DQN, D2QN, and D3QN converge much faster, their convergence rewards are lower than those of FQF and Rainbow. D3QN is more stable than DQN and D2QN during the learning process; its training curve finally settles around an average reward of −1406.1, which is the most stable learning curve among all the algorithms. To reach convergence, DQN saves 3.2%, 40.7%, 43.5%, 17.1%, 55.1%, 45.4%, 65.8%, and 67.4% of the computation time compared with D2QN, D3QN, SAC, IQN, C51, QR-DQN, FQF, and Rainbow, respectively. FQF has the best convergence reward, which is 0.14%, 0.28%, and 0.07% higher than that of QR-DQN, C51, and Rainbow, 1.34%, 0.21%, and 0.14% higher than that of DQN, D2QN, and D3QN, and 0.35% and 0.64% higher than that of IQN and SAC, respectively. For training over the same total number of episodes, DQN saves 12.0%, 27.4%, 3.4%, 33.4%, 43.4%, 17.2%, 51.4%, and 56.8% of the time compared with IQN, SAC, D2QN, D3QN, FQF, C51, QR-DQN, and Rainbow, respectively. FQF also has the best MPGE performance (35.49) among all the algorithms in WLTP.
Fig. 9 Comparison of DRL algorithms in discrete action space.

Table 4 Parameters of the DRL algorithms in discrete action space.

Agent     Computation time (s)   Convergence reward   Convergence episode   Convergence time (s)   UDDS test reward   UDDS test MPGE   WLTP test reward   WLTP test MPGE   HWFET test reward   HWFET test MPGE
IQN       1167   -1409   35   216   -586.8   43.00   -1402.5   34.65   -931.2   37.16
SAC       1414   -1413   48   317   -574.7   43.94   -1386.7   35.06   -932.9   37.08
DQN       1027   -1423   31   179   -572.7   44.14   -1389.3   35.01   -932.6   37.10
D2QN      1063   -1407   35   185   -575.0   43.92   -1396.3   34.82   -931.6   37.13
D3QN      1542   -1406   46   302   -602.8   41.84   -1406.1   34.56   -930.6   37.18
FQF       1813   -1404   56   523   -572.6   44.16   -1371.6   35.49   -934.4   37.03
C51       1241   -1408   63   399   -578.9   43.61   -1378.7   35.27   -933.7   37.05
QR-DQN    2113   -1406   32   328   -583.9   43.27   -1372.2   35.45   -933.9   37.05
Rainbow   2378   -1405   60   549   -602.1   41.90   -1398.8   34.77   -932.4   37.11
4.3 DRL and baseline methods comparison

Besides the comparison among the DRL algorithms, a comparison against conventional EMSs is also important to show the difference between DRL and conventional approaches. In this section, two popular conventional EMSs, a rule-based strategy and ECMS, are applied, and the comparison is shown in Fig. 10. Fig. 10(a) shows the vehicle speed during one complete UDDS driving cycle. Comparing Fig. 10(a) with Fig. 10(b) and Fig. 10(c), the vehicle speed follows the engine and battery power outputs: as the engine and battery power increase, the speed increases, whereas when the vehicle brakes to reduce speed, the engine power decreases and the battery is charged with the energy recovered during braking. Comparing Fig. 10(c) and Fig. 10(d), the engine power has the same shape as the engine fuel rate. Comparing Fig. 10(c) and Fig. 10(e), the engine power and the negative part of the battery power have opposite shapes, which means the battery is also charged while the engine is delivering power. Fig. 10(f) shows the accumulated fuel over one complete driving cycle: the rule-based agent consumes 521.6 g, the ECMS agent 494.5 g, the SAC agent 478.5 g, and the FQF agent 468.1 g of fuel. The corresponding fuel economies of the four EMSs are 38.2 MPGE, 40.3 MPGE, 41.9 MPGE, and 44.2 MPGE, respectively. The FQF agent has the lowest accumulated fuel consumption: SAC's fuel consumption is 8.26% and 3.24% lower than that of the rule-based strategy and ECMS, and FQF's fuel consumption is 10.26% and 5.34% lower than that of the rule-based strategy and ECMS, respectively.
Fig. 10 Comparison of SAC, FQF, rule-based, and ECMS EMSs in the UDDS test.
5. Discussion

Based on the fuel economy comparison, all 13 DRL-based EMSs show higher MPGE than the rule-based and ECMS methods. This observation is encouraging and indicates the better optimality of DRL methods compared with the conventional methods. One interesting observation is that seven out of the nine DRL algorithms with discrete action space exhibit higher rewards than all five DRL algorithms with continuous action space. Even though DRL algorithms are general, their performance is application dependent; this observation shows that the HEV EMS application favors DRL algorithms with discrete action space, even though some other applications favor continuous action spaces. In addition, algorithms that perform well in the OpenAI Gym environments do not guarantee good performance in the HEV EMS environment, Rainbow and C51 being examples. Based on the analysis of the results, FQF is recommended for future research on the HEV EMS application because of its high reward. If computation cost is also considered, DQN is the first option, as it requires only about 57% of FQF's total computation time (1027 s versus 1813 s).

Even though DRL algorithms show promising reward optimization performance, some issues exist and require researchers' attention. For example, the learning curve fluctuates considerably, which leads to uncertainty in the final model obtained after the entire learning process. It is usually suggested to run the learning process multiple times and average the reward across the runs when plotting the learning curve; however, this only smooths the learning curve and makes it look better, while the uncertainty of the final model is not addressed. One way to reduce this uncertainty, adopted in this paper, is to replace the final model with the best model saved during learning, i.e., the one that generates the highest reward. However, even though that model generates the best reward, the result may stem from fortunate exploration and thus lack generality when the environment conditions shift slightly (i.e., it may lack robustness and adaptiveness to environment variation).
6. Conclusion
This paper introduces and benchmarks 13 popular deep reinforcement learning (DRL) algorithms for the HEV EMS application. DRL algorithms with both discrete and continuous action spaces are considered and compared. In the continuous action space, SAC has the highest MPGE; PPO takes the least time to finish the training, and SAC has the highest test reward. According to the training curves, SAC's test reward is 10.9%, 15.7%, 0.40%, and 2.48% higher than that of PPO, TRPO, TD3, and DDPG, respectively, in the WLTP test, and PPO saves 38.4%, 34.4%, 57.6%, and 27.9% of the training time compared with TRPO, TD3, SAC, and DDPG, respectively. In the discrete action space, DQN has the lowest time cost and FQF has the best test reward performance. DQN saves 12.0%, 27.4%, 3.4%, 33.4%, 43.4%, 17.2%, 51.4%, and 56.8% of the time compared with IQN, SAC, D2QN, D3QN, FQF, C51, QR-DQN, and Rainbow, respectively. FQF has the best convergence reward, which is 0.14%, 0.28%, and 0.07% higher than that of QR-DQN, C51, and Rainbow, 1.34%, 0.21%, and 0.14% higher than that of DQN, D2QN, and D3QN, and 0.35% and 0.64% higher than that of IQN and SAC, respectively. The test rewards of IQN, SAC, DQN, D2QN, D3QN, FQF, and QR-DQN are higher than those of PPO, TD3, DDPG, and TRPO. Comparing SAC, FQF, the rule-based strategy, and ECMS, the fuel consumption of SAC is 8.26% and 3.24% less than that of the rule-based strategy and ECMS, respectively, while the fuel consumption of FQF is 10.26% and 5.34% less than that of the rule-based strategy and ECMS, respectively.

Future work will focus on improving the learning efficiency and robustness. The actions in the discrete action space are discretized and may lead to strong oscillation after a steady operating condition is reached; an optimization of the neural networks of the DRL algorithms is therefore needed to obtain stable performance. In addition, a test on a real vehicle will be implemented in the future.
27
28
29 Reference
30 [1] B. M. Al-Alawi and T. H. Bradley, “Review of hybrid, plug-in hybrid, and electric vehicle market modeling Studies,”
31 Renewable and Sustainable Energy Reviews, vol. 21, pp. 190–203, May 2013, doi: 10.1016/j.rser.2012.12.048.
32
[2] G. Jinquan, H. Hongwen, P. Jiankun, and Z. Nana, “A novel MPC-based adaptive energy management strategy in plug-
33 in hybrid electric vehicles,” Energy, vol. 175, pp. 378–392, May 2019, doi: 10.1016/j.energy.2019.03.083.
34
[3] J. Guo, H. He, and C. Sun, “ARIMA-Based Road Gradient and Vehicle Velocity Prediction for Hybrid Electric Vehicle
35 Energy Management,” IEEE Transactions on Vehicular Technology, vol. 68, no. 6, pp. 5309–5320, Jun. 2019, doi:
36 10.1109/TVT.2019.2912893.
37
[4] F. Malmir, B. Xu, and Z. Filipi, “A Heuristic Supervisory Controller for a 48V Hybrid Electric Vehicle Considering
38 Fuel Economy and Battery Aging,” SAE Technical Papers, Dec. 2018, doi: 10.4271/2019-01-0079.
39
[5] P. Pisu and G. Rizzoni, “A Comparative Study Of Supervisory Control Strategies for Hybrid Electric Vehicles,” IEEE
40 Transactions on Control Systems Technology, vol. 15, no. 3, pp. 506–518, 2007, doi: 10.1109/TCST.2007.894649.
41
42 [6] H. A. Borhan, A. Vahidi, A. M. Phillips, M. L. Kuang, and I. V. Kolmanovsky, “Predictive energy management of a
power-split hybrid electric vehicle,” in 2009 American Control Conference, Jun. 2009, pp. 3970–3976. doi:
43 10.1109/ACC.2009.5160451.
44
[7] L. V. Pérez, G. R. Bossio, D. Moitre, and G. O. García, “Optimization of power management in an hybrid electric
45 vehicle using dynamic programming,” Mathematics and Computers in Simulation, vol. 73, no. 1–4, pp. 244–254, Nov. 2006,
46 doi: 10.1016/j.matcom.2006.06.016.
47
[8] F. Allgöwer and A. Zheng, Nonlinear Model Predictive Control. Birkhäuser, 2012.
48
49 [9] B. Xu et al., “Parametric study on reinforcement learning optimized energy management strategy for a hybrid electric
vehicle,” Applied Energy, vol. 259, p. 114200, Nov. 2019, doi: 10.1016/j.apenergy.2019.114200.
50
51 [10] X. Lin, Y. Wang, P. Bogdan, N. Chang, and M. Pedram, “Reinforcement learning based power management for hybrid
52 electric vehicles,” in 2014 IEEE/ACM International Conference on Computer-Aided Design (ICCAD), 2014, pp. 33–38. doi:
10.1109/ICCAD.2014.7001326.
53
10 [11] X. Qi, G. Wu, K. Boriboonsomsin, M. J. Barth, and J. Gonder, “Data-Driven Reinforcement Learning–Based Real-
11 Time Energy Management System for Plug-In Hybrid Electric Vehicles,” Transportation Research Record, vol. 2572, no. 1, pp.
12 1–8, Jan. 2016, doi: 10.3141/2572-01.
13 [12] C. Liu and Y. L. Murphey, “Power management for Plug-in Hybrid Electric Vehicles using Reinforcement Learning
14 with trip information,” in 2014 IEEE Transportation Electrification Conference and Expo (ITEC), Jun. 2014, pp. 1–6. doi:
10.1109/ITEC.2014.6861862.
15
16 [13] T. Liu, X. Hu, S. E. Li, and D. Cao, “Reinforcement Learning Optimized Look-Ahead Energy Management of a Parallel
17 Hybrid Electric Vehicle,” IEEE/ASME Transactions on Mechatronics, vol. 22, no. 4, pp. 1497–1507, 2017, doi:
10.1109/TMECH.2017.2707338.
18
19 [14] Y. Zou, T. Liu, D. Liu, and F. Sun, “Reinforcement learning-based real-time energy management for a hybrid tracked
20 vehicle,” Applied Energy, vol. 171, pp. 372–382, Jun. 2016, doi: 10.1016/j.apenergy.2016.03.082.

21 [15] R. Xiong, J. Cao, and Q. Yu, “Reinforcement learning-based real-time power management for hybrid energy storage
22 system in the plug-in hybrid electric vehicle,” Applied Energy, vol. 211, pp. 538–548, Feb. 2018, doi:
10.1016/j.apenergy.2017.11.072.
23
24 [16] G. Du, Y. Zou, X. Zhang, Z. Kong, J. Wu, and D. He, “Intelligent energy management for hybrid electric tracked
25 vehicles using online reinforcement learning,” Applied Energy, vol. 251, p. 113388, 2019, doi: 10.1016/j.apenergy.2019.113388.
26 [17] Y. Hu, W. Li, H. Xu, and G. Xu, “An Online Learning Control Strategy for Hybrid Electric Vehicle Based on Fuzzy Q-
27 Learning,” Energies, vol. 8, no. 10, Art. no. 10, Oct. 2015, doi: 10.3390/en81011167.
28 [18] J. Wu, Y. Zou, X. Zhang, T. Liu, Z. Kong, and D. He, “An Online Correction Predictive EMS for a Hybrid Electric
29 Tracked Vehicle Based on Dynamic Programming and Reinforcement Learning,” IEEE Access, vol. 7, pp. 98252–98266, 2019,
30 doi: 10.1109/ACCESS.2019.2926203.
31 [19] B. Xu, F. Malmir, and Z. Filipi, “Real-Time Reinforcement Learning Optimized Energy Management for a 48V Mild
32 Hybrid Electric Vehicle,” SAE Technical Papers, 2019, doi: 10.4271/2019-01-1208.
33 [20] P. Wang, Y. Li, S. Shekhar, and W. F. Northrop, “Actor-Critic based Deep Reinforcement Learning Framework for
34 Energy Management of Extended Range Electric Delivery Vehicles,” in 2019 IEEE/ASME International Conference on
35 Advanced Intelligent Mechatronics (AIM), Jul. 2019, pp. 1379–1384. doi: 10.1109/AIM.2019.8868667.
36 [21] Z. Zhu, Y. Liu, and M. Canova, “Energy Management of Hybrid Electric Vehicles via Deep Q-Networks,” in 2020
37 American Control Conference (ACC), Jul. 2020, pp. 3077–3082. doi: 10.23919/ACC45564.2020.9147479.
38 [22] X. Qi, Y. Luo, G. Wu, K. Boriboonsomsin, and M. Barth, “Deep reinforcement learning enabled self-learning control
39 for energy efficient driving,” Transportation Research Part C: Emerging Technologies, vol. 99, pp. 67–81, Feb. 2019, doi:
40 10.1016/j.trc.2018.12.018.
41 [23] Y. Wu, H. Tan, J. Peng, H. Zhang, and H. He, “Deep reinforcement learning of energy management with continuous
42 control strategy and traffic information for a series-parallel plug-in hybrid electric bus,” Applied Energy, vol. 247, pp. 454–466,
43 Aug. 2019, doi: 10.1016/j.apenergy.2019.04.021.
44 [24] S. Inuzuka, F. Xu, B. Zhang, and T. Shen, “Reinforcement Learning Based on Energy Management Strategy for
45 HEVs,” in 2019 IEEE Vehicle Power and Propulsion Conference (VPPC), 2019, pp. 1–6. doi:
10.1109/VPPC46532.2019.8952511.
46
47 [25] J. Wu, Z. Wei, W. Li, Y. Wang, Y. Li, and D. U. Sauer, “Battery Thermal- and Health-Constrained Energy
48 Management for Hybrid Electric Bus Based on Soft Actor-Critic DRL Algorithm,” IEEE Transactions on Industrial Informatics,
vol. 17, no. 6, pp. 3751–3761, Jun. 2021, doi: 10.1109/TII.2020.3014599.
49
50 [26] J. Zhou, S. Xue, Y. Xue, Y. Liao, J. Liu, and W. Zhao, “A novel energy management strategy of hybrid electric vehicle
via an improved TD3 deep reinforcement learning,” Energy, vol. 224, p. 120118, Jun. 2021, doi: 10.1016/j.energy.2021.120118.
51
52 [27] A. Brooker, J. Gonder, L. Wang, E. Wood, S. Lopp, and L. Ramroth, “FASTSim: A model to estimate vehicle
53 efficiency, cost and performance,” National Renewable Energy Lab.(NREL), Golden, CO (United States), 2015.
10 [28] C. Baker, M. Moniot, A. Brooker, L. Wang, E. Wood, and J. Gonder, “Future Automotive Systems Technology
11 Simulator (FASTSim) Validation Report – 2021,” Renewable Energy, p. 101, 2021.
[29] V. Mnih et al., “Playing Atari with Deep Reinforcement Learning,” arXiv preprint arXiv:1312.5602, 2013.
13
[30] V. Mnih et al., “Human-level control through deep reinforcement learning,” Nature, vol. 518, no. 7540, pp. 529–533,
14 Feb. 2015, doi: 10.1038/nature14236.
15
16 [31] T. Schaul, J. Quan, I. Antonoglou, and D. Silver, “Prioritized Experience Replay,” arXiv:1511.05952 [cs], Feb. 2016,
Accessed: Apr. 30, 2022. [Online]. Available: http://arxiv.org/abs/1511.05952
17
18 [32] M. Andrychowicz et al., “Hindsight Experience Replay,” in Advances in Neural Information Processing Systems, 2017,
vol. 30. Accessed: May 17, 2022. [Online]. Available:
19 https://proceedings.neurips.cc/paper/2017/hash/453fadbd8a1a3af50a9df4df899537b5-Abstract.html
20
21 [33] H. Hasselt, “Double Q-learning,” in Advances in Neural Information Processing Systems, 2010, vol. 23. Accessed: Apr.
30, 2022. [Online]. Available: https://proceedings.neurips.cc/paper/2010/hash/091d584fced301b442654dd8c23b3fc9-
22 Abstract.html
23
[34] Z. Wang, T. Schaul, M. Hessel, H. van Hasselt, M. Lanctot, and N. de Freitas, “Dueling Network Architectures for
24
Deep Reinforcement Learning,” arXiv:1511.06581 [cs], Apr. 2016, Accessed: Jan. 28, 2022. [Online]. Available:
25 http://arxiv.org/abs/1511.06581
26
[35] M. G. Bellemare, W. Dabney, and R. Munos, “A Distributional Perspective on Reinforcement Learning.” arXiv, Jul. 21,
27 2017. Accessed: Jul. 17, 2022. [Online]. Available: http://arxiv.org/abs/1707.06887
28
29 [36] M. Hessel et al., “Rainbow: Combining Improvements in Deep Reinforcement Learning,” arXiv:1710.02298 [cs], Oct.
2017, Accessed: Jan. 28, 2022. [Online]. Available: http://arxiv.org/abs/1710.02298
30
31 [37] S. Sutton, “Predicting and Explaining Intentions and Behavior: How Well Are We Doing?,” Journal of Applied Social
Psychology, vol. 28, no. 15, pp. 1317–1338, 1998, doi: 10.1111/j.1559-1816.1998.tb01679.x.
32
33 [38] R. S. Sutton and A. G. Barto, Reinforcement Learning, second edition: An Introduction. MIT Press, 2018.
34 [39] M. Fortunato et al., “Noisy Networks for Exploration,” arXiv:1706.10295 [cs, stat], Jul. 2019, Accessed: Apr. 30, 2022.
35 [Online]. Available: http://arxiv.org/abs/1706.10295
36 [40] W. Dabney, M. Rowland, M. G. Bellemare, and R. Munos, “Distributional Reinforcement Learning with Quantile
37 Regression,” arXiv:1710.10044 [cs, stat], Oct. 2017, Accessed: Feb. 01, 2022. [Online]. Available:
38 http://arxiv.org/abs/1710.10044
39 [41] W. Dabney, G. Ostrovski, D. Silver, and R. Munos, “Implicit Quantile Networks for Distributional Reinforcement
40 Learning,” in Proceedings of the 35th International Conference on Machine Learning, Jul. 2018, pp. 1096–1105. Accessed: Feb.
41 05, 2022. [Online]. Available: https://proceedings.mlr.press/v80/dabney18a.html
42 [42] D. Yang, L. Zhao, Z. Lin, T. Qin, J. Bian, and T.-Y. Liu, “Fully Parameterized Quantile Function for Distributional
43 Reinforcement Learning,” in Advances in Neural Information Processing Systems, 2019, vol. 32. Accessed: Apr. 29, 2022.
44 [Online]. Available: https://proceedings.neurips.cc/paper/2019/hash/f471223d1a1614b58a7dc45c9d01df19-Abstract.html
45 [43] T. P. Lillicrap et al., “Continuous control with deep reinforcement learning.” arXiv, Jul. 05, 2019. Accessed: Jul. 17,
46 2022. [Online]. Available: http://arxiv.org/abs/1509.02971
47 [44] S. Fujimoto, H. van Hoof, and D. Meger, “Addressing Function Approximation Error in Actor-Critic Methods,”
48 arXiv:1802.09477 [cs, stat], Oct. 2018, Accessed: Apr. 29, 2022. [Online]. Available: http://arxiv.org/abs/1802.09477
49 [45] J. Schulman, S. Levine, P. Moritz, M. I. Jordan, and P. Abbeel, “Trust Region Policy Optimization.” arXiv, Apr. 20,
50 2017. Accessed: Jul. 17, 2022. [Online]. Available: http://arxiv.org/abs/1502.05477
51 [46] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, “Proximal Policy Optimization Algorithms.” arXiv,
52 Aug. 28, 2017. Accessed: Jul. 17, 2022. [Online]. Available: http://arxiv.org/abs/1707.06347
53
10 [47] T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine, “Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement
11 Learning with a Stochastic Actor.” arXiv, Aug. 08, 2018. Accessed: Jul. 17, 2022. [Online]. Available:
12 http://arxiv.org/abs/1801.01290
13 [48] J. Weng et al., “Tianshou: a Highly Modularized Deep Reinforcement Learning Library.” arXiv, Sep. 22, 2021.
14 Accessed: Jul. 26, 2022. [Online]. Available: http://arxiv.org/abs/2107.14171
15 [49] O. US EPA, “Technology | Green Vehicle Guide.” https://www3.epa.gov/otaq/gvg/learn-more-technology.htm
16 (accessed Oct. 28, 2022).
Highlights

• A propulsion model is designed for a hybrid electric vehicle in Python, and the entire code is shared on GitHub.
• 13 popular DRL algorithms are introduced, and both discrete and continuous action space algorithms are considered.
• Key measures, including convergence reward, convergence episode, convergence time, and final test results, are compared.
• The best DRL algorithms of each action space type are compared with baseline EMSs.