1 s2.0 S0360544222017091 Main

Energy 258 (2022) 124806
Contents lists available at ScienceDirect
Energy
journal homepage: www.elsevier.com/locate/energy
High robustness energy management strategy of hybrid electric vehicle

based on improved soft actor-critic deep reinforcement learning
Wenjing Sun, Yuan Zou *, Xudong Zhang, Ningyuan Guo, Bin Zhang, Guodong Du
National Engineering Laboratory for Electric Vehicles, School of Mechanical Engineering, and Collaborative Innovation Center of Electric Vehicles in Beijing, Beijing
Institute of Technology, 100081, Beijing, China
A R T I C L E I N F O A B S T R A C T
Keywords: As a hybrid electric vehicle (HEV) key control technology, intelligent energy management strategies (EMSs)
Energy management strategy directly affect fuel consumption. Investigating the robustness of EMSs to maximize the advantages of energy
Deep reinforcement learning savings and emission reduction in different driving environments is necessary. This article proposes a soft actor-
Soft actor critic
critic (SAC) deep reinforcement learning (DRL) EMS for hybrid electric tracked vehicles (HETVs). Munchausen
Munchausen reinforcement learning
Prioritized experience replay
reinforcement learning (MRL) is adopted in the SAC algorithm, and the Munchausen SAC (MSAC) algorithm is
constructed to achieve lower fuel consumption than the traditional SAC method. The prioritized experience
replay (PER) is proposed to achieve more reasonable experience sampling and improve the optimization effect.
To enhance the “cold start” performance, a dynamic programming (DP)-assisted training method is proposed that
substantially improves the training efficiency. The proposed method optimization result is compared with the
traditional SAC and deep deterministic policy gradient (DDPG) with PER through the simulation. The result
shows that the proposed strategy improves both fuel consumption and possesses excellent robustness under
different driving cycles.
1. Introduction expandability [7]. Optimization-based methods use the optimization

control algorithm to find the optimal solution that minimizes the
1.1. Literature review objective function. Optimization-based methods based on global opti
mization primarily include dynamic programming (DP) [8,9], Pon
HEVs powered by internal combustion engines (ICEs) and batteries tryagin’s minimum principle (PMP) [10,11], and the genetic algorithm
combine the advantages of both ICE vehicles and electric vehicles and (GA) [12]. These methods require knowing the driving cycles in
are an effective solution to achieve energy conservation and emission advance, and the calculation cost is high. In actual driving, the driving
reduction [1]. An energy management strategy (EMS), as the most cycle often changes dynamically, so these methods are used as the
crucial aspect of HEV energy distribution control of two power sources, benchmark [7]. To carry out real-time optimization, the equivalent
plays an important role in reducing fuel consumption and improving consumption minimization strategy (ECMS) [13,14] and model predic
comprehensive performance [2]. EMS has become an important topic in tion control (MPC) [15–17] have been the main research focus in recent
the efficient powertrain design domain. decades. The essence of an ECMS is to convert battery energy con
With the iteration and evolution of technology, energy management sumption into the engine’s equivalent fuel consumption [13]. ECMS
strategies gradually develop into two categories: rule-based methods converts the global optimization problem into a local problem, which
and optimization-based methods [3]. Rule-based methods, such as improves the calculation efficiency [18]. To exploit ECMSs, a key
blended charge depletion (BCD) [4], charge-depleting and challenge is to properly determine the equivalent factor (EF) [19].
charge-sustaining (CD/CS) [5], and fuzzy logic control [6], set fixed However, developing an adaptive equivalent factor in an ECMS to obtain
thresholds based on experimental experiences to switch engine and the flexibility of different driving cycles is still challenging [20]. MPC
battery usage. It is simple to design and easy to implement, but it relies has been widely studied due to its predictive characteristics and high
too much on expert experience and does not have portability and robustness, but the model dependence problem still limits the further
* Corresponding author.
E-mail addresses: sunw-enjing@163.com (W. Sun), zouyuanbit@vip163.com (Y. Zou), xudong.zhang@bit.edu.cn (X. Zhang), gny123@qq.com (N. Guo),
guodongdu_robbie@163.com (G. Du).
https://doi.org/10.1016/j.energy.2022.124806
Received 19 March 2022; Received in revised form 15 June 2022; Accepted 11 July 2022
Available online 14 July 2022
0360-5442/© 2022 Published by Elsevier Ltd.
W. Sun et al. Energy 258 (2022) 124806
algorithms. The policy network of deterministic algorithms outputs a

definite action for each state. These algorithms easily fall into a local
optimum in a random environment. Compared with the deterministic
algorithm, the stochastic algorithm is more robust because the policy
network outputs a probability distribution of action and encourages
exploration. As a representative stochastic algorithm, SAC-based EMS is
studied in Ref. [29]. With an emphasized consciousness of both thermal
safety and degradation of the onboard lithium-ion battery system, this
EMS improved the fuel economy by up to 23.3% compared to the
discrete-action DQL strategy. However, this paper does not further
improve the optimization effect of the algorithm or further explore the
robustness.
Through the above studies, DRL-based EMSs are restricted by two
constraints. (1) Poor real-time performance. Too short of a training time
prevents the neural network from converging, or the training to
converge takes too long; and (2) The robustness is uncertain. When there
are errors between the actual cycle and the training cycle, DRL-based
Fig. 1. Topology of series HETV.
EMS often fail to achieve the desired outcomes. With the development
of vehicle-to-everything (V2X) technology, many studies assume that all
Table 1 or most of the information about the future driving cycle can be obtained
Specification of reference hybrid electric vehicle. [30]. The EMS is trained offline with predicted driving cycles before it is
Parameters Values
ready for online applications. DRL-based EMSs have high application
potential under this assumption [31]. Hence, how to address the above
Total weight m 1500 kg
two defects and obtain a DRL-based EMS with high computational ef
Rolling resistance coefficient f 0.0494
Air resistance coefficient Cd 0.9 ficiency, good optimization effect, and strong robustness is crucial and
Frontal area A 2.94 m2 deserves to be further explored.
Transmission efficiency η 0.97
Equivalent electromotive force coefficient 1.6080 1.2. Motivation and innovation
Equivalent impedance coefficient 0.0098
Moment of inertia of the engine 0.22 kg⋅m2
Moment of inertia of the generator 0.2 kg⋅m2 Compared with existing studies, the present paper encompasses four
Battery capacity 45 Ah perspectives that may contribute to relevant research.
First, a state-of-the-art highly robust SAC-based DRL strategy is
deployed for series HETV energy management. To ensure robustness,
improvement of its optimization effect. the strategy makes a trade-off between exploitation and exploration
With the development of artificial intelligence and machine learning, using the information entropy model.
reinforcement learning (RL) has become a new approach to solve opti Second, the SAC algorithm is combined with the Munchausen
mization problems. Liu used Q-learning and Dyna algorithms to solve Reinforcement learning (MRL) method to establish the MSAC control
the energy management problem in Ref. [21]. However, the discrete framework for the first time. The MSAC algorithm can use the current
features make it difficult for RL to handle continuous control problems. strategy to bootstrap and improve control strategy optimization.
Therefore, deep neural networks with strong nonlinear fitting ability are Third, to achieve more reasonable experience sampling and improve
imported to solve this problem. This type of method is deep reinforce the optimization effect, prioritized experience replay (PER) is proposed
ment learning (DRL). The early DRL method is an enhancement of in this control framework.
Q-learning. The representative of such methods is the deep Q-learning Fourth, the optimization results of dynamic programming (DP) are
method (DQL) [22–24]. Reference [22] designed a DQL-based control innovatively used in the early training period to alleviate the “cold start”
framework and showed that fuel economy can be improved by problem. This DP-assisted training method can greatly reduce training
approximately 5.9% compared with the traditional RL strategy. How episodes and improve training efficiency.
ever, DQL still cannot handle tasks with continuous and The superiority of the proposed strategy is validated by comparing it
high-dimensional control actions. This significantly reduces the overall with existing strategies in terms of the training time, equivalent fuel
optimality of the control. To obtain more accurate control and optimi consumption, regression of terminal SOC and robustness under different
zation results, more deep neural networks are employed in DRL algo noises.
rithms. Such algorithms primarily include DDPG [18,25–27], TD3 [28]
and SAC [29]. Reference [18] combined DDPG with a rule-based 1.3. Organization
expert-assistance system to address the problem of random explora
tion in the “cold-start” stage. The results show that the training effi The remainder of the paper is organized as follows. In Section 2, the
ciency more than doubled due to the participation of the powertrain model of the HETV and the formulas of the energy man
expert-assistance system. A control framework for offline training and agement problem are modeled in detail. The control framework of the
online application based on DDPG is established in Ref. [25], which proposed strategy, including the Munchausen reinforcement learning
achieves an average 3.5% gap from the DP benchmark, superior to method, prioritized experience replay, and DP-assisted training, is
MPC-based EMS with accurate prediction. Reference [28] proposed an illustrated in Section 3. Section 4 shows the simulation results and
improved TD3-based EMS, which added a heuristic rule-based local analysis, followed by the main conclusions in Section 5.
controller and a hybrid experience replay (HER) method, allowing the
optimization effect to reach 94% of DP. Moreover, an optimal experi 2. Powertrain modeling and problem formulation
ence buffer, which stores the global-optimality policy resolved by
adopting DP, is created to utilize historical experiences to train the EMS. The studied vehicle is a series HETV. The configuration diagram is
This is a precedent for using DP in DRL training. However, all shown in Fig. 1, which is composed of an engine–generator set (EGS), a
DDPG-based EMSs and TD3-based EMSs studied above are deterministic lithium-ion battery, a power distribution unit (PDU), and two electric
2
W. Sun et al. Energy 258 (2022) 124806
⎧ ( )
⎪
⎨ Teng − Tg = 2π Je + Jg dng
⎪
ieg 60 i2eg dt (2)
⎪
⎪
⎩
ng = ieg neng
where Teng is the torque of the engine and neng is the speed of the engine.
ieg is the transmission ratio between the engine and generator. In this
vehicle, the engine and generator are directly connected by coupling, so
ieg is 1. Je and Jg stand for the moment of inertia of the engine and
generator.
2.2. Lithium-ion battery model
Fig. 2. Equivalent circuit of the EGS. In this paper, the internal resistance model is used to model the
lithium-ion battery system [32], and the equivalent circuit diagram of
the battery is shown in Fig. 3.
According to the equivalent circuit, the output voltage Ub and the
output power Pb of the battery are formulated as follows.
{
Ub = Vbat − Ib Rbat
(3)
Pb = Vbat Ib − Ib2 Rbat
The derivative of the SOC is expressed as follows.

√̅̅̅̅̅̅̅̅̅̅̅̅̅̅̅̅̅̅̅̅̅̅̅̅̅̅̅̅̅̅̅̅̅̅
⋅ (Vbat − V 2bat − 4Rbat Pb (t))
SOC = (4)
2Cb Rbat
where Vbat is the open circuit voltage of the battery, which changes as
the SOC changes. Rbat is the internal resistance of the battery; according
to the discharging and charging status of the battery, its value is equal to
Rd or Rc . Cb is the total battery capacity.
2.3. Power request model

Fig. 3. Equivalent circuit of the lithium-ion battery.
According to the dynamic formula of the tracked vehicle, the tracked

motors. These electric motors are powered by the EGS and the battery
vehicle required power is expressed as follows:
with the same rated power. The related vehicle parameters are illus
trated in Table 1. The EGS and battery in the powertrain structure are CD A 2 dv
Pdem = (mgf cos α + mg sin α + ⋅ v + δm )v + Mr ω (5)
modeled in the following, and the formulas for the power demand model 21.15 dt
are also given in detail.
where the first part stands for the heading power and the second part
stands for the steering power. m is the vehicle mass, g andf denote the
2.1. Engine–generator set model
gravity acceleration and rolling resistance coefficient.CD andA mean the
aerodynamic coefficient and the frontal area. is the rotary curb mass
The engine–generator set (EGS) incorporates an electronic speed
regulating diesel engine and a permanent magnet synchronous gener conversion factor. The forward velocity is v, and the rotational speed is
ator. The alternating current (AC) generated by the generator is recti ω. Mr represents the steering resistance moment. The formulations of the
fied, and the direct current (DC) output is coupled directly with the above variables can be described as follows:
power battery. The EGS and the rectifier module are simplified as an ⎧
⎪ v1 + v2
⎪ v=
equivalent circuit, which is illustrated in Fig. 2. ⎪
⎪
⎪ 2
⎪
The equations of output voltage Ug , output current Ig , and electro ⎨ v1 − v2
w= (6)
magnetic torque Tg on the DC side of the engine-generator set are as ⎪
⎪ B
⎪
⎪
follows: ⎪
⎩ Mr = 1 μmgL
⎪
{ 4
Ug = Ke ng − Kx ng Ig
Tg = Ke Ig − Kx Ig2 (1) where v1 and v2 are the velocities of the two tracks, B is the vehicle tread,
μ is the coefficient of steering resistance andL is the ground contact
length of the track.
where ng is the speed of the generator and Ke , Kx stand for the equivalent
Moreover, the equilibrium equation of the power demand and the
electromotive force coefficient and equivalent impedance coefficient,
electric power is as follows:
respectively. The input torque of the generator comes from the output
torque of the engine, and the two form a dynamic balance when work Pdem = (PEGS + Pb )η (7)
ing. The dynamic model of the EGS is established according to Newton’s
law, and the torque balance equation is as follows: where PEGS is the power provided by the EGS and Pb is the power pro
vided by the battery. η denotes the efficiency between electrical power
and mechanical power.
3
W. Sun et al. Energy 258 (2022) 124806
2.4. Energy management problem formulation 3.1. Formulation of SAC-DRL algorithm
In this paper, we use the DRL framework to solve the energy man The most important feature of SAC is the use of information entropy.
agement problem. According to the basic framework of DRL, state var Information entropy is a measure of the randomness of a strategy. A
iables, the control variable, the reward function, and the constraint larger entropy implies higher randomness of the strategy. The high
condition are set as follows. randomness is obtained by more exploration, which allows the algo
The vehicle state is described by the state of the powertrain and the rithm to learn more ways to compute the optimal solution. Therefore,
state of motion [33]. We define the power demand Pdem and speed v to the algorithm has more output possibilities in the face of perturbations,
represent the state of motion. The energy of the vehicle is provided by guaranteeing sufficient robustness. The SAC makes a trade-off between
the powertrain, consisting of the EGS and the battery. Therefore, we maximizing expected rewards and maximizing entropy. This makes the
define the speed of the generator ng to describe the state of the EGS and policy robust while ensuring optimality. The best policy function of SAC
define SOC to describe the state of the battery. A 4-D vector is used to can be expressed as follows [34]:
describe the state variable, which is expressed as follows. ∑
π* = argmaxЕ(st ,at )∼ρπ [ R(st , at ) + αH(π ( ⋅ |st ))] (12)
π
st = {Pdem , v, SOC, ng }T (8) t
In this series HETV, without any DC/DC converter or driving where the first part is the reward, which is the same as the traditional
component, the output power of the generator is coupled with the bat DRL, and the second part is the entropy, which is unique to SAC. The
tery directly, which reduces the control difficulty. Because the throttle of definition of information entropy is as follows:
the engine can determine the engine output torque, it directly affects the
H(P) = Е [ − log P(x)] (13)
fuel consumption and power distribution of the powertrain, so we select x∼P
it as the control variable, which is given by:

where x follows the distributionP.
at = {thr|thr ∈ [0, 1]} (9) The value function of SAC, which is given by the soft Bellman iter
The reward function is used to express the essence of the control ation, can be expressed in the following formula [35]:
problem, which is to dynamically adjust the energy distribution rela Q(st , at ) = r(st , at ) + γ⋅Еst+1 ,at+1 [Q(st+1 , at+1 ) − α log(π(at+1 |st+1 ))] (14)
tionship between the EGS and the battery to minimize the fuel con
sumption during the whole operation process and simultaneously make where γ is the discount factor and α is the temperature hyperparameter,
the SOC return to the target value at the end. Its reward function can be which is adaptively tuned. For the policy function, soft policy
expressed by the following formula: improvement can be achieved by using Kullback–Leibler (KL) diver
⋅ gence to approach the expectation of the value function, which can be
rt (st , at ) = − {k1 ⋅ frate (t) + k2 ⋅ [soc(t) − soctar ]2 } (10) expressed as:
where frate (t) is the instantaneous fuel consumption rate and soctar is the exp(α1 Q(st , ⋅))
(15)
′
π = arg min
∏DKL (πk ( ⋅ |st )||
target value of soc. In this paper, we define it as soctar = soc(t0 ). k1 and k2 πk ∈ Z(st )
are the weighting factors of the fuel consumption and battery SOC offset,
∏
respectively. where is the policy set, DKL is the KL divergence, and Z(st ) is a loga
To ensure the accuracy and rationality of the EMS solution, the rithm partition function that is used to normalize the distribution. It is a
following parameters in the vehicle model also need to meet the hard constant that does not affect gradient descent.
constraints of the physical system: The SAC algorithm is composed of three deep neural networks
⎧ (DNNs), a policy network π to generate policy, and two value networks Q
⎪ socmin ≤ soc(t) ≤ socmax
⎪
⎪
⎪ to generate the Q value. For faster and more stable training, SAC chooses
⎨ Pb,min ≤ Pb (t) ≤ Pb,max
ng,min ≤ ng (t) ≤ ng,max (11) the smaller Q value of the two value networks as the target Q value each
⎪
⎪
⎪ Ig,min ≤ Ig (t) ≤ Ig,min time. To mitigate the overestimation problem, the training process of the
⎪
⎩
Teng,min ≤ Teng (t) ≤ Teng,max two value networks also uses two target value networks Q’.
According to the optimal Bellman equation, the update of the value
network is given by:
2
1
(16)
′
JQ (θ) = E(st ,at ,st+1 )∼D,at+1 ∼πφ [ (Qθ (st , at ) − (rt + γ(Qθ (st+1 , at+1 ) − α log(π φ (at+1 |st+1 ))))) ]
2
3. Energy management strategy based on improved soft actor-

critic framework where D is the replay buffer, which stores data from previous experi
ences, including {at ,st ,rt ,st+1 }. The target value network performs a soft
In this section, the overall control framework of the improved SAC update, which can be represented as:
will be introduced in detail. The MRL method combined with SAC for the ′ ′
first time achieves a significant breakthrough through small changes in θQ = (1 − τ)θQ + τθQ (17)
the control effect. The PER achieves more reasonable experience sam
pling and improves the optimization effect, and the results of DP are where τ is the weighting factor of the soft update per step. The update of
innovatively used to assist the training process, which greatly acceler the policy network can be expressed as follows:
ates the initial convergence. 1
Jπ (φ) = Est ∼D,at ∼πφ [log πφ (at |st ) − Qθ (st , at ) + log Z(st )] (18)
α
The temperature hyperparameter α adopts an automatic adjustment
4
W. Sun et al. Energy 258 (2022) 124806
Fig. 4. Complete framework of the proposed DRL method for energy management control.
method. This method is based on the idea that the policy should keep the where fφμ (st ) is the mean value, fφσ (st ) is the variance, and εt is the random
entropy greater than a threshold while maximizing the expected reward.
noise that follows the normal distribution.
The update of α can be expressed as follows:
J(α) = Eat ∼πt [ − α log πt (at |st ) − αН 0 ] (19)
3.2. Munchausen Reinforcement learning method
where Н 0 is the minimum policy entropy threshold.
Inspired by information entropy, this paper innovatively combines
Because SAC is a stochastic algorithm, the action is not directly
the SAC method with the MRL method for the first time and constructs
output by the policy network but through the sampling. The action
the MSAC framework. This can greatly speed up convergence and
function is defined as:
improve optimization effects through minor changes. MRL is a bootstrap
at = fφ (εt ; st ) = fφμ (st ) + εt ⊙ fφσ (st ) (20) method using the current policy. It constructs the Munchausen reward
by adding log-policy to real-time reward for optimization [36].
5
W. Sun et al. Energy 258 (2022) 124806
Table 2 Table 3
The pseudocode of the proposed method. Key hyperparameters of the MSAC framework.
Algorithm: DP-assisted MSAC with prioritized experience replay Parameter Value
1: Randomly initialize the network parameters ϕ, θ1, θ2 Size of the replay buffer ND 100,000
2: Initialize the weight parameters of the target value network θ1’ = θ1, θ2’ = θ2 Batch size nb 64
3: Initialize the replay buffer, hyperparameters Initial temperature parameter α0 0.2
4: for episode ¼ 1 to H, do: Initial learning rates lar lcr , lαr 0.001
5: Resetting initialize state to s0 = {Pdem0, v0, SOC0, ng0}T Soft update factor τ 0.005
6: for t ¼ 1 to total steps, do: Discount factor γ 0.99
7: if episode < DP assisted training episodes, do: Prioritization factor αPER 0.6
8: Getting control action from DP solution: at = at_optimal Compensation factor βPER 0.4
9: else, do:
10: Generating control action applying proposed method at = πϕ (at|st)
11: end if
12: Obtain Munchausen reward rt and next step state st+1 pi α
P(i) = (24)
13: Store transition (st, at, rt, st+1) in replay buffer and set priority of experience pi ∑
k
14: if t > memory size of replay buffer ND, do: pαk
i=1
15: for i ¼ 1 to batch size nb, do:
16: Get experience i with probability Pi
17: Calculate importance sampling weight ωi where α is the prioritization factor, which determines how much pri
18: Compute TD-error and update the priority of experience i according to TD- oritization is used, and α = 0 corresponds to uniform sampling. pi > 0 is
error the priority of experience i and can be expressed as:
19: end for
20: Update the value networks: θi = θi − lcr ωi ∇JQ (θi ) for i ∈ {1, 2} pi = |δi | + ε (25)
21: Update policy network: ϕi = ϕi − lar ∇Jπ (ϕ)
22: Update temperature hyperparameter: α = α − lαr ∇J(α) where |δi | is the TD error and ε is a small positive constant that prevents
Update the target value networks: θi Q = (1 − τ)θi Q + τθi Q for i ∈ {1, 2} the edge case of experience from being revisited once their TD error is
′ ′
23:
24: end if zero.
25: end for
26: end for
Due to the random priority sampling of PER, a bias is inevitably
introduced. To correct the bias, the importance sampling weight is used:
1 1 β
ωi = ( ⋅ ) (26)
N P(i)
rM (st , at ) = r(st , at ) + log(π(at |st )) (21)
MRL can lead bootstrapping based on the fact that current policy can where β is the factor of compensation. These weights can be folded into
impact the next action. If the optimal policy is known, π(at |st ) = 1. The the Q-learning update by using ωi ∇JQ instead of ∇JQ .
log policy is 0, and all other suboptimal log policies are negative infinity.
This is a strong signal to guide the algorithm to suppress suboptimal 3.4. Dynamic programming assisted training method
solutions, and its addition to the reward does not change the optimal
policy. In the initial training process of SAC, the agent outputs actions
The optimal Bellman equation and the update of the Q network based through random exploration, which is often on the boundary. For
on the MSAC become: example, the action is 0, which means that the throttle of the engine is 0,
the engine stops and the power is provided entirely by the battery, so the
Q(st , at ) = rM (st , at ) + γ⋅Еst+1 ,at+1 [Q(st+1 , at+1 ) − α log(π (at+1 |st+1 ))] (22)
fuel consumption is 0. Alternatively, the action is 1, where the engine is
operating at maximum speed and fuel consumption is highest. Due to the
1
JQ (θ) = E(st ,at ,st+1 )∼D,at+1 ∼πφ [ (Qθ (st , at ) − (rM (st , at ) large state space, completely random exploration makes the rewards
2 (23) very sparse, and the problem of gradient disappearance often occurs
+γ(Qθ′ (st+1 , at+1 ) − α log(π φ (at+1 |st+1 )))))2 ] when sigmoid and tanh functions are used [18]. In this process, the
It should be mentioned that information entropy in SAC is to subtract agent learns very slowly or even cannot learn. This is called the “cold
all next Q values of the next decision from log-policy, while MRL is based start.” To speed up the training process and alleviate the problem of cold
on the judgment of the current action. start, this paper uses the optimal solution of DP to assist the training
process.
3.3. Prioritized experience replay Before the training process begins, the global optimal solution of the
training cycle is solved by DP, and the optimal action sequence is
SAC, as an off-policy strategy, stores historical experiences in a derived. During the first few rounds of training, agents do not randomly
replay buffer to remember and reuse them. Normally, SAC obtains explore actions but train with the optimal action sequence. However, to
experience through uniform sampling. However, in this method, each ensure the robustness of the strategy and information entropy can be
experience is sampled with the same probability, and the importance of fully considered, after the “cold start” stage, that is, when the decline
each experience is ignored, resulting in low sampling efficiency [37]. rate of loss slows down, the assisted training is stopped, and the action
Previous studies have found that TD error can be used as a measurement generated by the agent is used to continue the training until conver
index to determine the importance of every experience, but greedy gence. The method of DP-assisted training can completely avoid the
preferential sampling based solely on the size of TD error will lead to problem of “cold start” guide the rapid decline of loss in the early stage
overfitting of the algorithm. To improve the sampling efficiency and of training, and ensure the robustness and optimality of the strategy,
improve the accuracy of the algorithm, PER is proposed in this paper which can greatly shorten the training time.
using random priority sampling and annealing the bias to avoid this
problem. Random priority sampling combines greedy preferential 3.5. Complete control framework of the MSAC
sampling with uniform random sampling. The probability of sampling
experience is defined as [38]: The proposed MSAC strategy is shown schematically in Fig. 4, and
the pseudocode of the proposed method is demonstrated in Table 2. It is
composed of three deep neural networks (DNNs), which constitute fully
6
W. Sun et al. Energy 258 (2022) 124806
Fig. 5. Driving cycle for training the NN.
connected layers, including 2 hidden layers with 256 nodes each. The The training cycle is 1130 s in total, and the speed trajectory is shown
Adam optimizer is used for updating the networks, and the associated in Fig. 5. The data of the training cycle are collected by a real vehicle,
hyperparameters are given in Table 3. which is a typical working cycle of the vehicle model in this paper. In the
training stage, the initial state variable of SOC is set as 0.75, and the
4. Simulation results and discussion generator speed is set as 2200 rpm.
4.1. Validation of the training process (1) The influence of the MRL
In this section, the global optimal solution of DP is used as the The reward curves of the MSAC and SAC are shown in Fig. 6. The
optimal reference benchmark. The convergence speed and fuel optimi rewards of the MSAC rise rapidly, and the reward value of the conver
zation effect of the proposed method are compared with those of other gence point of the MSAC is higher than that of the SAC. Under the same
SAC-based EMSs. As a very popular DRL in the field of EMS, DDPG is constraint function, a higher reward means lower fuel consumption and
used as a comparative strategy to verify the fuel economy and robustness better return of SOC to the target value. This is also demonstrated by the
of the proposed method. To make the comparison fairer, we also use PER performance of MSAC and SAC in Table 6. In the convergence process
for DDPG. To compare the fuel economy of various methods more fairly (especially 20 to 90 episodes), the average reward change between two
and objectively, we use the linear regression method [39] to obtain the adjacent episodes for the MSAC and SAC is 587.01 and 750.13,
equivalent fuel consumption. The equivalent fuel consumption is the respectively. This means that the reward curve of the MSAC rises more
fuel consumption obtained when the final value of soc is regressed to gently. The convergence point of the MSAC is at 89 episodes, which is
75%. less than that of the SAC. This is because the update of the MSAC’s Q
network considers the influence of the current policy, which makes the
gradient descent faster and smoother and thus can converge to the
optimal solution more quickly. In summary, the combination of the MRL
method and SAC to build the MSAC framework is effective in improving
the optimization effect and convergence speed.
(2) The influence of PER
The training loss curves of MSAC and MSAC(PER) are depicted in

Fig. 7. When the loss is reduced to a certain value, the subsequent
decline of loss does not affect the equivalent fuel consumption, and we
assume that the algorithm has converged. In this training cycle, this
value is 2.18e4 for MSAC and 200 for algorithms adding PER. The loss
curve of MSAC decreases very slowly in the middle of the training
process. The loss curve appears to show a large decline in 80–85 epi
sodes and finally converges at 89 episodes. This decline may be because
Fig. 6. Reward curves of the MSAC and SAC. the information entropy of the model increases in some iterations in the
80–85 episodes, which in turn makes the exploration larger, and the
direction of the exploration makes the algorithm accelerate to converge
to the global optimal point. However, we want the convergence process
to be as smooth as possible. Because fluctuating gradient descent may
cause the strategy to miss the global optimum and lead to divergence or
convergence to a local optimum. The loss curve of MSAC(PER) does not
resemble the phenomenon of MSAC mentioned above. The training
process of MSAC (PER) converges at 80 episodes, and the training time is
reduced by 10% compared with that of MSAC. As shown in Table 6, the
equivalent fuel consumption of MSAC(PER) is also lower than that of
MSAC. The above phenomenon shows that PER can accelerate the
learning speed, improve the optimization effect, and make the conver
gence process as smooth as possible by better using valuable
experiences.
The equivalent fuel consumption of MSAC(PER) and MSAC is 678.41
g and 692 g, respectively. This shows that the fuel economy of MSAC
Fig. 7. Training loss curve of MSAC(PER) and MSAC.
7
W. Sun et al. Energy 258 (2022) 124806
Fig. 8. Power distributions of the three methods.
the battery of MSAC(PER) is closer to that of DP, while MSAC’s is much

Table 4
larger than DP’s. In summary, PER can improve the optimization effect
Power distribution during deceleration. and guide MSAC to better approach the global optimal solution.
Methods Average generator Average battery Maximum battery

output power output power charging power
(3) The influence of DP-assisted training
DP 4.77 kW − 3.66 kW − 12.87 kW

During the training process, we find that too few episodes (less than
MSAC 5.02 kW − 3.91 kW − 13.47 kW
(PER) 10 episodes) of DP-assisted training will not alleviate the “cold start” of
MSAC 6.14 kW − 5.03 kW − 17.31 kW the algorithm. Additionally, too many episodes (more than 20 episodes)
will affect the robustness of the algorithm, which will eventually lead to
the ineffectiveness of the algorithm in the verification cycles. We finally
(PER) is better than that of MSAC. The braking energy recovery of the choose to use the DP solution to guide the training for 15 episodes. It can
power battery is obvious in the deceleration processes of the training not only alleviate the “cold start” but also ensure the robustness of the
cycle. How to deal with the power distribution during braking is also a algorithm.
key to improving the fuel economy. The power distributions of DP, During these 15 episodes, the agent does not explore randomly but
MSAC, and MSAC(PER) are visualized in Fig. 8. The average output runs the optimal action trajectory of DP to update the neural network.
power of the generator and battery during deceleration, as well as the The loss function of DP-assisted training drops significantly faster than
maximum charging power of the battery, are listed in Table 4. DP, as the that of non-DP in Fig. 9. This is because DP-assisted training cancels
global optimal solution, indicates that the output power of the generator random exploration in the initial stage of training. The algorithm is not
is relatively small in the process of deceleration, while MSAC allows the only immune to the influence of extremely unreasonable experiences
generator to output larger power. Therefore, when the battery absorbs caused by random exploration. Moreover, the optimal action trajectory
the braking energy, the generator still charges the battery with large of DP provides a series of high-value experiences, which can guide the
power in the power distribution of the MSAC. We highlight this phe loss to drop rapidly and alleviate the “cold start.” Loss increases in
nomenon with purple boxes in Fig. 8. The average output power of the several episodes (16 episodes to 19 episodes in Fig. 9) when the auxiliary
generator and battery during the deceleration of MSAC(PER) is more training has just stopped. This is because the neural network has not yet
similar to that of DP. At the same time, the maximum charging power of converged, and the values of the generated actions cannot match DP’s
8
W. Sun et al. Energy 258 (2022) 124806
Although the result of DP-assisted training provides little improve

ment in fuel consumption compared with MSAC(PER) in Table 6, it can
help SOC to better return to the target value. This is also confirmed by
the changing trends of the SOC curves of the three methods in Fig. 10. In
the first 600 s, their trends are similar. However, after 600 s, the change
of the proposed method is more similar to that of DP. The SOC visibly
rises and returns to the target SOC. This indicates that DP can not only
help the algorithm alleviate the “cold start” but also guide the policy
convergence to the global optimum to a certain extent.
To verify the universality of the DP-assisted training method, it is
also used for SAC and MSAC. The training episodes and convergence
episodes are listed in Table 5. This shows that the DP-assisted training
method can indeed guide rapid strategy convergence and reduce time
consumption under the SAC framework.
Fig. 9. Training loss curve of DP-assisted MSAC(PER) and MSAC(PER).
(4) Validation of the overall driving cost optimization
The performance of the proposed method, SAC-, and DDPG-based

EMSs is shown in Table 6, wherein DP undoubtedly achieves the best
fuel optimality. The fuel consumption of the proposed EMS only in
creases 5.13% after SOC correction, which is the best compared with
other DRL-based EMSs. The final SOC also accurately returned to the
target SOC, which demonstrated the superiority of the proposed EMS.
The engine working points using the proposed method, DDPG and SAC-
based EMSs are visualized in Fig. 11. The contour lines in the distribu
tion diagram represent the different fuel consumption rates. The points
of the proposed method are more frequently distributed in the lower fuel
consumption rate region (as shown by the green dashed line) than those
of the two DRL-based EMS comparisons. Therefore, the proposed
method can achieve higher fuel economy compared with the other DRL
methods. In conclusion, the proposed method has the best fuel economy
Fig. 10. Different SOC variations of the three methods. and better regression of target SOC among all the comparison DRL-based
methods and is the suboptimal solution second only to DP.
optimal action trajectory. However, the value of the loss function is still
far lower than that of the MSAC(PER) without assistance at this stage. 4.2. Robustness verification of the proposed method
From the enlarged figure, the loss of DP-assisted training is lower than
that of PER in the same training episode, which also indicates that the The above verification assumes that the actual driving cycle can be
optimal solution of DP can guide the agent to converge to the global obtained in advance and used for training. However, the actual driving
optimum. cycle will inevitably have errors with the predicted training driving
cycle [40]. In this paper, artificial Gaussian noises of different intensities
Table 5 are added to the training cycle as the actual cycle to verify the robustness
DP-assisted training effect of the SAC framework. and adaptability of the proposed EMS, which is trained by the training
Methods Training Converge Time consumption
cycle in Fig. 5. The Gaussian noises can be produced by the following
Episodes Episodes (s) function in the MATLAB library:
SAC / 95 1531 vnew = vorig + wgn (length(vorig ), 1, m) (27)
DP + SAC 15 75 1199
MSAC / 89 1441
where vnew is the new drive velocity with added Gaussian noises. vorig is
DP + MSAC 12 71 1139
MSAC (PER) / 80 1303 the original drive velocity. wgn( ⋅) is a function in the MATLAB library
DP + MSAC 15 66 1065 used to generate Gaussian noise. vorig is the length of the driving cycle,
(PER) and m denotes the noise intensity to simulate the complexity of driving
conditions. The locally weighted scatterplot smoothing function in the
Table 6
Performance of various EMSs.
Algorithm Proposed method MSAC (PER) MSAC SAC DDPG (PER) DP
Fuel consumption(g) 678.06 660.30 691.00 705.31 694.00 655.46

Terminal SOC 74.99% 74.42% 74.97% 75.15% 75.04% 75.34%
Equivalent Fuel consumption (g) 678.06 678.41 692.00 700.81 693.00 645.00
Relative increase of DP 5.13% 5.18% 7.29% 8.65% 7.44% 0%
Time consumption (s) 1065 1303 1441 1531 1381 3231
9
W. Sun et al. Energy 258 (2022) 124806
Fig. 11. Distributions of engine working points of three DRL-based EMSs.
MATLAB library is also introduced to smooth the speed profile: while DDPG(PER) is 20.25%, indicating that the performance of the
proposed method is more stable under different noises. Even under a 25
(28)
′ ′
v′new = smooth(vnew , n, lowess ) dBw noise level, the fuel consumption of the proposed method is only
12% higher than that of DP, which is much lower than the 20% of DDPG
where smooth( ⋅) is the smoothing function and n is the window width,
(PER). This shows that the proposed method can still guarantee
which can affect the smoothness of the speed schedule. lowless is a
robustness in the face of high error. Moreover, the deviation of the final
parameter specifying the local regression method for smoothing data.
value SOC and the target SOC of the proposed method is basically within
Four groups of different parameters in Table 7 are employed in Eqs. (27)
2%. However, the deviation of DDPG(PER) is 3%–4%. Therefore, the
and (28), and the new test cycles are yielded in Fig. 12.
proposed method has great robustness under the test cycle with different
As a comparison strategy, DDPG(PER) does not use DP-assisted
noise levels.
training. We also attempted to use this method for DDPG(PER), but it
Neither of these two methods shows a linear relationship in which
does not seem to work. However, we assume that DP-assisted training
the larger the noise is, the worse the optimization effect. However, with
has no impact on robustness validation. Its main significance is to alle
the increase in the noise, the changing trend of the proposed method’s
viate the “cold start” problem and accelerate convergence. After assisted
relative increase in DP is different from DDPG(PER). We think the reason
training, the proposed method still has a large number of episodes to
for this phenomenon is related to the exploration of the training process.
explore according to the information entropy. From another point of
The significance of the exploration in the training process is to help the
view, the problem of DP itself is that it has poor adaptability and
algorithm find the global optimum. At the same time, the exploration
robustness. Using a fixed DP optimal solution to train the algorithm
helps the algorithm obtain as many solutions as possible to complete the
should also be negative if there is an impact on the robustness of the
specified task, thereby improving the robustness in the face of noise.
algorithm. However, our verification does not show such a problem.
However, the exploration direction is random. We have not identified
The results of the proposed method and DDPG(PER) are summarized
which noise the method faces will result in better performance. There
in Table 8. It should be mentioned that the relative increase is compared
fore, the optimization effect of the same method will fluctuate when
with the equivalent fuel consumption of the DP optimal solution of each
faced with different errors, and the change trends of the optimization
cycle. The average relative increase of the proposed method is 8.75%,
effect of different methods are also different. But more exploration
means more learning, which is the reason why the proposed method
Table 7 outperforms DDPG(PER) in each cycle.
Parameters for generating new test cycles with random Gaussian noise. If the V2X signal is limited and the vehicle cannot predict the future
Driving schedule m[dBw] N driving cycle, the strategy trained by the typical training cycle in Fig. 5
requires further verification of adaptability. We choose another typical
Cycle 1 10 10
Cycle 2 15 10
driving cycle collected by a real vehicle, as shown in Fig. 13. This cycle is
Cycle 3 20 20 a low-speed cycle with an average speed of 14.07 km/h and a maximum
Cycle 4 25 20 speed of 30.74 km/h (the average speed of the training condition is
21.87 km/h and 39.69 km/h). The results of the proposed method and
DDPG(PER) are summarized in Table 9.
10
W. Sun et al. Energy 258 (2022) 124806
Fig. 12. Four drive cycles with random Gaussian noise.
11
W. Sun et al. Energy 258 (2022) 124806
Table 8 Table 9
Performance under four drive cycles with Gaussian noise. Performance under a low-speed verification cycle.
Algorithm Driving Fuel Terminal Equivalent Relative Algorithm Fuel Terminal Equivalent fuel Relative
cycle consumption SOC fuel increase consumption SOC consumption (g) increase of
(g) consumption of DP (g) DP
(g)
Proposed 592.93 75.63% 574.03 13.5%
Proposed Cycle 1 656.43 74.11% 683.13 6% method
method Cycle 2 656.80 73.25% 708.81 10% DDPG(PER) 673.18 74.81% 678.88 34.2%
Cycle 3 626.77 73.21% 680.47 7% DP 512.05 75.21% 505.75 /
Cycle 4 692.00 75.46% 678.22 12%
DDPG Cycle 1 693.15 71.98% 783.75 19%
(PER) Cycle 2 676.56 71.77% 773.46 18%
Cycle 3 698.58 71.78% 791.96 24%
optimality. Furthermore, the optimal solution of DP is directly used for
Cycle 4 658.65 72.45% 734.64 20%
training to alleviate the cold start problem caused by random explora
tion. Comparative studies with existing state-of-the-art techniques have
been performed in terms of fuel optimality, training effort, robustness,
Based on the above studies on robustness and adaptability, the pro and adaptability. The major conclusions are drawn as follows.
posed method has obvious advantages compared with DDPG (PER),
caused by the following three aspects: (1) The performance of the proposed method is compared with SAC
and DDPG(PER) to verify its superiority. Attributed to the
(1) DDPG-based DRL training ends as long as an optimal path is participation of the MRL method, PER and DP-assisted training,
found in the process of neural network learning. However, the the results indicated that the proposed EMS obtained the best fuel
application of information entropy under SAC-based DRL re optimality (only increases 5.13% relative to DP) and fastest
quires the algorithm not only to consider the optimal solution but convergence speed (improved 30.5% compared with SAC).
also to explore all possible optimal paths and various optimal (2) The MRL method and PER can improve fuel optimization and
possibilities in different ways. Therefore, it is easier to adjust and speed up the training process at the same time. Moreover, DP-
find the optimal solution in the face of noise. assisted training can greatly improve the training speed and
(2) As a stochastic algorithm, the policy network of SAC outputs the accelerate the decrease of loss in the initial training, but it does
probability distribution of an action. The action of SAC needs to not greatly improve fuel consumption.
be sampled. As a deterministic algorithm, DDPG outputs a defi (3) Compared with DDPG, the proposed method has very high
nite action value. This feature not only allows SAC to explore robustness and adaptability under different verification cycles. It
better during training but also provides more possibilities for the is 8.75% higher than DP on average in error conditions and
algorithm when faced with errors. 13.5% higher than DP in low-speed verification conditions.
(3) The training of DDPG requires many more hyperparameters to be
adjusted than that of SAC. To encourage exploration of DDPG, OU Our future research will focus on real-world vehicle validations and
error parameters usually need to be adjusted manually. However, the consideration of platoon control. Moreover, the analysis and com
hyperparameters such as the temperature coefficient, which parison of more EMSs are worth completing.
controls the exploration degree of SAC, are adjusted
automatically. Credit author statement
5. Conclusion Wenjing Sun: Conceptualization, Methodology, Software, Formal

analysis, Writing - Original Draft, Writing - Review & Editing. Yuan Zou:
This article proposes an information entropy-based, high conver Conceptualization, Resources, Supervision, Funding acquisition.
gence rate and high robustness SAC-based energy management strategy Xudong Zhang: Validation, Investigation, Data Curation, Project
for series HETVs. The Munchausen Reinforcement learning method administration. Ningyuan Guo: Data Curation, Investigation, Writing -
constituted a new MSAC framework that could improve the optimization Review & Editing. Bin Zhang: Investigation, Writing - Review & Editing.
effect through reward bootstrapping. Meanwhile, the prioritized expe Guodong Du: Investigation, Writing - Review & Editing.
rience replay technique is used to improve the learning rate and fuel
Fig. 13. 1130 s low speed verification cycle.
12
W. Sun et al. Energy 258 (2022) 124806
Declaration of competing interest [16] Guo N, Lenzo B, Zhang X, Zou Y, Zhai R, Zhang T. A real-time nonlinear model
predictive controller for yaw motion optimization of distributed drive electric
vehicles. IEEE Trans Veh Technol May 2020;69(5):4935–46.
The authors declare that they have no known competing financial [17] Guo N, Zhang X, Zou Y, Guo L, Du G. Real-time predictive energy management of
interests or personal relationships that could have appeared to influence plug-in hybrid electric vehicles for coordination of fuel economy and battery
the work reported in this paper. degradation. Energy Jun. 2021;214.
[18] Wu J, Wei Z, Liu K, Quan Z, Li Y. Battery-involved energy management for hybrid
electric bus based on expert-assistance deep deterministic policy gradient
Data availability algorithm. IEEE Trans Veh Technol Nov. 2020;69(11):12786–96.
[19] Tulpule P, Marano V, Rizzoni G. Effect of traffic, road and weather information on
PHEV energy management. In: SAE technical paper0148-7191; 2011.
The data that has been used is confidential. [20] Zhang X, Guo L, Guo N, Zou Y, Du G. Bi-Level energy management of plug-in
hybrid electric vehicles for fuel economy and battery lifetime with intelligent state-
Acknowledgment of-charge reference. J Power Sources 2021;481.
[21] Liu T, Zou Y, Liu D, Sun F. Reinforcement learning-based energy management
strategy for a hybrid electric tracked vehicle. Energies 2015;8(7):7243–60.
This work is supported by the National Key Research and Develop [22] He D, Zou Y, Wu J, Zhang X, Zhang Z, Wang R. Deep Q-learning based energy
ment Program of China (2021YFB2500900), and in part by the National management strategy for a series hybrid electric tracked vehicle and its
adaptability validation. 2019 IEEE Transportation Electrification Conference and
Natural Science Foundation of China (51775039). The authors would Expo (ITEC); 2019. p. 1–6.
like to thank them for their support and help. In addition, the authors [23] Wu J, He H, Peng J, Li Y, Li Z. Continuous reinforcement learning of energy
also would like to thank the reviewers for their corrections and helpful management with deep Q network for a power split hybrid electric bus. Appl
Energy 2018;222:799–811.
suggestions.
[24] Lee W, Jeoung H, Park D, Kim T, Lee H, Kim N. A real-time intelligent energy
management strategy for hybrid electric vehicles using reinforcement learning.
References IEEE Access Jan. 2021;9:72759–68.
[25] Li Y, He H, Peng J, Wang H. Deep reinforcement learning-based energy
[1] Martinez CM, Hu X, Cao D, Velenis E, Gao B, Wellers M. Energy management in management for a series hybrid electric vehicle enabled by history cumulative trip
plug-in hybrid electric vehicles: recent progress and a connected vehicles information. IEEE Trans Veh Technol 2019;68(8):7416–7430, Aug.
perspective. IEEE Trans Veh Technol Jun. 2017;66(6):4534–49. [26] Li J, Wu X, Hu S, et al. A deep reinforcement learning based energy management
[2] Siang FT, Chee WT. A review of energy sources and energy management system in strategy for hybrid electric vehicles in connected traffic environment[J]. IFAC-
electric vehicles. Renew Sustain Energy Rev Apr. 2013;20:82–102. PapersOnLine 2021;54(10):150–6.
[3] Salmasi FR. Control strategies for hybrid electric vehicles: evolution, classification, [27] Wu Y, Tan H, Peng J, et al. Deep reinforcement learning of energy management
comparison, and future trends. IEEE Trans Veh Technol 2007;56(5):2393–2404, with continuous control strategy and traffic information for a series-parallel plug-in
Sept. hybrid electric bus. Appl Energy 2019;247:454–66.
[4] Padmarajan BV, McGordon A, Jennings PA. Blended rule-based energy [28] Zhou J, Xue S, Xue Y, Liao Y, Liu J, Zhao W. A novel energy management strategy
management for PHEV: system structure and strategy. IEEE Trans Veh Technol of hybrid electric vehicle via an improved TD3 deep reinforcement learning.
2016;65(10):8757–8762, Oct. Energy 2021;224:120118. Jun.
[5] Banvait H, Anwar S, Chen Y. A rule-based energy management strategy for Plug-in [29] Wu J, Wei Z, Li W, Wang Y, Li Y, Sauer DU. Battery thermal- and health-
Hybrid Electric Vehicle (PHEV). In: 2009 American control conference; 2009. constrained energy management for hybrid electric bus based on soft actor-critic
p. 3938–43. DRL algorithm. IEEE Trans Ind Inf 2021;17(6):3751–61.
[6] Lagorse J, Simoes MG, Miraoui A. A multiagent fuzzy-logic-based energy [30] Zhang F, Xi J, Langari R. Real-time energy management strategy based on velocity
management of hybrid systems. IEEE Trans Ind Appl -dec. 2009;45(6):2123–2129, forecasts using V2V and V2I communications. IEEE Trans Intell Transport Syst Feb.
Nov. 2017;18(2):416–30.
[7] Du G, Zou Y, Zhang X, Guo L, Guo N. Heuristic energy management strategy of [31] Hu X, Liu T, Qi X, Barth M. Reinforcement learning for hybrid and plug-in hybrid
hybrid electric vehicle based on deep reinforcement learning with accelerated electric vehicle energy management: recent advances and prospects. IEEE
gradient optimization. IEEE Transactions on Transportation Electrification 2021;7 Industrial Electronics Magazine Sept. 2019;13(3):16–25.
(4):2194–2208, Dec. [32] Wei Z, Dong G, Zhang X, Pou J, Quan Z, He H. Noise-immune model identification
[8] Zhou W, Yang L, Cai Y, Ying T. Dynamic programming for new energy vehicles and state-of-charge estimation for lithium-ion battery using bilinear
based on their work modes Part II: fuel cell electric vehicles. J Power Sources Dec. parameterization. IEEE Trans Ind Electron Jan. 2021;68(1):312–23.
2018;407:92–104. [33] Liu T, Hu X, Hu W, Zou Y. A heuristic planning reinforcement learning-based
[9] Zhou W, Yang L, Cai Y, Ying T. Dynamic programming for new energy vehicles energy management for power-split plug-in hybrid electric vehicles. IEEE Trans Ind
based on their work modes part I: electric vehicles and hybrid electric vehicles. Inf 2019;15(12):6436–6445, Dec.
J Power Sources Dec. 2018;406:151–66. [34] Haarnoja Tuomas, et al. Soft actor-critic: off-policy maximum entropy deep
[10] Tribioli L, Cozzolino R, Chiappini D, Iora P. Energy management of a plug-in fuel reinforcement learning with a stochastic actor.” International conference on
cell/battery hybrid vehicle with on-board fuel processing. Appl Energy Dec. 2016; machine learning. PMLR; 2018.
184:140–54. [35] Haarnoja Tuomas, et al. Soft actor-critic algorithms and applications. In: arXiv
[11] Xie S, Hu X, Xin Z, Brighton J. Pontryagin’s minimum principle based model preprint arXiv:1812.05905; 2018.
predictive control of energy management for a plug-in hybrid electric bus. Appl [36] Vieillard Nino, Pietquin O, Geist M. Munchausen reinforcement learning. Adv
Energy Feb. 2019;236:893–905. Neural Inf Process Syst 2020;33:4235–46.
[12] Chen Z, Mi C, Xiong R, Xu J, You C. Energy management of a power-split plug-in [37] Hou Y, Liu L, Wei Q, Xu X, Chen C. A novel DDPG method with prioritized
hybrid electric vehicle based on genetic algorithm and quadratic programming. experience replay. In: 2017 IEEE international conference on systems, man, and
J Power Sources Feb. 2014;248:416–26. cybernetics (SMC); 2017. p. 316–21.
[13] Lei Z, Qin D, Hou L, Peng J, Liu Y, Chen Z. An adaptive equivalent consumption [38] Schaul Tom, et al. Prioritized experience replay. 2015. arXiv preprint arXiv
minimization strategy for plug-in hybrid electric vehicles based on traffic 1511.05952.
information. Energy Jan. 2020;190. [39] Zhang B, Mi CC, Zhang M. Charge-depleting control strategies and fuel
[14] Kazemi H, Fallah Y, Nix A, Wayne S. Predictive AECMS by utilization of intelligent optimization of blended-mode plug-in hybrid electric vehicles. IEEE Trans Veh
transportation systems for hybrid electric vehicle powertrain control. IEEE Trans. Technol May 2011;60(4):1516–25. https://doi.org/10.1109/TVT.2011.2122313.
Intell. Veh. Jun. 2017;2(2):75–84. [40] Zhang Shuo, et al. Adaptively coordinated optimization of battery aging and
[15] Guo N, Zhang X, Yuan Zou, Du G, Wang C, Guo L. Predictive energy management energy management in plug-in hybrid electric buses. Appl Energy 2019;256:
of plug-in hybrid electric vehicles by real-time optimization and data-driven 113891.
calibration. IEEE Transactions on Vehicular Technology, early access Dec. 2021.
https://doi.org/10.1109/TVT.2021.3138440.
13

1 s2.0 S0360544222017091 Main

Uploaded by

Document Information

Original Description:

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

1 s2.0 S0360544222017091 Main

Uploaded by

Copyright:

Available Formats

Energy 258 (2022) 124806

Contents lists available at ScienceDirect

High robustness energy management strategy of hybrid electric vehicle

1. Introduction expandability [7]. Optimization-based methods use the optimization

algorithms. The policy network of deterministic algorithms outputs a

2.2. Lithium-ion battery model

The derivative of the SOC is expressed as follows.

2.3. Power request model

According to the dynamic formula of the tracked vehicle, the tracked

2.4. Energy management problem formulation 3.1. Formulation of SAC-DRL algorithm

it as the control variable, which is given by:

3. Energy management strategy based on improved soft actor-

Fig. 5. Driving cycle for training the NN.

(2) The influence of PER

The training loss curves of MSAC and MSAC(PER) are depicted in

Fig. 8. Power distributions of the three methods.

the battery of MSAC(PER) is closer to that of DP, while MSAC’s is much

Methods Average generator Average battery Maximum battery

DP 4.77 kW − 3.66 kW − 12.87 kW

Although the result of DP-assisted training provides little improve­

The performance of the proposed method, SAC-, and DDPG-based

Fuel consumption(g) 678.06 660.30 691.00 705.31 694.00 655.46

Fig. 11. Distributions of engine working points of three DRL-based EMSs.

Fig. 12. Four drive cycles with random Gaussian noise.

5. Conclusion Wenjing Sun: Conceptualization, Methodology, Software, Formal

Fig. 13. 1130 s low speed verification cycle.

You might also like

Although the result of DP-assisted training provides little improve