Energy
journal homepage: www.elsevier.com/locate/energy
Keywords: Energy management strategy; Deep reinforcement learning; Soft actor critic; Munchausen reinforcement learning; Prioritized experience replay

Abstract: As a key control technology of hybrid electric vehicles (HEVs), intelligent energy management strategies (EMSs) directly affect fuel consumption. Investigating the robustness of EMSs is necessary to maximize the advantages of energy saving and emission reduction in different driving environments. This article proposes a soft actor-critic (SAC) deep reinforcement learning (DRL) EMS for hybrid electric tracked vehicles (HETVs). Munchausen reinforcement learning (MRL) is adopted in the SAC algorithm, and the Munchausen SAC (MSAC) algorithm is constructed to achieve lower fuel consumption than the traditional SAC method. Prioritized experience replay (PER) is introduced to achieve more reasonable experience sampling and improve the optimization effect. To enhance the "cold start" performance, a dynamic programming (DP)-assisted training method is proposed that substantially improves the training efficiency. The optimization result of the proposed method is compared with the traditional SAC and with a deep deterministic policy gradient (DDPG) method with PER through simulation. The results show that the proposed strategy both reduces fuel consumption and possesses excellent robustness under different driving cycles.
* Corresponding author.
E-mail addresses: sunw-enjing@163.com (W. Sun), zouyuanbit@vip163.com (Y. Zou), xudong.zhang@bit.edu.cn (X. Zhang), gny123@qq.com (N. Guo),
guodongdu_robbie@163.com (G. Du).
https://doi.org/10.1016/j.energy.2022.124806
Received 19 March 2022; Received in revised form 15 June 2022; Accepted 11 July 2022
Available online 14 July 2022
0360-5442/© 2022 Published by Elsevier Ltd.
$$
\begin{cases}
T_{eng} - \dfrac{T_g}{i_{eg}} = \dfrac{2\pi}{60}\left(J_e + \dfrac{J_g}{i_{eg}^{2}}\right)\dfrac{dn_g}{dt} \\[2mm]
n_g = i_{eg}\, n_{eng}
\end{cases}
\tag{2}
$$
where Teng is the torque of the engine, neng is the speed of the engine, Tg and ng are the torque and speed of the generator, and ieg is the transmission ratio between the engine and the generator. In this vehicle, the engine and generator are directly connected by a coupling, so ieg is 1. Je and Jg denote the moments of inertia of the engine and the generator, respectively.
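As a minimal illustration of Eq. (2), the sketch below integrates the generator-speed dynamics with a forward Euler step; the parameter values and the helper name are assumptions for illustration, not taken from the paper.

```python
import numpy as np

# Assumed illustrative parameters (not from the paper)
J_e, J_g = 2.0, 0.5      # engine / generator moments of inertia [kg*m^2]
i_eg = 1.0               # engine-generator transmission ratio (direct coupling)

def generator_speed_step(n_g_rpm, T_eng, T_g, dt=0.1):
    """One Euler step of Eq. (2): returns the next generator speed [rpm]."""
    J_eq = J_e + J_g / i_eg**2                                    # equivalent inertia
    dn_dt = (T_eng - T_g / i_eg) * 60.0 / (2.0 * np.pi * J_eq)    # [rpm/s]
    return n_g_rpm + dn_dt * dt

# Example: engine torque exceeds the generator load, so the speed rises
print(generator_speed_step(2200.0, T_eng=300.0, T_g=250.0))
```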
Fig. 2. Equivalent circuit of the EGS.

In this paper, the internal resistance model is used to model the lithium-ion battery system [32], and the equivalent circuit diagram of the battery is shown in Fig. 3.
According to the equivalent circuit, the output voltage Ub and the
output power Pb of the battery are formulated as follows.
$$
\begin{cases}
U_b = V_{bat} - I_b R_{bat} \\
P_b = V_{bat} I_b - I_b^{2} R_{bat}
\end{cases}
\tag{3}
$$
where Vbat is the open circuit voltage of the battery, which changes as
the SOC changes. Rbat is the internal resistance of the battery; according
to the discharging and charging status of the battery, its value is equal to
Rd or Rc . Cb is the total battery capacity.
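A minimal sketch of this internal-resistance model, solving Eq. (3) for the battery current at a requested power and updating the SOC by Coulomb counting; the parameter values and the SOC-update form are illustrative assumptions, not the paper's data.

```python
import numpy as np

# Illustrative parameters (not taken from the paper)
V_oc = 380.0         # open-circuit voltage V_bat [V]
R_int = 0.15         # internal resistance R_bat [Ohm] (R_d or R_c)
C_b = 90.0 * 3600    # total battery capacity C_b [A*s]

def battery_step(soc, P_b, dt=1.0):
    """Solve Eq. (3) for the current I_b at a requested power P_b (discharge: P_b > 0),
    then update the SOC by Coulomb counting."""
    # P_b = V_oc*I_b - I_b^2*R_int  ->  R_int*I_b^2 - V_oc*I_b + P_b = 0
    I_b = (V_oc - np.sqrt(V_oc**2 - 4.0 * R_int * P_b)) / (2.0 * R_int)
    U_b = V_oc - I_b * R_int
    soc_next = soc - I_b * dt / C_b
    return soc_next, U_b, I_b

print(battery_step(0.75, P_b=20_000.0))
```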
In this paper, we use the DRL framework to solve the energy management problem. According to the basic framework of DRL, the state variables, the control variable, the reward function, and the constraint conditions are set as follows.

The vehicle state is described by the state of the powertrain and the state of motion [33]. We define the power demand Pdem and the speed v to represent the state of motion. The energy of the vehicle is provided by the powertrain, which consists of the EGS and the battery. Therefore, we define the speed of the generator ng to describe the state of the EGS and the SOC to describe the state of the battery. A 4-D vector is used to describe the state variable, which is expressed as follows:

$$s_t = \{P_{dem},\, v,\, SOC,\, n_g\}^{T} \tag{8}$$
In this series HETV, without any DC/DC converter or driving component, the output power of the generator is coupled with the battery directly, which reduces the control difficulty. Because the throttle of the engine determines the engine output torque, and thus directly affects the fuel consumption and power distribution of the powertrain, we select the engine throttle as the control variable.

The reward function penalizes the instantaneous fuel consumption and the deviation of the SOC from its target value, where frate(t) is the instantaneous fuel consumption rate and soctar is the target value of the SOC. In this paper, we define soctar = soc(t0). k1 and k2 are the weighting factors of the fuel consumption and the battery SOC offset, respectively.

To ensure the accuracy and rationality of the EMS solution, the following parameters of the vehicle model also need to meet the hard constraints of the physical system:

$$
\begin{cases}
soc_{min} \le soc(t) \le soc_{max} \\
P_{b,min} \le P_b(t) \le P_{b,max} \\
n_{g,min} \le n_g(t) \le n_{g,max} \\
I_{g,min} \le I_g(t) \le I_{g,max} \\
T_{eng,min} \le T_{eng}(t) \le T_{eng,max}
\end{cases}
\tag{11}
$$
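As a small illustration of Eqs. (8) and (11), the sketch below builds the 4-D state vector and clips the physical quantities to their hard limits; all numerical bounds are assumptions, since the paper does not list them here.

```python
import numpy as np

# Assumed illustrative bounds (the paper does not give numerical values here)
BOUNDS = {
    "soc":   (0.30, 0.90),
    "P_b":   (-80e3, 120e3),     # battery power [W]
    "n_g":   (800.0, 3000.0),    # generator speed [rpm]
    "I_g":   (-300.0, 300.0),    # generator current [A]
    "T_eng": (0.0, 600.0),       # engine torque [N*m]
}

def make_state(P_dem, v, soc, n_g):
    """4-D state vector of Eq. (8): s_t = {P_dem, v, SOC, n_g}^T."""
    return np.array([P_dem, v, soc, n_g], dtype=np.float32)

def enforce_constraints(values):
    """Clip each quantity of Eq. (11) to its admissible range."""
    return {k: float(np.clip(values[k], *BOUNDS[k])) for k in BOUNDS}
```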
By combining the SAC method with the MRL method, the proposed approach for the first time achieves a significant breakthrough in the control effect through small changes; the PER achieves more reasonable experience sampling and improves the optimization effect; and the results of DP are innovatively used to assist the training process, which greatly accelerates the initial convergence.

The most important feature of SAC is the use of information entropy. Information entropy is a measure of the randomness of a strategy: a larger entropy implies higher randomness of the strategy. The high randomness is obtained by more exploration, which allows the algorithm to learn more ways to reach the optimal solution. Therefore, the algorithm has more output possibilities in the face of perturbations, guaranteeing sufficient robustness. SAC makes a trade-off between maximizing the expected reward and maximizing the entropy, which makes the policy robust while ensuring optimality. The optimal policy of SAC can be expressed as follows [34]:

$$\pi^{*} = \arg\max_{\pi}\ \mathbb{E}_{(s_t,a_t)\sim\rho_\pi}\Big[\sum_t R(s_t,a_t) + \alpha H(\pi(\cdot\,|\,s_t))\Big] \tag{12}$$

where the first part is the reward, which is the same as in traditional DRL, and the second part is the entropy, which is unique to SAC. The definition of information entropy is as follows:

$$H(P) = \mathbb{E}_{x\sim P}\big[-\log P(x)\big] \tag{13}$$

The policy is improved by minimizing the KL divergence between the policy and the exponential of the Q function:

$$\pi' = \arg\min_{\pi_k \in \Pi} D_{KL}\!\left(\pi_k(\cdot\,|\,s_t)\,\Big\|\,\frac{\exp\!\big(\alpha^{-1} Q(s_t,\cdot)\big)}{Z(s_t)}\right) \tag{15}$$

where $\Pi$ is the policy set, $D_{KL}$ is the KL divergence, and $Z(s_t)$ is a logarithm partition function used to normalize the distribution; it is a constant that does not affect gradient descent.

The SAC algorithm is composed of three deep neural networks (DNNs): a policy network π to generate the policy and two value networks Q to generate the Q value. For faster and more stable training, SAC chooses the smaller Q value of the two value networks as the target Q value each time. To mitigate the overestimation problem, the training process of the two value networks also uses two target value networks Q′.

According to the optimal Bellman equation, the update of the value networks is given by:

$$J_Q(\theta) = \mathbb{E}_{(s_t,a_t,s_{t+1})\sim D,\,a_{t+1}\sim\pi_\phi}\!\left[\frac{1}{2}\Big(Q_\theta(s_t,a_t) - \big(r_t + \gamma\,(Q_{\theta'}(s_{t+1},a_{t+1}) - \alpha\log\pi_\phi(a_{t+1}|s_{t+1}))\big)\Big)^{2}\right] \tag{16}$$

The target value networks are updated softly:

$$\theta_{Q'} = (1-\tau)\,\theta_{Q'} + \tau\,\theta_Q \tag{17}$$

where τ is the weighting factor of the soft update per step. The update of the policy network can be expressed as follows:

$$J_\pi(\phi) = \mathbb{E}_{s_t\sim D,\,a_t\sim\pi_\phi}\!\left[\log\pi_\phi(a_t|s_t) - \frac{1}{\alpha}Q_\theta(s_t,a_t) + \log Z(s_t)\right] \tag{18}$$
The temperature hyperparameter α adopts an automatic adjustment method. This method is based on the idea that the policy should keep the entropy above a threshold while maximizing the expected reward. The update of α can be expressed as follows:

$$J(\alpha) = \mathbb{E}_{a_t\sim\pi_t}\!\left[-\alpha\log\pi_t(a_t|s_t) - \alpha\mathcal{H}_0\right] \tag{19}$$

where $\mathcal{H}_0$ is the minimum policy entropy threshold.

Because SAC is a stochastic algorithm, the action is not output directly by the policy network but is obtained through sampling. The action function is defined as:

$$a_t = f_\phi(\varepsilon_t; s_t) = f_\phi^{\mu}(s_t) + \varepsilon_t \odot f_\phi^{\sigma}(s_t) \tag{20}$$

where $f_\phi^{\mu}(s_t)$ is the mean value, $f_\phi^{\sigma}(s_t)$ is the variance, and $\varepsilon_t$ is random noise that follows a normal distribution.
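A condensed PyTorch-style sketch of the SAC updates in Eqs. (16)-(20). The network sizes, names, and the plain Gaussian head are illustrative assumptions (the paper's exact implementation may differ, e.g. in action squashing); the losses follow the equations above up to constant terms.

```python
import torch
import torch.nn as nn

def mlp(in_dim, out_dim, hidden=256):
    return nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                         nn.Linear(hidden, hidden), nn.ReLU(),
                         nn.Linear(hidden, out_dim))

class GaussianPolicy(nn.Module):
    """Outputs mean and log-std; actions are sampled with the reparameterization of Eq. (20)."""
    def __init__(self, s_dim, a_dim):
        super().__init__()
        self.net = mlp(s_dim, 2 * a_dim)
    def forward(self, s):
        mu, log_std = self.net(s).chunk(2, dim=-1)
        std = log_std.clamp(-5, 2).exp()
        dist = torch.distributions.Normal(mu, std)
        a = dist.rsample()                                  # a_t = mu + eps * sigma (Eq. (20))
        logp = dist.log_prob(a).sum(-1, keepdim=True)
        return a, logp

def sac_losses(policy, q1, q2, q1_targ, q2_targ, log_alpha, batch,
               gamma=0.99, target_entropy=-1.0):
    s, a, r, s2 = batch
    alpha = log_alpha.exp()
    # Critic target: r + gamma * (min Q'(s', a') - alpha * log pi(a'|s'))   (Eq. (16))
    with torch.no_grad():
        a2, logp2 = policy(s2)
        q_targ = torch.min(q1_targ(torch.cat([s2, a2], -1)),
                           q2_targ(torch.cat([s2, a2], -1)))
        y = r + gamma * (q_targ - alpha * logp2)
    q_loss = sum(((q(torch.cat([s, a], -1)) - y) ** 2).mean() for q in (q1, q2))
    # Policy loss: E[alpha * log pi(a|s) - min Q(s, a)]   (Eq. (18), up to scaling/constants)
    a_new, logp = policy(s)
    q_new = torch.min(q1(torch.cat([s, a_new], -1)), q2(torch.cat([s, a_new], -1)))
    pi_loss = (alpha.detach() * logp - q_new).mean()
    # Temperature loss: keep the entropy above the threshold H0   (Eq. (19))
    alpha_loss = (-log_alpha.exp() * (logp.detach() + target_entropy)).mean()
    return q_loss, pi_loss, alpha_loss

# Usage sketch: state_dim = 4 (Eq. (8)), action_dim = 1 (engine throttle)
policy = GaussianPolicy(4, 1)
q1, q2, q1_t, q2_t = mlp(5, 1), mlp(5, 1), mlp(5, 1), mlp(5, 1)
log_alpha = torch.zeros(1, requires_grad=True)
```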
Fig. 4. Complete framework of the proposed DRL method for energy management control.

3.2. Munchausen reinforcement learning method

Inspired by information entropy, this paper innovatively combines the SAC method with the MRL method for the first time and constructs the MSAC framework, which can greatly speed up convergence and improve the optimization effect through minor changes. MRL is a bootstrapping method that uses the current policy: it constructs the Munchausen reward by adding the log-policy to the real-time reward for optimization [36].
$$r_M(s_t,a_t) = r(s_t,a_t) + \log\pi(a_t|s_t) \tag{21}$$

MRL can guide bootstrapping based on the fact that the current policy can impact the next action. If the optimal policy is known, π(at|st) = 1, its log-policy is 0, and all other suboptimal log-policies are negative infinity. This is a strong signal that guides the algorithm to suppress suboptimal solutions, and its addition to the reward does not change the optimal policy.

The optimal Bellman equation and the update of the Q network based on the MSAC become:

$$Q(s_t,a_t) = r_M(s_t,a_t) + \gamma\,\mathbb{E}_{s_{t+1},a_{t+1}}\!\left[Q(s_{t+1},a_{t+1}) - \alpha\log\pi(a_{t+1}|s_{t+1})\right] \tag{22}$$

$$J_Q(\theta) = \mathbb{E}_{(s_t,a_t,s_{t+1})\sim D,\,a_{t+1}\sim\pi_\phi}\!\left[\frac{1}{2}\Big(Q_\theta(s_t,a_t) - \big(r_M(s_t,a_t) + \gamma\,(Q_{\theta'}(s_{t+1},a_{t+1}) - \alpha\log\pi_\phi(a_{t+1}|s_{t+1}))\big)\Big)^{2}\right] \tag{23}$$

It should be mentioned that the information entropy term in SAC subtracts the log-policy of the next decision from the next Q value, while the MRL term is based on the judgment of the current action.
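A minimal sketch, in the same PyTorch style as above, of how the Munchausen reward of Eq. (21) modifies the critic target of Eq. (23). The log-policy clipping is a common MRL stabilization trick added here as an assumption; it is not stated in the paper.

```python
import torch

def msac_target(r, logp_a, q_targ_next, logp_next, alpha=0.2, gamma=0.99,
                clip_low=-1.0):
    """Critic target with the Munchausen reward of Eqs. (21)-(23).

    r           : environment reward r(s_t, a_t)
    logp_a      : log pi(a_t | s_t) of the *current* action (Munchausen term)
    q_targ_next : min of the two target Q values at (s_{t+1}, a_{t+1})
    logp_next   : log pi(a_{t+1} | s_{t+1}) (soft-value entropy term)
    """
    # Eq. (21): r_M = r + log pi(a_t|s_t); clipping the log-policy is an assumed
    # stabilization detail, not taken from the paper.
    r_m = r + torch.clamp(logp_a, min=clip_low, max=0.0)
    # Eqs. (22)-(23): y = r_M + gamma * (Q'(s', a') - alpha * log pi(a'|s'))
    return r_m + gamma * (q_targ_next - alpha * logp_next)
```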
3.3. Prioritized experience replay

SAC, as an off-policy strategy, stores historical experiences in a replay buffer to remember and reuse them. Normally, SAC obtains experience through uniform sampling. However, in this method each experience is sampled with the same probability and the importance of each experience is ignored, resulting in low sampling efficiency [37]. Previous studies have found that the TD error can be used as a measurement index of the importance of every experience, but greedy preferential sampling based solely on the size of the TD error will lead to overfitting of the algorithm. To improve the sampling efficiency and the accuracy of the algorithm, PER is adopted in this paper, using random priority sampling and annealing the bias to avoid this problem. Random priority sampling combines greedy preferential sampling with uniform random sampling. The probability of sampling experience i is defined as [38]:

$$P(i) = \frac{p_i^{\alpha}}{\sum_k p_k^{\alpha}} \tag{24}$$

where α is the prioritization factor, which determines how much prioritization is used, and α = 0 corresponds to uniform sampling. $p_i > 0$ is the priority of experience i and can be expressed as:

$$p_i = |\delta_i| + \epsilon \tag{25}$$

where $|\delta_i|$ is the TD error and ε is a small positive constant that prevents the edge case in which an experience is never revisited once its TD error is zero.

Due to the random priority sampling of PER, a bias is inevitably introduced. To correct the bias, the importance sampling weight is used:

$$\omega_i = \left(\frac{1}{N}\cdot\frac{1}{P(i)}\right)^{\beta} \tag{26}$$

where β is the compensation factor. These weights can be folded into the Q-learning update by using $\omega_i \nabla J_Q$ instead of $\nabla J_Q$.
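A small, self-contained sketch of the proportional prioritized replay of Eqs. (24)-(26). A plain Python list is used instead of the sum-tree of Ref. [38], and the weight normalization is a common implementation choice, not something stated in the paper.

```python
import numpy as np

class SimplePER:
    def __init__(self, capacity=100_000, alpha=0.6, beta=0.4, eps=1e-5):
        self.data, self.prio = [], []
        self.capacity, self.alpha, self.beta, self.eps = capacity, alpha, beta, eps

    def add(self, transition):
        """New experiences get the current maximum priority so they are seen at least once."""
        p = max(self.prio, default=1.0)
        if len(self.data) >= self.capacity:
            self.data.pop(0); self.prio.pop(0)
        self.data.append(transition); self.prio.append(p)

    def sample(self, batch_size=64):
        p = np.asarray(self.prio) ** self.alpha
        P = p / p.sum()                                    # Eq. (24)
        idx = np.random.choice(len(self.data), batch_size, p=P)
        w = (len(self.data) * P[idx]) ** (-self.beta)      # Eq. (26)
        w /= w.max()                                       # normalize for stability (assumption)
        return idx, [self.data[i] for i in idx], w

    def update(self, idx, td_errors):
        for i, d in zip(idx, td_errors):
            self.prio[i] = abs(d) + self.eps               # Eq. (25)
```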
3.4. Dynamic programming assisted training method

In the initial training process of SAC, the agent outputs actions through random exploration, and these actions often lie on the boundary. For example, when the action is 0, the throttle of the engine is 0, the engine stops, the power is provided entirely by the battery, and the fuel consumption is 0. Alternatively, when the action is 1, the engine operates at maximum speed and the fuel consumption is highest. Due to the large state space, completely random exploration makes the rewards very sparse, and the problem of vanishing gradients often occurs when sigmoid and tanh functions are used [18]. In this process the agent learns very slowly or even cannot learn at all; this is called the "cold start." To speed up the training process and alleviate the cold start problem, this paper uses the optimal solution of DP to assist the training process.

Before the training process begins, the global optimal solution of the training cycle is solved by DP, and the optimal action sequence is derived. During the first few rounds of training, the agent does not explore actions randomly but trains with the optimal action sequence. However, to ensure that the robustness of the strategy and the information entropy can be fully considered, the assisted training is stopped after the "cold start" stage, that is, when the decline rate of the loss slows down, and the actions generated by the agent are used to continue the training until convergence. DP-assisted training can completely avoid the "cold start" problem, guide the rapid decline of the loss in the early stage of training, and ensure the robustness and optimality of the strategy, which greatly shortens the training time.
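The episode loop of this DP-assisted scheme (formalized in Table 2 below) could look roughly as follows; the environment, agent, DP action sequence, and the fixed episode switch are hypothetical placeholders.

```python
def train(env, agent, dp_actions, episodes=200, dp_assist_episodes=15):
    """DP-assisted training sketch: replay the DP-optimal action sequence during
    the first few episodes ("cold start"), then let the (M)SAC policy explore."""
    for episode in range(episodes):
        state, done, t = env.reset(), False, 0
        while not done:
            if episode < dp_assist_episodes:
                action = dp_actions[t]          # cold-start phase: follow the DP solution
            else:
                action = agent.act(state)       # normal SAC/MSAC exploration
            next_state, reward, done = env.step(action)
            agent.buffer.add((state, action, reward, next_state))
            agent.update()                      # PER sampling + network updates
            state, t = next_state, t + 1
```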
3.5. Complete control framework of the MSAC

The proposed MSAC strategy is shown schematically in Fig. 4, and the pseudocode of the proposed method is given in Table 2. It is composed of three deep neural networks (DNNs), which are fully connected networks including 2 hidden layers with 256 nodes each. The Adam optimizer is used for updating the networks, and the associated hyperparameters are given in Table 3.

Table 2
The pseudocode of the proposed method.

Algorithm: DP-assisted MSAC with prioritized experience replay
1:  Randomly initialize the network parameters φ, θ1, θ2
2:  Initialize the weight parameters of the target value networks: θ1′ = θ1, θ2′ = θ2
3:  Initialize the replay buffer and the hyperparameters
4:  for episode = 1 to H do
5:      Reset the initial state to s0 = {Pdem0, v0, SOC0, ng0}T
6:      for t = 1 to total steps do
7:          if episode < DP-assisted training episodes then
8:              Get the control action from the DP solution: at = at_optimal
9:          else
10:             Generate the control action with the proposed method: at ~ πφ(at|st)
11:         end if
12:         Obtain the Munchausen reward rt and the next state st+1
13:         Store the transition (st, at, rt, st+1) in the replay buffer and set the priority of the experience
14:         if t > memory size of the replay buffer ND then
15:             for i = 1 to batch size nb do
16:                 Sample experience i with probability P(i)
17:                 Calculate the importance sampling weight ωi
18:                 Compute the TD error and update the priority of experience i according to the TD error
19:             end for
20:             Update the value networks: θi = θi − lcr ωi ∇JQ(θi) for i ∈ {1, 2}
21:             Update the policy network: φ = φ − lar ∇Jπ(φ)
22:             Update the temperature hyperparameter: α = α − lαr ∇J(α)
23:             Update the target value networks: θi′ = (1 − τ)θi′ + τθi for i ∈ {1, 2}
24:         end if
25:     end for
26: end for

Table 3
Key hyperparameters of the MSAC framework.

Parameter                               Value
Size of the replay buffer ND            100,000
Batch size nb                           64
Initial temperature parameter α0        0.2
Initial learning rates lar, lcr, lαr    0.001
Soft update factor τ                    0.005
Discount factor γ                       0.99
Prioritization factor αPER              0.6
Compensation factor βPER                0.4
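For concreteness, the network and optimizer setup described above (two 256-node hidden layers, Adam, and the Table 3 values) might be instantiated as in the sketch below; this is an assumed configuration, not the authors' code.

```python
import torch
import torch.nn as nn

HPARAMS = dict(buffer_size=100_000, batch_size=64, alpha0=0.2,
               lr=1e-3, tau=0.005, gamma=0.99, alpha_per=0.6, beta_per=0.4)

def make_network(in_dim, out_dim):
    # Fully connected DNN with 2 hidden layers of 256 nodes each
    return nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU(),
                         nn.Linear(256, 256), nn.ReLU(),
                         nn.Linear(256, out_dim))

policy = make_network(4, 2)             # state (Eq. (8)) -> mean and log-std of the throttle
q1, q2 = make_network(5, 1), make_network(5, 1)
optimizers = [torch.optim.Adam(m.parameters(), lr=HPARAMS["lr"])
              for m in (policy, q1, q2)]
```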
4. Simulation results and discussion

4.1. Validation of the training process

In this section, the global optimal solution of DP is used as the optimal reference benchmark. The convergence speed and fuel optimization effect of the proposed method are compared with those of other SAC-based EMSs. As a very popular DRL method in the field of EMS, DDPG is used as a comparative strategy to verify the fuel economy and robustness of the proposed method; to make the comparison fairer, we also use PER for DDPG. To compare the fuel economy of the various methods more fairly and objectively, we use the linear regression method [39] to obtain the equivalent fuel consumption, which is the fuel consumption obtained when the final value of the SOC is regressed to 75%.
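A sketch of one way such an SOC-corrected ("equivalent") fuel consumption could be computed by linear regression over repeated runs, in the spirit of Ref. [39]; the exact procedure and the example numbers are assumptions, not results from the paper.

```python
import numpy as np

def equivalent_fuel(final_socs, fuel_consumptions, soc_target=0.75):
    """Fit fuel = a * final_SOC + b over several runs and evaluate it at the
    target SOC (75%), removing the effect of different terminal SOC values."""
    a, b = np.polyfit(final_socs, fuel_consumptions, deg=1)
    return a * soc_target + b

# Hypothetical example values (not data from the paper)
print(equivalent_fuel([0.72, 0.74, 0.76], [700.0, 680.0, 660.0]))
```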
The training cycle is 1130 s in total, and the speed trajectory is shown in Fig. 5. The data of the training cycle were collected by a real vehicle, and the cycle is a typical working cycle for the vehicle model in this paper. In the training stage, the initial SOC is set to 0.75, and the generator speed is set to 2200 rpm.

(1) The influence of the MRL

The reward curves of the MSAC and SAC are shown in Fig. 6. The rewards of the MSAC rise rapidly, and the reward value at the convergence point of the MSAC is higher than that of the SAC. Under the same constraint function, a higher reward means lower fuel consumption and a better return of the SOC to the target value. This is also demonstrated by the performance of the MSAC and SAC in Table 6. In the convergence process (especially from episode 20 to episode 90), the average reward change between two adjacent episodes is 587.01 for the MSAC and 750.13 for the SAC, which means that the reward curve of the MSAC rises more gently. The convergence point of the MSAC is at 89 episodes, which is earlier than that of the SAC. This is because the update of the MSAC's Q network considers the influence of the current policy, which makes the gradient descent faster and smoother and thus converges to the optimal solution more quickly. In summary, combining the MRL method with SAC to build the MSAC framework is effective in improving both the optimization effect and the convergence speed.

However, the value of the loss function is still far lower than that of the MSAC(PER) without assistance at this stage. From the enlarged figure, the loss of DP-assisted training is lower than that of PER alone in the same training episode, which also indicates that the optimal solution of DP can guide the agent to converge to the global optimum.

Table 5
DP-assisted training effect of the SAC framework.

Methods           Training episodes   Converge episodes   Time consumption (s)
SAC               /                   95                  1531
DP + SAC          15                  75                  1199
MSAC              /                   89                  1441
DP + MSAC         12                  71                  1139
MSAC (PER)        /                   80                  1303
DP + MSAC (PER)   15                  66                  1065

Table 6
Performance of various EMSs.

Algorithm   Proposed method   MSAC (PER)   MSAC   SAC   DDPG (PER)   DP

4.2. Robustness verification of the proposed method

The above verification assumes that the actual driving cycle can be obtained in advance and used for training. However, the actual driving cycle will inevitably deviate from the predicted training driving cycle [40]. In this paper, artificial Gaussian noises of different intensities are added to the training cycle to serve as the actual cycle and verify the robustness and adaptability of the proposed EMS, which is trained on the training cycle in Fig. 5. The Gaussian noises can be produced by the following function in the MATLAB library:

$$v_{new} = v_{orig} + \mathrm{wgn}\big(\mathrm{length}(v_{orig}),\,1,\,m\big) \tag{27}$$

where $v_{new}$ is the new drive velocity with added Gaussian noise, $v_{orig}$ is the original drive velocity, $\mathrm{wgn}(\cdot)$ is a function in the MATLAB library used to generate Gaussian noise, $\mathrm{length}(v_{orig})$ is the length of the driving cycle, and m denotes the noise intensity, which simulates the complexity of the driving conditions. The locally weighted scatterplot smoothing function in the MATLAB library is also introduced to smooth the speed profile:

$$v'_{new} = \mathrm{smooth}\big(v_{new},\,n,\,\mathrm{'lowess'}\big) \tag{28}$$

where $\mathrm{smooth}(\cdot)$ is the smoothing function, n is the window width, which affects the smoothness of the speed schedule, and 'lowess' is a parameter specifying the local regression method used for smoothing the data. Four groups of different parameters in Table 7 are employed in Eqs. (27) and (28), and the resulting new test cycles are shown in Fig. 12.
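A Python stand-in for Eqs. (27)-(28) is sketched below; it is not the MATLAB code. The noise power is converted from dBW as MATLAB's wgn does, while a simple moving average replaces the lowess smoother purely for illustration.

```python
import numpy as np

def noisy_cycle(v_orig, m_dbw, window=10, seed=0):
    """Add white Gaussian noise of power m_dbw [dBW] to a speed trace (Eq. (27))
    and smooth the result (Eq. (28) analogue, using a moving average here)."""
    rng = np.random.default_rng(seed)
    sigma = np.sqrt(10.0 ** (m_dbw / 10.0))                       # dBW -> noise std dev
    v_new = v_orig + rng.normal(0.0, sigma, size=len(v_orig))     # Eq. (27)
    kernel = np.ones(window) / window
    v_smooth = np.convolve(v_new, kernel, mode="same")            # stand-in for lowess
    return np.clip(v_smooth, 0.0, None)                           # speed cannot be negative

cycle = np.full(1130, 21.87)   # flat trace at the training cycle's average speed, for illustration
print(noisy_cycle(cycle, m_dbw=10)[:5])
```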
proposed method has great robustness under the test cycle with different
As a comparison strategy, DDPG(PER) does not use DP-assisted
noise levels.
training. We also attempted to use this method for DDPG(PER), but it
Neither of these two methods shows a linear relationship in which
does not seem to work. However, we assume that DP-assisted training
the larger the noise is, the worse the optimization effect. However, with
has no impact on robustness validation. Its main significance is to alle
the increase in the noise, the changing trend of the proposed method’s
viate the “cold start” problem and accelerate convergence. After assisted
relative increase in DP is different from DDPG(PER). We think the reason
training, the proposed method still has a large number of episodes to
for this phenomenon is related to the exploration of the training process.
explore according to the information entropy. From another point of
The significance of the exploration in the training process is to help the
view, the problem of DP itself is that it has poor adaptability and
algorithm find the global optimum. At the same time, the exploration
robustness. Using a fixed DP optimal solution to train the algorithm
helps the algorithm obtain as many solutions as possible to complete the
should also be negative if there is an impact on the robustness of the
specified task, thereby improving the robustness in the face of noise.
algorithm. However, our verification does not show such a problem.
However, the exploration direction is random. We have not identified
The results of the proposed method and DDPG(PER) are summarized
which noise the method faces will result in better performance. There
in Table 8. It should be mentioned that the relative increase is compared
fore, the optimization effect of the same method will fluctuate when
with the equivalent fuel consumption of the DP optimal solution of each
faced with different errors, and the change trends of the optimization
cycle. The average relative increase of the proposed method is 8.75%,
effect of different methods are also different. But more exploration
means more learning, which is the reason why the proposed method
Table 7 outperforms DDPG(PER) in each cycle.
Parameters for generating new test cycles with random Gaussian noise. If the V2X signal is limited and the vehicle cannot predict the future
Driving schedule m[dBw] N driving cycle, the strategy trained by the typical training cycle in Fig. 5
requires further verification of adaptability. We choose another typical
Cycle 1 10 10
Cycle 2 15 10
driving cycle collected by a real vehicle, as shown in Fig. 13. This cycle is
Cycle 3 20 20 a low-speed cycle with an average speed of 14.07 km/h and a maximum
Cycle 4 25 20 speed of 30.74 km/h (the average speed of the training condition is
21.87 km/h and 39.69 km/h). The results of the proposed method and
DDPG(PER) are summarized in Table 9.
10
W. Sun et al. Energy 258 (2022) 124806
11
W. Sun et al. Energy 258 (2022) 124806
Table 8 Table 9
Performance under four drive cycles with Gaussian noise. Performance under a low-speed verification cycle.
Algorithm Driving Fuel Terminal Equivalent Relative Algorithm Fuel Terminal Equivalent fuel Relative
cycle consumption SOC fuel increase consumption SOC consumption (g) increase of
(g) consumption of DP (g) DP
(g)
Proposed 592.93 75.63% 574.03 13.5%
Proposed Cycle 1 656.43 74.11% 683.13 6% method
method Cycle 2 656.80 73.25% 708.81 10% DDPG(PER) 673.18 74.81% 678.88 34.2%
Cycle 3 626.77 73.21% 680.47 7% DP 512.05 75.21% 505.75 /
Cycle 4 692.00 75.46% 678.22 12%
DDPG Cycle 1 693.15 71.98% 783.75 19%
(PER) Cycle 2 676.56 71.77% 773.46 18%
Cycle 3 698.58 71.78% 791.96 24%
Based on the above studies of robustness and adaptability, the proposed method has obvious advantages compared with DDPG(PER), which are caused by the following three aspects:

(1) DDPG-based DRL training ends as soon as an optimal path is found in the process of neural network learning. However, the application of information entropy in SAC-based DRL requires the algorithm not only to consider the optimal solution but also to explore all possible optimal paths and various optimal possibilities in different ways. Therefore, it is easier to adjust and find the optimal solution in the face of noise.
(2) As a stochastic algorithm, the policy network of SAC outputs the probability distribution of an action, and the action of SAC needs to be sampled; as a deterministic algorithm, DDPG outputs a definite action value. This feature not only allows SAC to explore better during training but also provides more possibilities for the algorithm when faced with errors.
(3) The training of DDPG requires many more hyperparameters to be adjusted than that of SAC. To encourage exploration in DDPG, the OU noise parameters usually need to be adjusted manually, whereas hyperparameters such as the temperature coefficient, which controls the exploration degree of SAC, are adjusted automatically.

5. Conclusions

Furthermore, the optimal solution of DP is directly used for training to alleviate the cold start problem caused by random exploration. Comparative studies with existing state-of-the-art techniques have been performed in terms of fuel optimality, training effort, robustness, and adaptability. The major conclusions are drawn as follows.

(1) The performance of the proposed method is compared with SAC and DDPG(PER) to verify its superiority. Attributed to the participation of the MRL method, PER, and DP-assisted training, the results indicate that the proposed EMS obtains the best fuel optimality (only a 5.13% increase relative to DP) and the fastest convergence speed (improved by 30.5% compared with SAC).
(2) The MRL method and PER can improve the fuel optimization and speed up the training process at the same time. Moreover, DP-assisted training can greatly improve the training speed and accelerate the decrease of the loss in the initial training, but it does not greatly improve fuel consumption.
(3) Compared with DDPG, the proposed method has very high robustness and adaptability under different verification cycles. Its equivalent fuel consumption is 8.75% higher than that of DP on average under error conditions and 13.5% higher than DP under the low-speed verification condition.

Our future research will focus on real-world vehicle validations and the consideration of platoon control. Moreover, the analysis and comparison of more EMSs are worth completing.

Credit author statement
Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Data availability

The data that has been used is confidential.

Acknowledgment

This work is supported by the National Key Research and Development Program of China (2021YFB2500900), and in part by the National Natural Science Foundation of China (51775039). The authors would like to thank them for their support and help. In addition, the authors also would like to thank the reviewers for their corrections and helpful suggestions.

References

[1] Martinez CM, Hu X, Cao D, Velenis E, Gao B, Wellers M. Energy management in plug-in hybrid electric vehicles: recent progress and a connected vehicles perspective. IEEE Trans Veh Technol Jun. 2017;66(6):4534–49.
[2] Siang FT, Chee WT. A review of energy sources and energy management system in electric vehicles. Renew Sustain Energy Rev Apr. 2013;20:82–102.
[3] Salmasi FR. Control strategies for hybrid electric vehicles: evolution, classification, comparison, and future trends. IEEE Trans Veh Technol Sept. 2007;56(5):2393–404.
[4] Padmarajan BV, McGordon A, Jennings PA. Blended rule-based energy management for PHEV: system structure and strategy. IEEE Trans Veh Technol Oct. 2016;65(10):8757–62.
[5] Banvait H, Anwar S, Chen Y. A rule-based energy management strategy for Plug-in Hybrid Electric Vehicle (PHEV). In: 2009 American control conference; 2009. p. 3938–43.
[6] Lagorse J, Simoes MG, Miraoui A. A multiagent fuzzy-logic-based energy management of hybrid systems. IEEE Trans Ind Appl Nov.–Dec. 2009;45(6):2123–29.
[7] Du G, Zou Y, Zhang X, Guo L, Guo N. Heuristic energy management strategy of hybrid electric vehicle based on deep reinforcement learning with accelerated gradient optimization. IEEE Transactions on Transportation Electrification Dec. 2021;7(4):2194–208.
[8] Zhou W, Yang L, Cai Y, Ying T. Dynamic programming for new energy vehicles based on their work modes Part II: fuel cell electric vehicles. J Power Sources Dec. 2018;407:92–104.
[9] Zhou W, Yang L, Cai Y, Ying T. Dynamic programming for new energy vehicles based on their work modes Part I: electric vehicles and hybrid electric vehicles. J Power Sources Dec. 2018;406:151–66.
[10] Tribioli L, Cozzolino R, Chiappini D, Iora P. Energy management of a plug-in fuel cell/battery hybrid vehicle with on-board fuel processing. Appl Energy Dec. 2016;184:140–54.
[11] Xie S, Hu X, Xin Z, Brighton J. Pontryagin's minimum principle based model predictive control of energy management for a plug-in hybrid electric bus. Appl Energy Feb. 2019;236:893–905.
[12] Chen Z, Mi C, Xiong R, Xu J, You C. Energy management of a power-split plug-in hybrid electric vehicle based on genetic algorithm and quadratic programming. J Power Sources Feb. 2014;248:416–26.
[13] Lei Z, Qin D, Hou L, Peng J, Liu Y, Chen Z. An adaptive equivalent consumption minimization strategy for plug-in hybrid electric vehicles based on traffic information. Energy Jan. 2020;190.
[14] Kazemi H, Fallah Y, Nix A, Wayne S. Predictive AECMS by utilization of intelligent transportation systems for hybrid electric vehicle powertrain control. IEEE Trans Intell Veh Jun. 2017;2(2):75–84.
[15] Guo N, Zhang X, Zou Y, Du G, Wang C, Guo L. Predictive energy management of plug-in hybrid electric vehicles by real-time optimization and data-driven calibration. IEEE Trans Veh Technol, early access, Dec. 2021. https://doi.org/10.1109/TVT.2021.3138440.
[16] Guo N, Lenzo B, Zhang X, Zou Y, Zhai R, Zhang T. A real-time nonlinear model predictive controller for yaw motion optimization of distributed drive electric vehicles. IEEE Trans Veh Technol May 2020;69(5):4935–46.
[17] Guo N, Zhang X, Zou Y, Guo L, Du G. Real-time predictive energy management of plug-in hybrid electric vehicles for coordination of fuel economy and battery degradation. Energy Jun. 2021;214.
[18] Wu J, Wei Z, Liu K, Quan Z, Li Y. Battery-involved energy management for hybrid electric bus based on expert-assistance deep deterministic policy gradient algorithm. IEEE Trans Veh Technol Nov. 2020;69(11):12786–96.
[19] Tulpule P, Marano V, Rizzoni G. Effect of traffic, road and weather information on PHEV energy management. SAE technical paper 0148-7191; 2011.
[20] Zhang X, Guo L, Guo N, Zou Y, Du G. Bi-level energy management of plug-in hybrid electric vehicles for fuel economy and battery lifetime with intelligent state-of-charge reference. J Power Sources 2021;481.
[21] Liu T, Zou Y, Liu D, Sun F. Reinforcement learning-based energy management strategy for a hybrid electric tracked vehicle. Energies 2015;8(7):7243–60.
[22] He D, Zou Y, Wu J, Zhang X, Zhang Z, Wang R. Deep Q-learning based energy management strategy for a series hybrid electric tracked vehicle and its adaptability validation. In: 2019 IEEE Transportation Electrification Conference and Expo (ITEC); 2019. p. 1–6.
[23] Wu J, He H, Peng J, Li Y, Li Z. Continuous reinforcement learning of energy management with deep Q network for a power split hybrid electric bus. Appl Energy 2018;222:799–811.
[24] Lee W, Jeoung H, Park D, Kim T, Lee H, Kim N. A real-time intelligent energy management strategy for hybrid electric vehicles using reinforcement learning. IEEE Access Jan. 2021;9:72759–68.
[25] Li Y, He H, Peng J, Wang H. Deep reinforcement learning-based energy management for a series hybrid electric vehicle enabled by history cumulative trip information. IEEE Trans Veh Technol Aug. 2019;68(8):7416–30.
[26] Li J, Wu X, Hu S, et al. A deep reinforcement learning based energy management strategy for hybrid electric vehicles in connected traffic environment. IFAC-PapersOnLine 2021;54(10):150–6.
[27] Wu Y, Tan H, Peng J, et al. Deep reinforcement learning of energy management with continuous control strategy and traffic information for a series-parallel plug-in hybrid electric bus. Appl Energy 2019;247:454–66.
[28] Zhou J, Xue S, Xue Y, Liao Y, Liu J, Zhao W. A novel energy management strategy of hybrid electric vehicle via an improved TD3 deep reinforcement learning. Energy Jun. 2021;224:120118.
[29] Wu J, Wei Z, Li W, Wang Y, Li Y, Sauer DU. Battery thermal- and health-constrained energy management for hybrid electric bus based on soft actor-critic DRL algorithm. IEEE Trans Ind Inf 2021;17(6):3751–61.
[30] Zhang F, Xi J, Langari R. Real-time energy management strategy based on velocity forecasts using V2V and V2I communications. IEEE Trans Intell Transport Syst Feb. 2017;18(2):416–30.
[31] Hu X, Liu T, Qi X, Barth M. Reinforcement learning for hybrid and plug-in hybrid electric vehicle energy management: recent advances and prospects. IEEE Industrial Electronics Magazine Sept. 2019;13(3):16–25.
[32] Wei Z, Dong G, Zhang X, Pou J, Quan Z, He H. Noise-immune model identification and state-of-charge estimation for lithium-ion battery using bilinear parameterization. IEEE Trans Ind Electron Jan. 2021;68(1):312–23.
[33] Liu T, Hu X, Hu W, Zou Y. A heuristic planning reinforcement learning-based energy management for power-split plug-in hybrid electric vehicles. IEEE Trans Ind Inf Dec. 2019;15(12):6436–45.
[34] Haarnoja T, et al. Soft actor-critic: off-policy maximum entropy deep reinforcement learning with a stochastic actor. In: International conference on machine learning. PMLR; 2018.
[35] Haarnoja T, et al. Soft actor-critic algorithms and applications. arXiv preprint arXiv:1812.05905; 2018.
[36] Vieillard N, Pietquin O, Geist M. Munchausen reinforcement learning. Adv Neural Inf Process Syst 2020;33:4235–46.
[37] Hou Y, Liu L, Wei Q, Xu X, Chen C. A novel DDPG method with prioritized experience replay. In: 2017 IEEE international conference on systems, man, and cybernetics (SMC); 2017. p. 316–21.
[38] Schaul T, et al. Prioritized experience replay. arXiv preprint arXiv:1511.05952; 2015.
[39] Zhang B, Mi CC, Zhang M. Charge-depleting control strategies and fuel optimization of blended-mode plug-in hybrid electric vehicles. IEEE Trans Veh Technol May 2011;60(4):1516–25. https://doi.org/10.1109/TVT.2011.2122313.
[40] Zhang S, et al. Adaptively coordinated optimization of battery aging and energy management in plug-in hybrid electric buses. Appl Energy 2019;256:113891.