Double Deep Q-Learning-Based Distributed Operation of Battery Energy Storage System
Abstract—Q-learning-based operation strategies are being recently applied for optimal operation of energy storage systems, where a Q-table is used to store Q-values for all possible state-action pairs. However, Q-learning faces challenges when it comes to large state space problems, i.e., continuous state space problems or problems with environment uncertainties. In order to address the limitations of Q-learning, this paper proposes a distributed operation strategy using the double deep Q-learning method. It is applied to managing the operation of a community battery energy storage system (CBESS) in a microgrid system. In contrast to Q-learning, the proposed operation strategy is capable of dealing with uncertainties in the system in both grid-connected and islanded modes. This is due to the utilization of a deep neural network as a function approximator to estimate the Q-values. Moreover, the proposed method can mitigate the overestimation that is the major drawback of standard deep Q-learning. The proposed method trains the model faster by decoupling the selection and evaluation processes. Finally, the performance of the proposed double deep Q-learning-based operation method is evaluated by comparing its results with a centralized approach-based operation.

Index Terms—Artificial intelligence, battery energy storage system, double deep Q-learning-based operation, microgrid operation, Q-learning, optimization.

I. INTRODUCTION

Energy management systems (EMSs) have been widely applied for the operation of microgrids (MGs) in both grid-connected and islanded modes. The major responsibility of the EMS is to optimize the operation of the microgrid's resources with a defined objective [1], [2]. There are two fundamentally different approaches to the design of EMS frameworks: centralized and decentralized. In the centralized approach, a centralized EMS is provided with the relevant information of all the distributed energy resources within the MG [3]–[5]. However, with an increase in the number of control devices, the computational burden on the central EMS and the communication complexity of the network increase. Moreover, centralized management entails a single point of failure, and a central unit fault will cause the breakdown of the entire system. Low flexibility/expandability is another critical limitation of centralized management [6]. In addition, the entities in the system are interconnected and may have their own operation objectives. Therefore, a centralized EMS could have practical limitations for the energy scheduling of MGs having diverse objectives and ownerships [7]. Consequently, decentralized energy management systems have recently been gaining popularity due to their ability to overcome the limitations of centralized EMSs [6], [8], [9]. A diffusion strategy-based distributed operation of MGs has been proposed to minimize the total operation cost by using a multiagent system [8]. A fully distributed control strategy has been proposed based on the consensus algorithm for the optimal resource management in an islanded MG [9]. However, in these methods, an exact mathematical model is required for operation of the MG system. It could be challenging to develop exact mathematical models for complex systems. Moreover, these methods are iteration-based optimizations, i.e., agents need to exchange information several times to achieve an optimal result. This type of iterative operation could be problematic if there is any change in the system parameters. These methods could take a long time to find a new optimal point and might not be suitable for real-time optimization problems.

To overcome the disadvantages of the conventional decentralized approaches, reinforcement learning (RL)-based decentralized energy management has been introduced as a potential solution for optimal operation of MGs [10]–[12]. In RL, agents learn to achieve a given task by interacting with their environment. Since the agents do not require any model of the environment, they only need to know the existing states and possible actions in each state. This method drives the learning process on the basis of penalties or rewards assessed on a sequence of actions taken in response to the environment dynamics [13], [14].

Q-learning, being a popular method in RL, is widely used for optimal operation of MGs [15], [16]. In Q-learning, a Q-table is used to represent the knowledge of the agent about the environment. The Q-value for each state and action pair reflects

Manuscript received January 16, 2019; revised April 1, 2019 and May 9, 2019; accepted June 14, 2019. Date of publication June 20, 2019; date of current version December 23, 2019. This work was supported by the Korea Institute of Energy Technology Evaluation and Planning and the Ministry of Trade, Industry and Energy of South Korea under Grant 20168530050030. Paper no. TSG-00082-2019. (Corresponding author: Hak-Man Kim.)
The authors are with the Department of Electrical Engineering, Incheon National University, Incheon 402-749, South Korea (e-mail: hmkim@inu.ac.kr).
Digital Object Identifier 10.1109/TSG.2019.2924025
458 IEEE TRANSACTIONS ON SMART GRID, VOL. 11, NO. 1, JANUARY 2020
the future reward associated with taking such an action in this state. However, Q-learning-based operation methods are only suitable for problems with a small state space. They are not suitable for a continuous state space or for an environment with uncertainties [17], [18]. Therefore, a model which maps the state information provided as input to Q-values of the possible set of actions should be developed. For this reason, a deep neural network (DNN) comes to play the role of a function approximator, which can take state information as input and learn to map it to Q-values for all possible actions. Combining Q-learning with a DNN leads to a method called deep Q-learning or deep Q-networks (DQN), which enhances the performance of Q-learning for large-scale problems [19]–[21]. In DQN, the same weight values are used for both selection and evaluation of an action. This can lead to overoptimistic value estimates [17], [22]. To mitigate this problem, double deep Q-learning (DDQN) has been developed to decouple the selection from the evaluation. In DDQN, a primary network is used to choose the action and a target network is used to generate a target Q-value for that action [22]. Therefore, a DDQN-based operation is suitable for an agent with a large state space, and the agent can avoid being trapped in local optima.

Due to the advantages of DDQN, a distributed operation strategy for managing the operation of a community battery energy storage system (CBESS) in an MG system is proposed using DDQN. The system is comprised of an MG and a CBESS, where a microgrid EMS (MG-EMS) is used for optimizing the operation of the MG and a DDQN-based operation is used for the CBESS. In grid-connected mode, the actions of the CBESS are only dependent on the market price signals. The effect of uncertainty of market price signals is analyzed in this mode. On the other hand, in islanded mode, the actions of the CBESS are dependent on both the trading price and the surplus/shortage amount of the MG. Due to the uncertainty of load and renewable resources in the MG, the uncertainties of surplus and shortage power of the MG are also considered in this study. In contrast to Q-learning, the proposed operation strategy is capable of dealing with uncertainties in the system in both grid-connected and islanded modes. Moreover, the DDQN decouples the selection and evaluation of an action; thus, the CBESS can avoid being trapped in local optima. Finally, in order to analyze the effectiveness of the proposed DDQN-based optimization method, the operation results of the proposed method are compared with the conventional centralized EMS results and other learning methods' results. Simulation results have proved that the proposed method can get similar results to those of a centralized EMS, despite being a decentralized approach. The major contributions of this study are listed as follows.

• A distributed operation strategy is proposed for the CBESS using the DDQN method in both grid-connected and islanded modes with different objectives. Due to the use of different DNNs, the CBESS can be continuously trained with a new data set in real-time operation to improve the decision accuracy.
• The proposed operation strategy is capable of handling the system uncertainties in the environment. The charging/discharging decisions of the CBESS are adjusted with the actual data by using a target-network with optimal parameters.
• A comparison between the proposed method and other methods is presented to evaluate the performance of the proposed algorithm. Simulation results have proved that the proposed method can get similar results to those of a centralized EMS, despite being a decentralized approach.
• The performance of the proposed method is also tested for a longer time span (48-hour scheduling horizon). The results show that the proposed method can perform better in case of a longer time span. The scalability of the proposed method is also discussed for a larger scale, e.g., a multi-microgrid system.

II. SYSTEM MODEL

A. System Configuration

Fig. 1. Microgrid configuration.

Fig. 1 describes the system configuration, which is comprised of an MG and a CBESS. The MG considered in this study consists of a controllable distributed generator (CDG), a renewable distributed generator (RDG), a battery energy storage system (BESS) as the energy storage device, and residential loads. The CBESS is a community battery energy storage system with different ownership from the MG [7]. The CBESS could be owned by any electrical company in a local electric power distribution network. The main purpose of the CBESS is to maximize its profit in normal mode and improve the reliability of the system during emergencies. In grid-connected mode, power can be traded among the MG, the CBESS, and the utility grid to minimize the total operation cost. In islanded mode, the CBESS and MG can trade power to minimize load shedding in the network. The MG is operated by an MG-EMS for minimizing its total operation cost. The MG-EMS communicates with the utility grid to get the buying/selling price signals and decides the amount of power to be traded with the utility grid and the CBESS in grid-connected mode. However, in islanded mode, the MG cannot trade power with the utility grid. Thus, the MG cooperatively operates with the CBESS to minimize the load shedding amount.

The operation of the CBESS is based on the double deep Q-learning method for both grid-connected and islanded modes. The detailed optimization model for the CBESS is explained in the following sections.
BUI et al.: DOUBLE DEEP Q-LEARNING-BASED DISTRIBUTED OPERATION OF BATTERY ENERGY STORAGE SYSTEM CONSIDERING UNCERTAINTIES 459
B. Backgrounds of Q-Learning and Deep Q-Learning Methods

In this section, the backgrounds of Q-learning and the deep Q-network are introduced.

1) Q-Learning: Q-learning is a popular model-free reinforcement learning method, where an agent explores the environment and finds the optimal way to maximize the cumulative reward. In Q-learning, the agent only needs to know its state space and possible actions. Each state-action pair is assigned a Q-value, which represents the quality of the action for the given state. When the agent carries out an action from the given state, it receives a reward and moves to a new state. The reward is used to update the Q-value using the Bellman equation [13], [17] as follows.

Q(s, a) ← Q(s, a) + α·[r + γ·max_{a'} Q(s', a') − Q(s, a)], s ← s' (1)

where:
Q(s, a): Q-value for the current state s and action a pair
α: learning rate of convergence
γ: discount factor of future rewards
r: immediate reward.

The agent explores the environment for a large number of episodes. In each episode, the agent starts from an initial state, performs several actions until reaching the goal state, and updates its knowledge (Q-values). The action choice is based on the ε-greedy policy, which is a way of selecting random actions with uniform distribution from a set of available actions [23], [24]. Using this policy, the agent can select an action with the current optimal policy with probability (1 − ε), or try other actions with probability ε in the hope of a better reward, resulting in improvement of the Q-values and, thereby, deriving a better optimal policy.

2) Deep Q-Learning: In the Q-learning method, a Q-table performs well in an environment where the state space is small. However, with an increase in the state space size, Q-learning becomes intractable [18]. Therefore, a model that maps the state information provided as input to Q-values of the possible set of actions should be developed. For this reason, a deep neural network (DNN) comes to play the role of a function approximator, which can take state information as input in the form of a vector, and learn to map it to Q-values for all possible actions. Since a DNN is being used as a function approximator of the Q function, this process is called deep Q-learning. The main objective is to minimize the distance between Q(s, a) and the TD-target (temporal difference target). This objective can be expressed as minimization of the square error loss function, as shown in (2).

L_i(θ_i) = E_{a∼μ}[(y_i − Q(s, a; θ_i))²],
where y_i := E_{a∼π}[r + γ·max_{a'} Q(s', a'; θ_{i−1}) | S_t = s, A_t = a] (2)

The TD-target y_i and Q(s, a) are estimated separately by two different neural networks, which are often called the Target- and Q-networks. The Q-values generated by this separate Target-network are used to compute the loss after every action taken by the agent during training time. The weight parameters θ_{i−1} belong to the Target-network and θ_i belong to the Q-network. The actions of the agent are selected according to the behavior policy μ(a|s). On the other side, the greedy target policy π(a|s) selects only the actions a' that maximize Q(s', a'), which is used to calculate the TD-target. This technique is called fixed Q-targets. In fact, the weights are updated only once in a while.

An important feature added to the deep Q-network is experience replay. It means that the agent can store its past experiences and use them in batches to train the DNN. Storing the experiences allows the agent to randomly draw batches, helping the network to learn from a variety of data instead of just formalizing decisions on immediate experiences. In order to avoid storage issues, the buffer of experience replay is fixed, and as new experiences get stored, the old experiences get removed. For training the deep neural network, uniform batches of random experiences are drawn from the buffer.

3) Double Deep Q-Learning: In standard DQN and Q-learning, the "max" operator uses the same weight values both to select and to evaluate an action [13], [17]. This makes it more likely to select overestimated values, resulting in overoptimistic value estimates. To prevent this, the selection should be decoupled from the evaluation. This is the idea behind the double deep Q-network (DDQN).

In DDQN, instead of taking the "max" over Q-values while computing the target Q-value during training, we use a primary network to choose the action and the Target-network to generate a target Q-value for that action. By decoupling the action choice from the target Q-value generation, it is possible to reduce the overestimation, which helps to train faster and more reliably. The TD-target is as follows:

y_i := E_{a∼μ}[r + γ·Q(s', argmax_{a'} Q(s', a'; θ_i); θ_{i−1}) | S_t = s, A_t = a] (3)

It can be observed that the Target-network with parameters θ_{i−1} evaluates the quality of the action, while the action itself is determined by the Q-network that has parameters θ_i. The calculation of the new TD-target y_i can be summarized in the following steps: the Q-network uses the next state s' to calculate the qualities Q(s', a') for each possible action a' in state s'. The "argmax" operation applied on Q(s', a') chooses the action a* that belongs to the highest quality. The quality Q(s', a*) is then determined by the Target-network.

Deep learning techniques require large amounts of training data, and it is challenging to generate the training data sets in many cases [13], [17]. However, in DDQN, the agent explores the environment and builds the data set by saving its experiences in a memory. In each episode, the training data set is randomly selected from the memory to optimize the DNN and estimate the Q-values. Therefore, generation of a separate training data set is not required for DDQN. Moreover, the agent always updates its memory with new experiences and keeps on training based on the continuous feedback to maximize the reward. In a real-time problem, the agent can use the Target-network to estimate the Q-values while the Q-network can be continuously trained to get more accurate decisions.
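The decoupled target in (3) can be sketched as follows. This is a minimal numpy illustration in which two small arrays of Q-values stand in for the outputs of the Q-network and the Target-network; all names and values are hypothetical, not the paper's implementation:

```python
import numpy as np

def ddqn_td_target(reward, q_online_next, q_target_next, gamma=0.95, done=False):
    """TD-target of (3): the online Q-network selects the action,
    the Target-network evaluates it."""
    if done:
        return reward
    a_star = int(np.argmax(q_online_next))          # selection: Q-network (theta_i)
    return reward + gamma * q_target_next[a_star]   # evaluation: Target-network (theta_{i-1})

# Hypothetical Q-values of the next state s' for actions [charge, idle, discharge]
q_online_next = np.array([1.0, 2.5, 2.0])   # Q(s', a; theta_i)
q_target_next = np.array([0.8, 1.9, 2.2])   # Q(s', a; theta_{i-1})
y = ddqn_td_target(reward=1.0, q_online_next=q_online_next, q_target_next=q_target_next)
# argmax of the online values picks action 1 (idle), so y = 1.0 + 0.95 * 1.9
```

Note that plain DQN would instead use max(q_target_next), i.e., action 2 here, which is exactly the overestimation path that DDQN avoids.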
C. A Comparison Between DDQN and Other Optimization Methods

Dynamic programming (DP) is a popular method for optimization problems [25]. DP solves for the optimal policy or value function by recursion. It requires knowledge of the Markov decision process (MDP) or a model of the environment so that the recursions can be carried out. Thus, it is lumped under "planning" instead of "learning", and its objective is to determine the optimal actions for the given MDP. If the model of the environment is not clearly known, the use of DP could be challenging.

On the other hand, reinforcement learning (RL) does not require knowledge of the model of the environment. It is iterative and simulation-based and learns by bootstrapping, i.e., the value of a state or action is estimated using the values of other states or actions. Various RL methods are also available in the recent literature, e.g., optimal policy RL [13], the path integral control approach to RL [26], and actor-critic (AC) RL [13], [27].

Among these methods, AC RL is similar to the DDQN used in this study. DDQN is categorized as a value function-based algorithm. An agent learns action values and then acts mostly greedily with respect to those values. The policy is inherently given by the value function itself. Actor-critic methods [27] are a type of policy gradient algorithm. The AC method implements generalized policy iteration, where the actor aims at improving the current policy and the critic evaluates the current policy. Two different neural networks, i.e., a critic network and an actor network, are used in the AC method. The critic uses an approximation architecture and simulation to learn a value function, which is then used to update the actor's policy parameters in a direction of performance improvement. However, AC also has some limitations, especially for the problem under consideration in this study. The two main advantages of using DDQN instead of the actor-critic algorithm in this study are as follows.

• The actor-critic method is useful for a continuous action space problem [27]. However, in this paper, a BESS model is presented with three possible actions, i.e., charge, discharge, and idle modes. DDQN is easier to implement for the stated problem.
• The DDQN method guarantees a global optimum. However, as the optimization of the actor is a gradient-based method, it is easy to get trapped in a local optimal result [27]. It takes a lot of training time to get a global optimal result.

To evaluate the performance of the proposed method, a comparison between two different learning methods is presented for operation of the CBESS in the numerical results section.

D. The Proposed DDQN-Based Operation of CBESS

In this study, the uncertainties in market price signals and in the amount of surplus/shortage power are considered for operation of the CBESS. Due to these uncertainties, the environment could be changed at any time. This significantly increases the number of states of the CBESS. Therefore, a deep neural network is used as a function approximator, which can take state information and learn to map it to Q-values for all possible actions. In this section, a DDQN-based distributed operation of the CBESS is introduced in detail to estimate the optimal action of the CBESS. The overall deep Q-learning principle diagram for the CBESS is summarized in Fig. 2.

Fig. 2. Double deep Q-learning principle diagram.

The CBESS agent starts from an initial state s. The CBESS carries out several actions, receives rewards according to the actions, and learns by replaying its experiences. The CBESS can be in charging, discharging, or idle mode in each state. Thus, the SOC of the CBESS can be increased, decreased, or kept the same as in the previous state depending on the chosen action. The rewards/penalties of performing an action for the CBESS are summarized in Algorithms 1 and 2 for grid-connected and islanded modes, respectively.

Algorithm 1: Reward/Penalty for CBESS in Grid-Connected Mode
  input: s = [interval, SOC, price ± u1]
  select an action a based on the ε-greedy policy
    with probability ε select a random action
    otherwise select a = argmax_{a'} Q(s, a')
  carry out action a
  Pcharge_max = min{Pmax·(SOCmax − SOC), Pcharramp_max}
  Pdischarge_max = min{Pmax·(SOC − SOCmin), Pdisramp_max}
  if a = "charge" then:
    if SOC = SOCmax then:
      r = a high penalty
    else:
      r = −Pcharge·price
    end
  else if a = "idle" then:
    r = 0
  else:
    if SOC = SOCmin then:
      r = a high penalty
    else:
      r = Pdischarge·price
    end
  end
  observe new state s' = [interval', SOC', price' ± u1] and r

1) Reward Function for CBESS: Algorithms 1 and 2 play an important role in determining the reward function for the CBESS in different operation modes. These functions define what the good/bad events for the CBESS are. They also map each state-action pair to a single number, a reward, indicating the intrinsic desirability of that state [13]. The CBESS's objective is to maximize the total reward it receives in the long run.
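The reward logic of Algorithm 1 can be sketched as a small Python function. The limit values, the penalty magnitude, and the parameter names here are illustrative stand-ins, not the authors' settings:

```python
def grid_reward(action, soc, price, p_charge, p_discharge,
                soc_min=0.2, soc_max=0.9, high_penalty=-1e3):
    """Reward/penalty of Algorithm 1 (grid-connected mode), as a sketch.

    Charging costs money (negative reward proportional to the price),
    discharging earns money, and attempting an infeasible action at an
    SOC limit is discouraged with a large penalty."""
    if action == "charge":
        return high_penalty if soc >= soc_max else -p_charge * price
    if action == "idle":
        return 0.0
    # discharge
    return high_penalty if soc <= soc_min else p_discharge * price

# e.g., discharging 0.2 p.u. at a price of 100 yields a reward of 20.0
r = grid_reward("discharge", soc=0.5, price=100.0, p_charge=0.0, p_discharge=0.2)
```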
Algorithm 2: Reward/Penalty for CBESS in Islanded Mode
  input: s = [interval, SOC, price ± u1, surplus ± u2, shortage ± u3]
  select an action a based on the ε-greedy policy
    with probability ε select a random action
    otherwise select a = argmax_{a'} Q(s, a')
  carry out action a
  Pcharge_max = min{Pmax·(SOCmax − SOC), surplus, Pcharramp_max}

Algorithm 3: DDQN-Based Operation Strategy for CBESS
  determine operation mode of CBESS
  initialize replay memory D and mini-batch size
  initialize action-value functions with random weights for the Q-network (θ) and the target-network (θ' = θ)
  for episode = 1, N do:
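The training loop outlined by Algorithm 3 (store transitions, sample a mini-batch from the replay memory, take a gradient step on the Q-network against the DDQN target, and periodically copy the weights into the Target-network) can be sketched as follows. The tiny linear Q-approximator, the random stand-in environment, and all hyperparameters are hypothetical placeholders for the paper's DNNs and MG environment:

```python
import random
from collections import deque

import numpy as np

rng = np.random.default_rng(0)
n_features, n_actions = 4, 3            # e.g., actions: charge / idle / discharge
theta = rng.normal(size=(n_features, n_actions))   # Q-network weights
theta_target = theta.copy()                        # Target-network weights
memory = deque(maxlen=1000)                        # replay memory D
gamma, alpha, sync_every = 0.95, 0.005, 25

def q_values(weights, state):
    return state @ weights                 # linear stand-in for the DNN

for step in range(100):
    s = rng.normal(size=n_features)        # stand-in state observation
    a = int(rng.integers(n_actions))       # stand-in for the epsilon-greedy choice
    r, s_next = float(rng.normal()), rng.normal(size=n_features)
    memory.append((s, a, r, s_next))       # store transition {s, a, r, s'}

    batch = random.sample(list(memory), min(32, len(memory)))
    for s_b, a_b, r_b, sn_b in batch:
        a_star = int(np.argmax(q_values(theta, sn_b)))           # select: Q-network
        y = r_b + gamma * q_values(theta_target, sn_b)[a_star]   # evaluate: Target
        td_error = y - q_values(theta, s_b)[a_b]
        theta[:, a_b] += alpha * td_error * s_b                  # gradient step on theta

    if (step + 1) % sync_every == 0:
        theta_target = theta.copy()        # reset Target-network parameters
```

In a real implementation the linear map would be replaced by the DNNs described in Section II-B, but the control flow (replay, decoupled target, periodic synchronization) is the same.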
error loss function with respect to the network parameters of the Q-network (θ). After a constant time step, all the parameters of the Target-network are replaced by the Q-network's parameters. Finally, with the optimal parameters, they can guide the CBESS in selecting an optimal action in a given state.

E. Operation Strategy for MG and CBESS

In this section, the detailed operation strategies for the MG and CBESS are presented, as shown in Figs. 3(a) and (b), respectively. These strategies are implemented for both grid-connected and islanded modes. In grid-connected mode, the MG is connected to the utility grid. The hourly day-ahead market price signals, the generation capabilities of the CDG, RDG, and BESS, along with the load profiles of the MG system, are taken as inputs. The MG-EMS receives this information and performs optimization for all participating components to minimize the total operation cost. The decision about surplus and shortage amounts is based on the generation cost of the CDG unit, the market trading prices, and the local load demands of the MG.

However, in islanded mode, the MG is disconnected from the utility grid. The main objective is to minimize the amount of load shedding in the MG system. The amount of surplus/shortage power is now only based on the load profiles and the available generated power. Finally, the MG-EMS will send its information of surplus and shortage power to external systems and wait for feedback information. The detailed process is summarized in Fig. 3(a).

On the CBESS side, the detailed DDQN-based operation strategy for the CBESS is presented in Fig. 3(b). Firstly, the weight parameters of the deep neural networks (DNNs) (i.e., the Q-network and Target-network) are randomly initialized, and the replay memory is also initialized with size D. In each episode, the agent starts from the initial state and selects an action based on the ε-greedy policy. The state of the CBESS is changed based on the operation mode. The agent carries out the action, receives a reward, and moves to a new state. This process is explained in detail in Algorithms 1 and 2. After finishing a transition from s to s', the transition information {s, a, r, s'} is stored in the replay memory D. A random mini-batch is sampled from memory D. Then the agent estimates the target y for each experience and performs an optimization step to minimize the square loss function by using the gradient descent method with respect to the Q-network's weights (θ). The weight parameters of the Q-network are updated after each training step, and the optimal network is used to select the action of the CBESS. After a constant number of steps, the weight parameters of the Target-network are reset to the Q-network's parameters. The Target-network is used to evaluate the quality of the selected action. Finally, the optimal action and the amount of charging/discharging power are determined by using the DNN with the optimal weight parameters. The overall process of the CBESS is summarized in Fig. 3(b).

After gathering the information from the CBESS and the utility grid, the MG-EMS decides the amount of buying/selling power from/to the utility grid and the CBESS and informs its components of the optimal results. However, in islanded mode, there is no connection to the utility grid. Load shedding could be implemented to maintain the power balance. After performing power sharing with the CBESS, the MG-EMS reschedules the operation of all the components based on the charging/discharging amount from the CBESS. Finally, a load shedding scheme is implemented for maintaining the power balance in the whole system in case of shortage power.

F. Operation of CBESS in a Multi-Microgrid System

In a large area, there may be several MGs. They are independently operated in normal operation. However, the uncertainties have a high effect on the operation of a single MG. The
1 CDG is on
ui,t = ∀i ∈ I, t ∈ T (8)
0 CDG is off
amount of load shedding could also be high during islanded
yi,t = max ui,t − ui,t−1 , 0 ∀i ∈ I, t ∈ T (9)
operation and may result in reduction of system reliability.
Recently, the concept of multi-microgrid (MMG) is introduced zi,t = max ui,t−1 − ui,t , 0 ∀i ∈ I, t ∈ T (10)
as a potential solution to overcome these problems [7], [28]. PCDG
i,t − PCDG
i,t−1 ≤ RUi · (1 − yi,t ) + max{RUi , Pmin
i } · yi,t ∀i ∈ I, t ∈ T
MMG system is formed by connecting several neighboring (11)
MGs to exchange power among MGs. In case a CBESS con-
PCDG
i,t−1 − PCDG
i,t ≤ RDi · (1 − zi,t ) + max{RDi , PCDG
i,t−1 } · zi,t ∀i ∈ I, t ∈ T
nects with an MMG system, as shown in Fig. 4. Our proposed
method should be extended to optimal operation of the whole (12)
system in a distributed manner.
The power balance between the power sources and power
The internal trading in the MMG system is the main con-
demands is given by (13). The buying/selling power is the
cern for this extension. A retailer agent is introduced to solve
amount of power trading with the external systems, which is
the problem of internal trading. It provides a dynamic internal
divided into trading with the utility grid or the CBESS, as
trading prices for MMG system to maximize the total internal
given in (14), (15), respectively.
trading amount. In order to optimize the dynamic internal trad-
Buy
ing prices, a DDQN-based decision making for the retailer
t + Pt
PPV + + Pt + PB−
WT
PCDG
i,t t
could be suitable for this issue based on the amount of i∈I
surplus/shortage power in each individual MG. Finally, the
= Pl,t + PSell
Load
t + PB+
t ∀t ∈ T (13)
amount of surplus/shortage power is updated after performing internal trading among MGs, as shown in (4), (5).

P_surplus,t = Σ_{n=1}^{N} P_surplus,n,t − Σ_{n=1}^{N} P^int_sell,n,t, ∀t ∈ T    (4)

P_shortage,t = Σ_{n=1}^{N} P_shortage,n,t − Σ_{n=1}^{N} P^int_buy,n,t, ∀t ∈ T    (5)

After solving the internal trading problem, the remaining surplus/shortage power can be traded with the CBESS and the utility grid. Thus, the problem becomes the same as the single-MG problem and can be solved by using the proposed method. The optimal model for dynamic internal trading prices in the MMG system is a suitable future extension of this study. In the next section, a detailed mathematical model for the MG system is presented.

G. Mathematical Model for Operation of MG

In this section, a mixed integer linear programming (MILP)-based formulation of the MG system is presented for both grid-connected and islanded modes. In grid-connected mode, the objective function (6) is to minimize the total operation cost associated with the fuel cost, the start-up/shut-down cost of CDGs, and the cost/benefit of purchasing/selling power from/to the utility grid.

Min { Σ_{i∈I} Σ_{t∈T} (C_i^CDG · P_{i,t}^CDG + y_{i,t} · C_i^SU + z_{i,t} · C_i^SD) + Σ_{t∈T} PR_t^Buy · P_t^Buy − Σ_{t∈T} PR_t^Sell · P_t^Sell }    (6)

The total amounts of sold and bought power are decomposed into trading with the utility grid and trading with the CBESS, as given by (14), (15).

P_t^Sell = P_t^Sell_Grid + P_t^Sell_CBESS, ∀t ∈ T    (14)

P_t^Buy = P_t^Buy_Grid + P_t^Buy_CBESS, ∀t ∈ T    (15)

The constraints related to the BESS are given by (16)-(20). Constraints (16), (17) give the maximum charging/discharging power of the BESS considering the ramp rates. The SOC is updated by (18) after charging/discharging power at each interval of time. Equation (19) shows that the value of SOC is set to the initial SOC at the first interval of time (t = 1). The operation bounds of the BESS are enforced by (20).

0 ≤ P_t^B+ ≤ min{ P_B^Cap · (SOC_max^B − SOC_{t−1}^B) · 1/(1 − L^B+), P_Ramp,max^B+ }, ∀t ∈ T    (16)

0 ≤ P_t^B− ≤ min{ P_B^Cap · (SOC_{t−1}^B − SOC_min^B) · (1 − L^B−), P_Ramp,max^B− }, ∀t ∈ T    (17)

SOC_t^B = SOC_{t−1}^B − (1/P_B^Cap) · ( P_t^B−/(1 − L^B−) − P_t^B+ · (1 − L^B+) ), ∀t ∈ T    (18)

SOC_{t−1}^B = SOC_ini^B, if t = 1    (19)

SOC_min^B ≤ SOC_t^B ≤ SOC_max^B, ∀t ∈ T    (20)

In the centralized approach for the CBESS, a battery energy management system (BEMS) is developed for maximizing its profit by optimal charging/discharging decisions. The objective function is expressed by (21).

Max Σ_{t∈T} ( PR_t^Sell · P_Grid,t^CB− − PR_t^Buy · P_Grid,t^CB+ )    (21)
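The SOC bookkeeping in (16)-(18) can be sketched in a few lines. The helper below is illustrative only: the function names and the example capacity, loss, and ramp figures are ours, not taken from the paper, and charging and discharging are assumed to be mutually exclusive within an interval.

```python
# Illustrative sketch of BESS constraints (16)-(18): per-interval
# charge/discharge bounds and the SOC update. All parameter values
# here are hypothetical, not the paper's Table I data.

def charge_bound(cap, soc_max, soc_prev, loss_ch, p_ramp_max):
    # Eq. (16): SOC headroom inflated by the charging loss,
    # capped by the ramp limit.
    return min(cap * (soc_max - soc_prev) / (1.0 - loss_ch), p_ramp_max)

def discharge_bound(cap, soc_prev, soc_min, loss_dis, p_ramp_max):
    # Eq. (17): energy above SOC_min shrunk by the discharging loss,
    # capped by the ramp limit.
    return min(cap * (soc_prev - soc_min) * (1.0 - loss_dis), p_ramp_max)

def soc_update(cap, soc_prev, p_ch, p_dis, loss_ch, loss_dis):
    # Eq. (18): discharging drains SOC faster than the delivered power
    # (divide by 1 - L^B-), charging adds less than the drawn power
    # (multiply by 1 - L^B+).
    return soc_prev - (p_dis / (1.0 - loss_dis) - p_ch * (1.0 - loss_ch)) / cap

# Example: a hypothetical 1000 kWh battery at 50% SOC charging at full bound.
cap, soc = 1000.0, 0.5
p_max = charge_bound(cap, soc_max=0.9, soc_prev=soc,
                     loss_ch=0.05, p_ramp_max=250.0)   # ramp-limited to 250 kW
soc = soc_update(cap, soc, p_ch=p_max, p_dis=0.0,
                 loss_ch=0.05, loss_dis=0.05)
```

Note that the loss terms make the bounds and the update mutually consistent: charging at exactly the bound in (16) lands the SOC on SOC_max after the update in (18).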
Authorized licensed use limited to: Motilal Nehru National Institute of Technology. Downloaded on October 04,2020 at 07:42:10 UTC from IEEE Xplore. Restrictions apply.
464 IEEE TRANSACTIONS ON SMART GRID, VOL. 11, NO. 1, JANUARY 2020
Fig. 5. Test MG system: based on the CERTS MG test-bed [33].

In islanded mode, the objective function (22) minimizes the total operation cost associated with the fuel and start-up/shut-down costs of CDGs together with the penalty cost for load shedding.

Min Σ_{i∈I} Σ_{t∈T} ( C_i^CDG · P_{i,t}^CDG + y_{i,t} · C_i^SU + z_{i,t} · C_i^SD ) + Σ_{t∈T} C_t^pen · P_t^Short    (22)

P_t^PV + P_t^WT + Σ_{i∈I} P_{i,t}^DG + P_t^B− = Σ_{l∈L} P_{l,t}^Load + P_t^B+ + P_t^Sur − P_t^Short, ∀t ∈ T    (23)

P_max,t^Sur = P_t^PV + P_t^WT + Σ_{i∈I} P_{i,max}^DG + P_t^B− − Σ_{l∈L} P_{l,t}^Load − P_t^Short − P_t^B+, ∀t ∈ T    (24)

Similarly, in the centralized approach, the objective of the BEMS is to minimize the amount of load shedding in the MG system. The objective function is presented by (25).

Min Σ_{t∈T} P_t^Short · pen_t^Short    (25)

In this paper, a distributed operation strategy using DDQN is proposed for optimal operation of the CBESS. In both operation modes, the objective of the CBESS is to maximize its profit by trading power with the MG and the utility grid. However, the charging/discharging bounds of the CBESS are different in each operation mode. In grid-connected mode, the amount of charging/discharging for the CBESS depends entirely on the market price signals. The maximum charging/discharging power of the CBESS is given by (26), (27), considering the ramp-rate constraints.

0 ≤ P_t^CB+ ≤ min{ P_CB^Cap · (SOC_max^CB − SOC_{t−1}^CB) · 1/(1 − L^CB+), P_Ramp,max^CB+ }, ∀t ∈ T    (26)

0 ≤ P_t^CB− ≤ min{ P_CB^Cap · (SOC_{t−1}^CB − SOC_min^CB) · (1 − L^CB−), P_Ramp,max^CB− }, ∀t ∈ T    (27)

On the other hand, in islanded mode, the CBESS can only charge surplus power from the MG and discharge to the MG during intervals having shortage power. Constraints (28), (29) show the maximum bounds for the charging/discharging amount at each interval of time, considering the ramp rates. These constraints also ensure that charging is only possible when the MG has surplus power, while discharging is only possible when the MG has shortage power.

0 ≤ P_t^CB+ ≤ min{ P_CB^Cap · (SOC_max^CB − SOC_{t−1}^CB) · 1/(1 − L^CB+), P_t^Sur, P_Ramp,max^CB+ }, ∀t ∈ T    (28)

0 ≤ P_t^CB− ≤ min{ P_CB^Cap · (SOC_{t−1}^CB − SOC_min^CB) · (1 − L^CB−), P_t^Short, P_Ramp,max^CB− }, ∀t ∈ T    (29)

H. Scheduling and Optimal Power Flow

The operation of a power system is generally carried out in two steps, i.e., optimal scheduling and optimal power flow. In the first step, the commitment statuses of the components are determined, including the power exchange with the main grid. In the second step, the status of the components is shared with the optimal power flow problem and is allowed to deviate minutely from the set values. In this step, the network losses are determined and sent back to the first-step problem. The scheduling problem is then solved again, including the network losses. This process is repeated iteratively until convergence is achieved.

However, an MG is a localized, small-scale power system and the length of its lines is short. Therefore, the resistance of the MG lines between two buses is very small and the power line losses are usually negligible [29], [30]. Therefore, power losses are not considered in the power flow in this study, and the network constraints, e.g., constraints on voltages and currents, are assumed to be fulfilled [31], [32]. A detailed analysis of the operation of the CBESS considering the optimal power flow and network constraints could be a suitable extension of this study for a larger system, e.g., a multi-microgrid system.

III. NUMERICAL RESULTS

A. Input Data

Fig. 5 shows a test MG system based on the CERTS MG test-bed [33] considering the uncertainty of the environment, e.g.,
BUI et al.: DOUBLE DEEP Q-LEARNING-BASED DISTRIBUTED OPERATION OF BATTERY ENERGY STORAGE SYSTEM CONSIDERING UNCERTAINTIES 465
TABLE I
THE DETAILED PARAMETERS OF BESS, CBESS, AND CDG
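CBESS parameters of the kind listed in Table I enter the islanded-mode bounds (28), (29), where charging is additionally capped by the MG's surplus and discharging by its shortage. A minimal sketch follows; the function names and the numbers are ours for illustration, not the values from Table I.

```python
# Illustrative sketch of the islanded-mode CBESS bounds (28)-(29):
# besides the capacity and ramp limits, charging is capped by the
# MG's surplus power and discharging by its shortage power.

def cbess_charge_bound(cap, soc_max, soc_prev, loss_ch, p_sur, p_ramp_max):
    # Eq. (28): cannot charge more than the available surplus.
    return min(cap * (soc_max - soc_prev) / (1.0 - loss_ch),
               p_sur, p_ramp_max)

def cbess_discharge_bound(cap, soc_prev, soc_min, loss_dis, p_short, p_ramp_max):
    # Eq. (29): cannot discharge more than the shortage to be filled.
    return min(cap * (soc_prev - soc_min) * (1.0 - loss_dis),
               p_short, p_ramp_max)

# With no surplus the charging bound collapses to zero; discharging
# is limited to the (hypothetical) 120 kW shortage of the interval.
p_ch_max = cbess_charge_bound(2000.0, 0.9, 0.4, 0.05,
                              p_sur=0.0, p_ramp_max=500.0)
p_dis_max = cbess_discharge_bound(2000.0, 0.4, 0.1, 0.05,
                                  p_short=120.0, p_ramp_max=500.0)
```

The extra arguments in the two `min` terms are exactly what distinguishes (28), (29) from the grid-connected bounds (26), (27).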
Fig. 8. (a) Total reward for CBESS during 80000 episodes; (b) Total rewards for CBESS during the last 100 episodes.
TABLE II
TOTAL OPERATION COST OF MG AND TOTAL PROFIT OF CBESS IN GRID-CONNECTED MODE
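The results in Table II and the training runs discussed below rely on the double-DQN update of [22], which decouples action selection from action evaluation. A minimal sketch of the target computation is given here as a reminder of the mechanism; the Q-vectors and discount factor are made up for illustration and are not from the paper.

```python
# Minimal sketch of the double-DQN target (Van Hasselt et al. [22]):
# the online network selects the next action, the target network
# evaluates it. All numbers below are made up for illustration.

def ddqn_target(reward, q_online_next, q_target_next, gamma=0.95, done=False):
    if done:
        return reward
    # Selection: argmax over the online network's Q-values.
    a_star = max(range(len(q_online_next)), key=lambda a: q_online_next[a])
    # Evaluation: the target network's value of the selected action.
    return reward + gamma * q_target_next[a_star]

q_online = [1.0, 2.5, 2.4]   # online net prefers action 1
q_target = [1.1, 2.0, 3.0]   # target net rates action 2 highest
y = ddqn_target(reward=0.5, q_online_next=q_online, q_target_next=q_target)
# Here y = 0.5 + 0.95 * 2.0 = 2.4, whereas the plain-DQN target,
# 0.5 + 0.95 * max(q_target) = 3.35, is biased upward.
```

Using two networks this way is what mitigates the overestimation mentioned in the abstract and conclusion.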
C. Operation of the System in Islanded Mode

In islanded mode, the MG cannot trade power with the utility grid. The CBESS plays an important role in the operation of the MG by reducing the load shedding amount. The CBESS is used to shift the surplus power to other intervals having shortage power, i.e., the CBESS charges the surplus power and discharges to fulfill the shortage power. The operation of all components in the MG system is shown in Fig. 13. The MG-EMS controls these components to fulfill the load demand and reduces the load shedding amount during some peak load intervals. The detailed amount of surplus/shortage power of the MG in each interval of time is shown in Fig. 14. The trading price signals are shown in Fig. 15; the price is decided based on the percentage of critical load in each interval. In islanded mode, we assume that the uncertainties of the trading price signals and the surplus/shortage amount are ±5% and ±10%, respectively.

Due to the increase in the number of uncertainty parameters, the number of training episodes is also increased to improve the accuracy of the training model. In this study, the model is trained for 100000 episodes for two cases: without uncertainty (case 1) and with uncertainty (case 2). Fig. 16 shows the total reward of the CBESS during the last 100 episodes of the training time. The total reward of the CBESS after 100000 episodes converges at an optimal value in case 1. On the other hand, in case 2, the total reward of the CBESS can also converge at optimal values based on the real values of the trading price signals and the surplus and shortage amounts. In this case, the total reward could be lower or higher than the total reward in case 1 due to the changes in market price signals (±5%) and surplus/shortage power (±10%). The accuracies in both cases after the training time are summarized in Fig. 17. It can be observed that the accuracies in both cases are acceptable (≈1) for operation of the CBESS. In case 2, since the trading price signals and surplus/shortage amount are randomly changed in each step between the upper and lower bounds, the state space is much bigger than in case 1. Therefore, the training accuracy in case 1 converges faster than in case 2.

By using the optimal DNNs, the operation of the CBESS obtained by the proposed method and by the centralized approach is shown in Fig. 18. It can be observed that the CBESS is operated in an optimal way by both methods. The CBESS decides to charge the surplus power in lower-price intervals and discharges to fulfill the shortage power in high-price intervals. The load shedding amount of the MG system can also be significantly reduced by using the CBESS, as shown in Fig. 19. Without the CBESS, load shedding is performed during almost all peak load intervals (intervals 7-10 and 15-19). With the CBESS, the operation of the CBESS is determined by
using the optimal DNN, the amount of load shedding is only high in intervals 7, 15, and 19.

Fig. 19. Load shedding amount in the MG system.

Fig. 20. (a) Market price signals; (b) Optimal operation of CBESS during the 48-hour scheduling horizon.

TABLE III
TOTAL OPERATION COST OF MG AND TOTAL PROFIT OF CBESS IN ISLANDED MODE

Similar to the grid-connected mode, thanks to the optimal DNN, the CBESS can also adjust its decisions in this operation mode based on the real-time data of the amount of surplus/shortage power. This leads to a reduction in the load shedding amount in the MG compared with other conventional methods (i.e., MILP, Q-learning) and improves the system reliability. Finally, the total operation cost of the MG and the total profit of the CBESS in islanded mode are summarized in Table III.

D. Operation of the CBESS System in a 48-Hour Time Span

In order to show the effectiveness of the proposed method for a longer time span, an extension of this study is presented for a 48-hour scheduling of the CBESS using DDQN. The operation of the CBESS is analyzed in both grid-connected and islanded modes. The results are also compared with the day-ahead scheduling (24-hour time span). The market price signals for 48 hours are shown in Fig. 20(a). In grid-connected mode, the objective is to maximize the total profit of the CBESS during the scheduling horizon. Therefore, in the 24-hour time span, the CBESS is always discharged to the minimum operation bound at the end of the day. However, in a 48-hour time span, the CBESS is trained to maximize its total profit during that scheduling horizon, as shown in Fig. 20(b). It can be observed that the CBESS charges power at the end of the first day at a lower price and uses it during the next day to reduce the operation cost.

Similarly, in islanded mode, the CBESS is trained to minimize the load shedding during the 48-hour scheduling horizon. By using the proposed method, the amount of surplus power during the last intervals of the first day can be charged to the CBESS and used to fulfill the shortage power on the next day. Therefore, the load shedding amount could also be reduced with a longer scheduling horizon.

The results for the longer scheduling horizon (48 hours) are better than those for the shorter scheduling horizon (24 hours). However, the results also depend on the forecasted data and the uncertainties in the system. The accuracy of the forecasted data could be reduced when considering a longer time period, which increases the uncertainties in the system. That could worsen the system reliability and operation cost, as mentioned in the previous sections. Moreover, the complexity of the system also increases in terms of the training process whenever any event occurs in the system. Therefore, a day-ahead scheduling (24-hour time span) is proposed to preserve the higher reliability of the system.

IV. CONCLUSION

In this paper, a distributed operation strategy has been developed for managing the operation of a CBESS in an MG system using the double deep Q-learning method. The MG system is comprised of an MG and a CBESS, where an MG-EMS was used for optimizing the operation of the MG. In contrast to Q-learning, the proposed operation strategy was capable of dealing with uncertainties in the system in both grid-connected and islanded modes. Moreover, by decoupling the selection and evaluation of an action, the proposed method reduced the overestimation. Finally, a comparison between the proposed strategy and other methods has been presented for showing the effectiveness of the proposed method. The CBESS
can optimally work with the proposed operation strategy with a large number of episodes. The CBESS accurately determined the optimal operation, like the centralized method, in both grid-connected and islanded modes.

REFERENCES

[1] C. Chen, S. Duan, T. Cai, B. Liu, and G. Hu, “Smart energy management system for optimal microgrid economic operation,” IET Renew. Power Gener., vol. 5, no. 3, pp. 258–267, May 2011.
[2] Q. Jiang, M. Xue, and G. Geng, “Energy management of microgrid in grid-connected and stand-alone modes,” IEEE Trans. Power Syst., vol. 28, no. 3, pp. 3380–3389, Aug. 2013.
[3] F. Katiraei, R. Iravani, N. Hatziargyriou, and A. Dimeas, “Microgrid management,” IEEE Power Energy Mag., vol. 6, no. 3, pp. 54–65, May/Jun. 2008.
[4] W. Su and J. Wang, “Energy management systems in microgrid operations,” Elect. J., vol. 25, no. 8, pp. 45–60, Oct. 2012.
[5] D. E. Olivares, C. A. Cañizares, and M. Kazerani, “A centralized energy management system for isolated microgrids,” IEEE Trans. Smart Grid, vol. 5, no. 4, pp. 1864–1875, Jul. 2014.
[6] L. Meng, E. R. Sanseverino, A. Luna, T. Dragicevic, J. C. Vasquez, and J. M. Guerrero, “Microgrid supervisory controllers and energy management systems: A literature review,” Renew. Sustain. Energy Rev., vol. 60, pp. 1263–1273, Jul. 2016.
[7] V.-H. Bui, A. Hussain, and H.-M. Kim, “A multiagent-based hierarchical energy management strategy for multi-microgrids considering adjustable power and demand response,” IEEE Trans. Smart Grid, vol. 9, no. 2, pp. 1323–1333, Mar. 2018.
[8] V.-H. Bui, A. Hussain, and H.-M. Kim, “Diffusion strategy-based distributed operation of microgrids using multiagent system,” Energies, vol. 10, no. 7, p. 903, Jul. 2017.
[9] Y. Xu and Z. Li, “Distributed optimal resource management based on the consensus algorithm in a microgrid,” IEEE Trans. Ind. Electron., vol. 62, no. 4, pp. 2584–2592, Apr. 2015.
[10] E. C. Kara, M. Berges, B. Krogh, and S. Kar, “Using smart devices for system-level management and control in the smart grid: A reinforcement learning framework,” in Proc. IEEE Smart Grid Commun. (SmartGridComm), Nov. 2012, pp. 85–90.
[11] A. L. Dimeas and N. D. Hatziargyriou, “Multi-agent reinforcement learning for microgrids,” in Proc. IEEE Power Energy Soc. Gen. Meeting, Jul. 2010, pp. 1–8.
[12] G. K. Venayagamoorthy, R. K. Sharma, P. K. Gautam, and A. Ahmadi, “Dynamic energy management system for a smart microgrid,” IEEE Trans. Neural Netw. Learn. Syst., vol. 27, no. 8, pp. 1643–1656, Aug. 2016.
[13] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. Cambridge, MA, USA: MIT Press, Oct. 2018.
[14] L. Panait and S. Luke, “Cooperative multi-agent learning: The state of the art,” Auton. Agents Multi Agent Syst., vol. 11, no. 3, pp. 387–434, Nov. 2005.
[15] E. Kuznetsova, Y.-F. Li, C. Ruiz, E. Zio, G. Ault, and K. Bell, “Reinforcement learning for microgrid energy management,” Energy, vol. 59, pp. 133–146, Sep. 2013.
[16] F.-D. Li, M. Wu, Y. He, and X. Chen, “Optimal control in microgrid using multi-agent reinforcement learning,” ISA Trans., vol. 51, no. 6, pp. 743–751, Nov. 2012.
[17] S. Dutta, Reinforcement Learning With TensorFlow: A Beginner's Guide to Designing Self-Learning Systems With TensorFlow and OpenAI Gym. Birmingham, U.K.: Packt, Apr. 2018, pp. 1–336.
[18] V.-H. Bui, A. Hussain, and H.-M. Kim, “Q-learning-based operation strategy for community battery energy storage system (CBESS) in microgrid system,” Energies, vol. 12, no. 9, pp. 1789–1806, May 2019.
[19] V. Mnih et al., “Playing Atari with deep reinforcement learning,” arXiv preprint arXiv:1312.5602, Dec. 2013. [Online]. Available: https://arxiv.org/pdf/1312.5602.pdf
[20] E. Mocanu et al., “On-line building energy optimization using deep reinforcement learning,” IEEE Trans. Smart Grid, vol. 10, no. 4, pp. 3698–3708, Jul. 2019.
[21] V. Mnih et al., “Asynchronous methods for deep reinforcement learning,” in Proc. Int. Conf. Mach. Learn., Jun. 2016, pp. 1928–1937.
[22] H. Van Hasselt, A. Guez, and D. Silver, “Deep reinforcement learning with double Q-learning,” in Proc. AAAI, vol. 2, Feb. 2016, p. 5.
[23] M. Wiering and M. van Otterlo, Reinforcement Learning: State-of-the-Art. Berlin, Germany: Springer-Verlag, 2012.
[24] Y. Mohan, S. G. Ponnambalam, and J. I. Inayat-Hussain, “A comparative study of policies in Q-learning for foraging tasks,” in Proc. IEEE Nat. Biol. Inspired Comput., Dec. 2009, pp. 134–139.
[25] D. P. Bertsekas, Dynamic Programming and Optimal Control, vol. 1. Belmont, MA, USA: Athena Sci., Jan. 1995.
[26] E. Theodorou, J. Buchli, and S. Schaal, “A generalized path integral control approach to reinforcement learning,” J. Mach. Learn. Res., vol. 11, pp. 3137–3181, Nov. 2010.
[27] V. R. Konda and J. N. Tsitsiklis, “On actor-critic algorithms,” SIAM J. Control Optim., vol. 42, no. 4, pp. 1143–1166, Apr. 2003.
[28] S. A. Arefifar, M. Ordonez, and Y. A.-R. I. Mohamed, “Energy management in multi-microgrid systems—Development and assessment,” IEEE Trans. Power Syst., vol. 32, no. 2, pp. 910–922, Mar. 2017.
[29] A. Hussain, V.-H. Bui, and H.-M. Kim, “Robust optimal operation of AC/DC hybrid microgrids under market price uncertainties,” IEEE Access, vol. 6, pp. 2654–2667, 2018.
[30] T. Dragičević, J. M. Guerrero, J. C. Vasquez, and D. Škrlec, “Supervisory control of an adaptive-droop regulated DC microgrid with battery management capability,” IEEE Trans. Power Electron., vol. 29, no. 2, pp. 695–706, Feb. 2014.
[31] S. Parhizi, A. Khodaei, and M. Shahidehpour, “Market-based versus price-based microgrid optimal scheduling,” IEEE Trans. Smart Grid, vol. 9, no. 2, pp. 615–623, Mar. 2018.
[32] A. Khodaei, “Microgrid optimal scheduling with multi-period islanding constraints,” IEEE Trans. Power Syst., vol. 29, no. 3, pp. 1383–1392, May 2014.
[33] R. Bayindir, E. Hossain, E. Kabalci, and K. M. M. Billah, “Investigation on North American microgrid facility,” Int. J. Renew. Energy Res., vol. 5, no. 2, pp. 558–574, Jun. 2015.
[34] X. Li, D. Hui, and X. Lai, “Battery energy storage station (BESS)-based smoothing control of photovoltaic (PV) and wind power generation fluctuations,” IEEE Trans. Sustain. Energy, vol. 4, no. 2, pp. 464–473, Apr. 2013.
[35] M. Abadi et al., “TensorFlow: A system for large-scale machine learning,” in Proc. OSDI, vol. 16, Nov. 2016, pp. 265–283.
[36] IBM ILOG CPLEX V12.6 User's Manual for CPLEX 2015, CPLEX Division, ILOG, Incline Village, NV, USA, 2015.
[37] V.-H. Bui, A. Hussain, and H.-M. Kim, “Optimal operation of microgrids considering auto-configuration function using multiagent system,” Energies, vol. 10, no. 10, pp. 1484–1500, Sep. 2017.
[38] A. Hussain, V.-H. Bui, and H.-M. Kim, “A resilient and privacy-preserving energy management strategy for networked microgrids,” IEEE Trans. Smart Grid, vol. 9, no. 3, pp. 2127–2139, May 2018.

Van-Hai Bui (S'16) received the B.S. degree in electrical engineering from the Hanoi University of Science and Technology, Vietnam, in 2013. He is currently pursuing the combined master's and Ph.D. degrees with the Department of Electrical Engineering, Incheon National University, South Korea. His research interests include microgrid operation and energy management system.

Akhtar Hussain (S'14) received the B.E. degree in telecommunications from the National University of Sciences and Technology, Pakistan, in 2011, and the M.S. degree in electrical engineering from Myongji University, Yongin, South Korea, in 2014. He is currently pursuing the Ph.D. degree with Incheon National University, South Korea. He was an Associate Engineer with SANION, an IED development company, South Korea, from 2014 to 2015. His research interests are distribution automation and protection, smart grids, and microgrid optimization.

Hak-Man Kim (SM'15) received the first Ph.D. degree in electrical engineering from Sungkyunkwan University, South Korea, in 1998, and the second Ph.D. degree in information sciences from Tohoku University, Japan, in 2011. He was with Korea Electrotechnology Research Institute, South Korea, from 1996 to 2008. He is currently a Professor with the Department of Electrical Engineering, Incheon National University, South Korea. His research interests include microgrid operation and control and dc power systems.