
Double Deep Q-Learning-Based Distributed Operation of Battery Energy Storage System Considering Uncertainties

Van-Hai Bui, Student Member, IEEE, Akhtar Hussain, Student Member, IEEE, and Hak-Man Kim, Senior Member, IEEE

Abstract—Q-learning-based operation strategies have recently been applied to the optimal operation of energy storage systems, where a Q-table is used to store Q-values for all possible state-action pairs. However, Q-learning faces challenges in large state space problems, i.e., continuous state space problems or problems with environment uncertainties. In order to address the limitations of Q-learning, this paper proposes a distributed operation strategy using the double deep Q-learning method. It is applied to managing the operation of a community battery energy storage system (CBESS) in a microgrid system. In contrast to Q-learning, the proposed operation strategy is capable of dealing with uncertainties in the system in both grid-connected and islanded modes. This is due to the utilization of a deep neural network as a function approximator to estimate the Q-values. Moreover, the proposed method can mitigate the overestimation that is the major drawback of standard deep Q-learning. The proposed method trains the model faster by decoupling the selection and evaluation processes. Finally, the performance of the proposed double deep Q-learning-based operation method is evaluated by comparing its results with a centralized approach-based operation.

Index Terms—Artificial intelligence, battery energy storage system, double deep Q-learning-based operation, microgrid operation, Q-learning, optimization.

Manuscript received January 16, 2019; revised April 1, 2019 and May 9, 2019; accepted June 14, 2019. Date of publication June 20, 2019; date of current version December 23, 2019. This work was supported by the Korea Institute of Energy Technology Evaluation and Planning and the Ministry of Trade, Industry and Energy of South Korea under Grant 20168530050030. Paper no. TSG-00082-2019. (Corresponding author: Hak-Man Kim.) The authors are with the Department of Electrical Engineering, Incheon National University, Incheon 402-749, South Korea (e-mail: hmkim@inu.ac.kr). Digital Object Identifier 10.1109/TSG.2019.2924025

I. INTRODUCTION

ENERGY management systems (EMSs) have been widely applied for the operation of microgrids (MGs) in both grid-connected and islanded modes. The major responsibility of the EMS is to optimize the operation of the microgrid's resources with a defined objective [1], [2]. There are two fundamentally different approaches to the design of EMS frameworks, namely centralized and decentralized approaches. In the centralized approach, a centralized EMS is provided with the relevant information of all the distributed energy resources within the MG [3]–[5]. However, with an increase in the number of control devices, the computational burden on the central EMS and the communication complexity of the network increase. Moreover, centralized management entails a single point of failure: a fault in the central unit will cause the breakdown of the entire system. Low flexibility/expandability is another critical limitation of centralized management [6]. In addition, all entities in the system are interconnected and may have their own operation objectives; therefore, a centralized EMS could have practical limitations for the energy scheduling of MGs having diverse objectives and ownerships [7]. Consequently, decentralized energy management systems have recently been gaining popularity due to their ability to overcome the limitations of centralized EMSs [6], [8], [9]. A diffusion strategy-based distributed operation of MGs has been proposed to minimize the total operation cost by using a multiagent system [8]. A fully distributed control strategy has been proposed based on the consensus algorithm for the optimal resource management in an islanded MG [9]. However, these methods require an exact mathematical model for operation of the MG system, and it could be challenging to develop exact mathematical models for complex systems. Moreover, these methods are iteration-based optimizations, i.e., agents need to exchange information several times to achieve an optimal result. This type of iterative operation could be problematic if there is any change in the system parameters: these methods could take a long time to find a new optimal point and might not be suitable for real-time optimization problems.

To overcome the disadvantages of the conventional decentralized approaches, reinforcement learning (RL)-based decentralized energy management has been introduced as a potential solution for optimal operation of MGs [10]–[12]. In RL, agents learn to achieve a given task by interacting with their environment. Since the agents do not require any model of the environment, they only need to know the existing states and the possible actions in each state. This method drives the learning process on the basis of penalties or rewards assessed on a sequence of actions taken in response to the environment dynamics [13], [14].

Q-learning, a popular RL method, is widely used for optimal operation of MGs [15], [16]. In Q-learning, a Q-table is used to represent the knowledge of the agent about the environment. The Q-value for each state-action pair reflects the future reward associated with taking that action in that state. However, Q-learning-based operation methods are only suitable for problems with a small state space. They are not suitable for a continuous state space or for an environment with uncertainties [17], [18]. Therefore, a model which maps the state information provided as input to the Q-values of the possible set of actions should be developed. For this reason, a deep neural network (DNN) plays the role of a function approximator, which can take state information as input and learn to map it to Q-values for all possible actions. Combining Q-learning with a DNN leads to a method called deep Q-learning or deep Q-networks (DQN), which enhances the performance of Q-learning for large-scale problems [19]–[21]. In DQN, the same weight values are used for both selection and evaluation of an action, which can lead to overoptimistic value estimates [17], [22]. To mitigate this problem, double deep Q-learning (DDQN) has been developed to decouple the selection from the evaluation. In DDQN, a primary network is used to choose the action and a target network is used to generate a target Q-value for that action [22]. Therefore, a DDQN-based operation is suitable for an agent with a large state space, and the agent can avoid being trapped in a local optimum.

Due to these advantages of DDQN, a distributed operation strategy for managing the operation of a community battery energy storage system (CBESS) in an MG system is proposed using DDQN. The MG system is comprised of an MG and a CBESS, where a microgrid EMS (MG-EMS) is used for optimizing the operation of the MG and a DDQN-based operation is used for the CBESS. In grid-connected mode, the actions of the CBESS are only dependent on the market price signals; the effect of uncertainty in the market price signals is analyzed in this mode. On the other hand, in islanded mode, the actions of the CBESS depend on both the trading price and the surplus/shortage amount of the MG. Due to the uncertainty of load and renewable resources in the MG, the uncertainties of surplus and shortage power of the MG are also considered in this study. In contrast to Q-learning, the proposed operation strategy is capable of dealing with uncertainties in the system in both grid-connected and islanded modes. Moreover, DDQN decouples the selection and evaluation of an action, thus the CBESS can avoid being trapped in local optima. Finally, in order to analyze the effectiveness of the proposed DDQN-based optimization method, the operation results of the proposed method are compared with the conventional centralized EMS results and the results of other learning methods. Simulation results have proved that the proposed method can obtain results similar to those of a centralized EMS, despite being a decentralized approach. The major contributions of this study are listed as follows.

• A distributed operation strategy is proposed for the CBESS using the DDQN method in both grid-connected and islanded modes with different objectives. Due to the use of different DNNs, the CBESS can be continuously trained with a new data set in real-time operation to improve the decision accuracy.
• The proposed operation strategy is capable of handling the system uncertainties in the environment. The charging/discharging decisions of the CBESS are adjusted with the actual data by using a target-network with optimal parameters.
• A comparison between the proposed method and other methods is presented to evaluate the performance of the proposed algorithm. Simulation results have proved that the proposed method can obtain results similar to those of a centralized EMS, despite being a decentralized approach.
• The performance of the proposed method is also tested for a longer time span (48-hour scheduling horizon). The results show that the proposed method can perform better in case of a longer time span. The scalability of the proposed method is also discussed for a larger scale, e.g., a multi-microgrid system.

II. SYSTEM MODEL

A. System Configuration

Fig. 1. Microgrid configuration.

Fig. 1 describes the system configuration, which is comprised of an MG and a CBESS. The MG considered in this study consists of a controllable distributed generator (CDG), a renewable distributed generator (RDG), a battery energy storage system (BESS) as the energy storage device, and residential loads. The CBESS is a community battery energy storage system with different ownership from the MG [7]. The CBESS could be owned by any electrical company in a local electric power distribution network. The main purpose of the CBESS is to maximize its profit in normal mode and to improve the reliability of the system during emergencies. In grid-connected mode, power can be traded among the MG, the CBESS, and the utility grid to minimize the total operation cost. In islanded mode, the CBESS and the MG can trade power to minimize load shedding in the network. The MG is operated by an MG-EMS for minimizing its total operation cost. The MG-EMS communicates with the utility grid to get the buying/selling price signals and decides the amount of power to be traded with the utility grid and the CBESS in grid-connected mode. However, in islanded mode, the MG cannot trade power with the utility grid. Thus, the MG cooperatively operates with the CBESS to minimize the load shedding amount.

The operation of the CBESS is based on the double deep Q-learning method for both grid-connected and islanded modes. The detailed optimization model for the CBESS is explained in the following sections.
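
For readers who prefer code, the structure below is a minimal Python sketch (ours, not the authors') of how the two operation modes and the CBESS state vector described here and formalized later in Section II-D could be represented. The field names and the dataclass layout are illustrative assumptions only.

```python
from dataclasses import dataclass
from enum import Enum

class OperationMode(Enum):
    GRID_CONNECTED = "grid_connected"
    ISLANDED = "islanded"

@dataclass
class CBESSState:
    """State observed by the CBESS agent (cf. Section II-D).

    In grid-connected mode only (interval, soc, price) are used;
    in islanded mode the MG's surplus and shortage power are appended.
    """
    interval: int          # hour of the scheduling horizon, t = 1..24
    soc: float             # state of charge of the CBESS, in [SOC_min, SOC_max]
    price: float           # market/trading price signal at this interval
    surplus: float = 0.0   # surplus power of the MG (islanded mode only)
    shortage: float = 0.0  # shortage power of the MG (islanded mode only)

    def as_vector(self, mode: OperationMode) -> list:
        """Return the feature vector fed to the Q-network."""
        if mode is OperationMode.GRID_CONNECTED:
            return [self.interval, self.soc, self.price]
        return [self.interval, self.soc, self.price, self.surplus, self.shortage]

# Example state for hour 5 in grid-connected mode.
s = CBESSState(interval=5, soc=0.45, price=95.0)
print(s.as_vector(OperationMode.GRID_CONNECTED))
```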


B. Backgrounds of Q-Learning and Deep Q-Learning Methods

In this section, the backgrounds of Q-learning and deep Q-networks are introduced.

1) Q-Learning: Q-learning is a popular model-free reinforcement learning method, where an agent explores the environment and finds the optimal way to maximize the cumulative reward. In Q-learning, the agent only needs to know its state space and possible actions. Each state-action pair is assigned a Q-value, which represents the quality of the action for the given state. When the agent carries out an action from the given state, it receives a reward and moves to a new state. The reward is used to update the Q-value using the Bellman equation [13], [17] as follows:

$$Q(s,a) \leftarrow Q(s,a) + \alpha\Big[r + \gamma \max_{a'} Q(s',a') - Q(s,a)\Big], \qquad s \leftarrow s' \tag{1}$$

where:
Q(s, a): Q-value for the current state s and action a pair
α: learning rate of convergence
γ: discount factor of future rewards
r: immediate reward.

The agent explores the environment for a large number of episodes. In each episode, the agent starts from an initial state, performs several actions until reaching the goal state, and updates its knowledge (Q-values). The action is chosen based on the ε-greedy policy, which is a way of selecting random actions with uniform distribution from a set of available actions [23], [24]. Using this policy, the agent selects an action following the current optimal policy with probability (1−ε) or tries other actions with probability ε in the hope of a better reward, resulting in improved Q-values and thereby a better optimal policy.

2) Deep Q-Learning: In the Q-learning method, a Q-table performs well in an environment where the state space is small. However, with an increase in the state space size, Q-learning becomes intractable [18]. Therefore, a model that maps the state information provided as input to the Q-values of the possible set of actions should be developed. For this reason, a deep neural network (DNN) plays the role of a function approximator, which can take state information as an input vector and learn to map it to Q-values for all possible actions. Since a DNN is being used as a function approximator of the Q-function, this process is called deep Q-learning. The main objective is to minimize the distance between Q(s, a) and the TD-target (temporal difference target). This objective can be expressed as the minimization of the squared error loss function, as shown in (2).

$$L_i(\theta_i) = \mathbb{E}_{a \sim \mu}\Big[\big(y_i - Q(s,a;\theta_i)\big)^2\Big], \quad \text{where } y_i := \mathbb{E}_{a' \sim \pi}\Big[r + \gamma \max_{a'} Q\big(s',a';\theta_{i-1}\big) \,\Big|\, S_t = s, A_t = a\Big] \tag{2}$$

The TD-target y_i and Q(s, a) are estimated separately by two different neural networks, which are often called the Target-network and the Q-network. The Q-values generated by this separate Target-network are used to compute the loss after every action taken by the agent during training. The weight parameters θ_{i−1} belong to the Target-network and θ_i belongs to the Q-network. The actions of the agent are selected according to the behavior policy μ(a|s). On the other side, the greedy target policy π(a|s) selects only actions a that maximize Q(s, a), which is used to calculate the TD-target. This technique is called fixed Q-targets; in effect, the target weights are updated only once in a while.

An important feature added to the deep Q-network is experience replay. It means that the agent can store its past experiences and use them in batches to train the DNN. Storing the experiences allows the agent to randomly draw batches and helps the network learn from a variety of data instead of basing decisions only on immediate experiences. In order to avoid storage issues, the experience replay buffer is of fixed size, and as new experiences get stored the old experiences get removed. For training the deep neural networks, uniform batches of random experiences are drawn from the buffer.

3) Double Deep Q-Learning: In standard DQN and Q-learning, the "max" operator uses the same weight values both to select and to evaluate an action [13], [17]. This makes it more likely to select overestimated values, resulting in overoptimistic value estimates. To prevent this, the selection should be decoupled from the evaluation. This is the idea behind the double deep Q-network (DDQN).

In DDQN, instead of taking the "max" over Q-values while computing the target Q-value during training, a primary network is used to choose the action and the Target-network is used to generate a target Q-value for that action. By decoupling the action choice from the target Q-value generation, the overestimation is reduced, which helps the model train faster and more reliably. The TD-target is as follows:

$$y_i := \mathbb{E}\Big[r + \gamma\, Q\Big(s', \arg\max_{a'} Q\big(s',a';\theta_i\big);\, \theta_{i-1}\Big) \,\Big|\, S_t = s, A_t = a\Big] \tag{3}$$

It can be observed that the Target-network with parameters θ_{i−1} evaluates the quality of the action, while the action itself is determined by the Q-network that has parameters θ_i. The calculation of the new TD-target y_i can be summarized in the following steps: the Q-network uses the next state s' to calculate the qualities Q(s', a) for each possible action a in state s'; the "argmax" operation applied to Q(s', a) chooses the action a* with the highest quality; the quality Q(s', a*) is then determined by the Target-network.

Deep learning techniques require a large amount of training data, and it is challenging to generate training data sets in many cases [13], [17]. However, in DDQN, the agent explores the environment and builds the data set by saving its experiences in a memory. In each episode, the training data set is randomly selected from the memory to optimize the DNN and estimate the Q-values. Therefore, generation of a separate training data set is not required for DDQN. Moreover, the agent always updates its memory with new experiences and keeps on training based on the continuous feedback to maximize the reward. In a real-time problem, the agent can use the Target-network to estimate the Q-values while the Q-network can be continuously trained to get more accurate decisions.
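
As an illustration of (3), the snippet below is a minimal NumPy sketch (ours, not from the paper) contrasting the standard DQN target with the double-DQN target. The toy linear "networks" W_q and W_target stand in for the Q-network and the Target-network; terminal-state handling is omitted for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)
n_actions = 3  # charge, idle, discharge
W_q = rng.normal(size=(5, n_actions))       # toy "Q-network" weights (theta_i)
W_target = rng.normal(size=(5, n_actions))  # toy "Target-network" weights (theta_{i-1})

def q_net(states):        # Q(s, .; theta_i)
    return states @ W_q

def target_net(states):   # Q(s, .; theta_{i-1})
    return states @ W_target

def dqn_target(rewards, next_states, gamma=0.95):
    # Standard DQN: the same (target) network both selects and evaluates the action.
    q_next = target_net(next_states)
    return rewards + gamma * q_next.max(axis=1)

def ddqn_target(rewards, next_states, gamma=0.95):
    # Double DQN, eq. (3): the Q-network selects the action (argmax),
    # the Target-network evaluates it.
    best_actions = q_net(next_states).argmax(axis=1)
    q_eval = target_net(next_states)
    return rewards + gamma * q_eval[np.arange(len(rewards)), best_actions]

# Example: a batch of 4 transitions with 5-dimensional next states.
next_states = rng.normal(size=(4, 5))
rewards = rng.normal(size=4)
print(dqn_target(rewards, next_states))
print(ddqn_target(rewards, next_states))
```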


C. A Comparison Between DDQN and Other Optimization Methods

Dynamic programming (DP) is a popular method for optimization problems [25]. DP solves for the optimal policy or value function by recursion. It requires knowledge of the Markov decision process (MDP), or a model of the environment, so that the recursions can be carried out. Thus, it is lumped under "planning" instead of "learning," and its objective is to determine the optimal actions for the given MDP. If the model of the environment is not clearly known, the use of DP could be challenging.

On the other hand, reinforcement learning (RL) does not require knowledge of the model of the environment. It is iterative and simulation-based and learns by bootstrapping, i.e., the value of a state or action is estimated using the values of other states or actions. Various RL methods are available in the literature, e.g., optimal policy RL [13], the path integral control approach to RL [26], and actor-critic (AC) RL [13], [27].

Among these methods, AC RL is the most similar to DDQN, which is used in this study. DDQN is categorized as a value function-based algorithm: an agent learns action values and then acts mostly greedily with respect to those values, so the policy is inherently given by the value function itself. Actor-critic methods [27] are a type of policy gradient algorithm. The AC method implements generalized policy iteration, where the actor aims at improving the current policy and the critic evaluates the current policy. Two different neural networks, i.e., a critic network and an actor network, are used in the AC method. The critic uses an approximation architecture and simulation to learn a value function, which is then used to update the actor's policy parameters in a direction of performance improvement. However, AC also has some limitations, especially for the problem considered in this study. The two main advantages of using DDQN instead of the actor-critic algorithm in this study are as follows.

• The actor-critic method is useful for continuous action space problems [27]. However, in this paper, the BESS model is presented with three possible actions, i.e., charge, discharge, and idle modes. DDQN is easier to implement for the stated problem.
• The DDQN method aims for a global optimum, whereas the optimization of the actor is a gradient-based method that can easily get trapped in a local optimal result [27] and takes a lot of training time to reach a global optimal result.

To evaluate the performance of the proposed method, a comparison between the two learning methods is presented for the operation of the CBESS in the numerical results section.

D. The Proposed DDQN-Based Operation of CBESS

In this study, the uncertainties in market price signals and in the amount of surplus/shortage power are considered for the operation of the CBESS. Due to these uncertainties, the environment could change at any time, which significantly increases the number of states of the CBESS. Therefore, a deep neural network is used as a function approximator, which can take state information and learn to map it to Q-values for all possible actions. In this section, a DDQN-based distributed operation of the CBESS is introduced in detail to estimate the optimal action of the CBESS. The overall double deep Q-learning principle diagram for the CBESS is summarized in Fig. 2.

Fig. 2. Double deep Q-learning principle diagram.

The CBESS agent starts from an initial state s. The CBESS carries out several actions, receives rewards according to the actions, and learns by replaying its experiences. The CBESS can be in charging, discharging, or idle mode in each state. Thus, the SOC of the CBESS can be increased, decreased, or kept the same as in the previous state depending on the chosen action. The rewards/penalties of performing an action for the CBESS are summarized in Algorithms 1 and 2 for grid-connected and islanded modes, respectively.

Algorithm 1: Reward/Penalty for CBESS in Grid-Connected Mode
  input: s = [interval, SOC, price ± u1]
  select an action a based on the ε-greedy policy:
    with probability ε select a random action
    otherwise select a = argmax_a' Q(s, a')
  carry out action a
  P_charge_max = min{P_max · (SOC_max − SOC), P_ramp,max^char}
  P_discharge_max = min{P_max · (SOC − SOC_min), P_ramp,max^dis}
  if a = "charge" then:
    if SOC = SOC_max then: r = a high penalty
    else: r = −P_charge · price
  else if a = "idle" then:
    r = 0
  else:
    if SOC = SOC_min then: r = a high penalty
    else: r = P_discharge · price
  observe the new state s' = [interval_2, SOC_2, price_2 ± u1] and r

1) Reward Function for CBESS: Algorithms 1 and 2 play an important role in determining the reward function for the CBESS in the different operation modes. These functions define what the good/bad events are for the CBESS. They also map each state-action pair to a single number, a reward, indicating the intrinsic desirability of that state [13]. The CBESS's objective is to maximize the total reward it receives in the long run.

Algorithm 2: Reward/Penalty for CBESS in Islanded Mode
  input: s = [interval, SOC, price ± u1, surplus ± u2, shortage ± u3]
  select an action a based on the ε-greedy policy:
    with probability ε select a random action
    otherwise select a = argmax_a' Q(s, a')
  carry out action a
  P_charge_max = min{P_max · (SOC_max − SOC), surplus, P_ramp,max^char}
  P_discharge_max = min{P_max · (SOC − SOC_min), shortage, P_ramp,max^dis}
  if a = "charge" then:
    if SOC = SOC_max then: r = a high penalty
    else: r = −P_charge · price
  else if a = "idle" then:
    r = 0
  else:
    if SOC = SOC_min then: r = a high penalty
    else: r = P_discharge · price
  observe the new state s' = [interval_2, SOC_2, price_2 ± u1, surplus_2 ± u2, shortage_2 ± u3] and r

In each given state, the CBESS determines the maximum amount of charging/discharging power based on its current SOC. After an action is selected and carried out by the CBESS, an immediate reward is returned to the CBESS agent. The reward depends on the time interval, the current SOC, and the market price signal at that interval. In charging mode, the CBESS receives a negative reward (penalty) for buying power. However, in case the SOC of the CBESS reaches the upper bound, it receives a high penalty for this wrong decision to reduce the chance of repeating it in the future. Conversely, the CBESS receives a positive reward in discharging mode. Similarly, it also faces a high penalty in this mode whenever the SOC of the CBESS reaches the lower bound. Finally, in idle mode, the CBESS keeps its power for the future and there is no reward for this action. After carrying out an action, the CBESS moves to a new state with updated information.

The main differences in the modeling between grid-connected mode and islanded mode are the number of features of a state and the price for the trading power. In grid-connected mode, each state (input) is a vector s with three features: the interval, the SOC of the CBESS, and the market price signal at the current interval. In each episode, the CBESS agent always starts from an initial state with an initial SOC at the first interval. The CBESS is operated for a 24-hour scheduling horizon and each time interval is set to 1 hour; therefore, the first interval is taken as t = 1. Based on the action selection, the reward/penalty is determined considering the uncertainty of market price signals. In islanded mode, the CBESS operation is also constrained by the amount of surplus/shortage power. For example, the maximum chargeable amount in islanded mode is constrained by the amount of surplus power and, similarly, the dischargeable amount is constrained by the amount of shortage power. Therefore, the input s has two more features, which are the surplus and shortage power of the MG. Due to the uncertainties of load and renewable resources in the MG system, the uncertainties of the surplus/shortage power of the MG are also considered in this study.

In grid-connected mode, the price is taken from the market price signal, while in islanded mode, the price is decided based on the percentage of critical load in each interval. Thus, the intervals having shortage power and a high amount of critical load will propose a higher buying price than the intervals having a lower amount of critical load, in order to reduce the shedding amount of critical load in the system.

Moreover, due to the uncertainties (i.e., the terms u_i in Algorithms 1 and 2) in the MG system, the trading prices and the amount of surplus/shortage power can change randomly between the lower and upper bounds of these uncertainties. This may increase the number of states to an infinite number. Consequently, the immediate reward received by carrying out an action from a given state can also change at every time step during an episode.

2) Algorithm for Operation of CBESS: The detailed algorithm for operation of the CBESS using DDQN is presented in Algorithm 3. Firstly, the operation mode of the CBESS is determined, i.e., grid-connected or islanded mode. Based on the operation mode, the number of features for a state is determined for the CBESS. The CBESS is trained with a large number of episodes to optimize the parameters of its deep neural networks. As mentioned earlier, there are two deep neural networks, i.e., a Q-network and a Target-network. These networks play an important role in the operation of the CBESS: they help the CBESS in the selection and evaluation of an action. Therefore, optimal parameters of these networks are necessary for operation of the CBESS. In this case, the gradient descent method is used for the minimization of the squared error loss function with respect to the network parameters of the Q-network (θ). After a constant number of time steps, all the parameters of the Target-network are replaced by the Q-network's parameters. Finally, with the optimal parameters, the networks can guide the CBESS in selecting an optimal action in a given state.

Algorithm 3: DDQN-Based Operation Strategy for CBESS
  determine the operation mode of the CBESS
  initialize replay memory D and mini-batch size
  initialize the action-value functions with random weights for the Q-network (θ) and the Target-network (θ' = θ)
  for episode = 1, N do:
    initialize the initial state s_1:
      s_1 = [interval_1, SOC_int, price_1] if grid-connected mode
      s_1 = [interval_1, SOC_int, price_1, surplus_1, shortage_1] otherwise
    for t = 1, T do:
      select an action a_t for state s_t using the ε-greedy policy
      carry out action a_t, observe r_t and the next state s_{t+1}
      store the transition (s_t, a_t, r_t, s_{t+1}) in D
      sample a random mini-batch of transitions (s_j, a_j, r_j, s_{j+1}) from D
      estimate the target y_j:
        y_j = r_j                                                    if s_{j+1} is terminal
        y_j = r_j + γ Q(s_{j+1}, argmax_a Q(s_{j+1}, a; θ); θ')       otherwise
      perform a gradient descent step on (y_j − Q(s_j, a_j; θ))^2 with respect to θ
      every C steps reset θ' = θ
    end
  end
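
To make Algorithm 3 concrete, the following is a minimal TensorFlow 2 sketch of one training step, under our own assumptions about network size, hyperparameters, and function names; it is not the authors' implementation (the paper only states that the model is coded in Python with the TensorFlow library).

```python
import numpy as np
import tensorflow as tf

def build_q_network(state_dim, n_actions):
    # Small fully connected network mapping a state vector to per-action Q-values.
    return tf.keras.Sequential([
        tf.keras.Input(shape=(state_dim,)),
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(n_actions),
    ])

def ddqn_train_step(q_net, target_net, optimizer, batch, n_actions, gamma=0.95):
    """One gradient descent step of Algorithm 3 on a sampled mini-batch."""
    states, actions, rewards, next_states, dones = batch
    # Double-DQN target: the Q-network selects the action, the Target-network evaluates it.
    best_next = tf.argmax(q_net(next_states), axis=1)
    next_q = tf.reduce_sum(target_net(next_states) * tf.one_hot(best_next, n_actions), axis=1)
    targets = rewards + gamma * (1.0 - dones) * next_q
    with tf.GradientTape() as tape:
        q_taken = tf.reduce_sum(q_net(states) * tf.one_hot(actions, n_actions), axis=1)
        loss = tf.reduce_mean(tf.square(targets - q_taken))   # squared error loss of (2)
    grads = tape.gradient(loss, q_net.trainable_variables)
    optimizer.apply_gradients(zip(grads, q_net.trainable_variables))
    return float(loss)

# Toy usage with random transitions (state_dim = 3 for grid-connected mode).
state_dim, n_actions, batch_size = 3, 3, 32
q_net = build_q_network(state_dim, n_actions)
target_net = build_q_network(state_dim, n_actions)
target_net.set_weights(q_net.get_weights())           # theta' = theta
optimizer = tf.keras.optimizers.Adam(1e-3)
batch = (np.random.rand(batch_size, state_dim).astype("float32"),
         np.random.randint(0, n_actions, batch_size),
         np.random.rand(batch_size).astype("float32"),
         np.random.rand(batch_size, state_dim).astype("float32"),
         np.zeros(batch_size, dtype="float32"))
print(ddqn_train_step(q_net, target_net, optimizer, batch, n_actions))
# Every C training steps: target_net.set_weights(q_net.get_weights())
```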


Fig. 3. Flowchart for operation strategy: (a) MG-EMS; (b) CBESS.

E. Operation Strategy for MG and CBESS

In this section, the detailed operation strategies for the MG and the CBESS are presented, as shown in Figs. 3(a) and (b), respectively. These strategies are implemented for both grid-connected and islanded modes. In grid-connected mode, the MG is connected to the utility grid. The hourly day-ahead market price signals, the generation capabilities of the CDG, RDG, and BESS, and the load profiles of the MG system are taken as inputs. The MG-EMS receives this information and performs optimization for all participating components to minimize the total operation cost. The decision about surplus and shortage amounts is based on the generation cost of the CDG unit, the market trading prices, and the local load demands of the MG.

However, in islanded mode, the MG is disconnected from the utility grid. The main objective is to minimize the amount of load shedding in the MG system. The amount of surplus/shortage power is now based only on the load profiles and the available generated power. Finally, the MG-EMS sends its information on surplus and shortage power to the external systems and waits for feedback information. The detailed process is summarized in Fig. 3(a).

On the CBESS side, the detailed DDQN-based operation strategy for the CBESS is presented in Fig. 3(b). Firstly, the weight parameters of the deep neural networks (DNNs) (i.e., the Q-network and the Target-network) are randomly initialized and the replay memory is also initialized with size D. In each episode, the agent starts from the initial state and selects an action based on the ε-greedy policy. The state of the CBESS is changed based on the operation mode. The agent carries out the action, receives a reward, and moves to a new state. This process is explained in detail in Algorithms 1 and 2. After finishing a transition from s to s', the transition information {s, a, r, s'} is stored in the replay memory D. A random mini-batch is sampled from memory D. Then the agent estimates the target y for the experience and performs an optimization step to minimize the squared loss function by using the gradient descent method with respect to the Q-network's weights (θ). The weight parameters of the Q-network are updated after each training step and the optimal network is used to select the action of the CBESS. After a constant number of steps, the weight parameters of the Target-network are reset to the Q-network's parameters. The Target-network is used to evaluate the quality of the selected action. Finally, the optimal action and the amount of charging/discharging power are determined by using the DNN with the optimal weight parameters. The overall process of the CBESS is summarized in Fig. 3(b).

After gathering the information from the CBESS and the utility grid, the MG-EMS decides the amount of buying/selling power from/to the utility grid and the CBESS and informs its components of the optimal results. However, in islanded mode, there is no connection to the utility grid, and load shedding could be implemented to maintain the power balance. After performing power sharing with the CBESS, the MG-EMS reschedules the operation of all the components based on the charging/discharging amount from the CBESS. Finally, a load shedding scheme is implemented for maintaining the power balance in the whole system in case of shortage power.


F. Operation of CBESS in a Multi-Microgrid System

Fig. 4. Extension to a multi-microgrid system.

In a large area, there may be several MGs. They are independently operated in normal operation. However, the uncertainties have a high effect on the operation of a single MG. The amount of load shedding could also be high during islanded operation and may result in a reduction of system reliability. Recently, the concept of the multi-microgrid (MMG) has been introduced as a potential solution to overcome these problems [7], [28]. An MMG system is formed by connecting several neighboring MGs to exchange power among MGs. In case a CBESS connects with an MMG system, as shown in Fig. 4, our proposed method should be extended to the optimal operation of the whole system in a distributed manner.

The internal trading in the MMG system is the main concern for this extension. A retailer agent is introduced to solve the problem of internal trading. It provides dynamic internal trading prices for the MMG system to maximize the total internal trading amount. In order to optimize the dynamic internal trading prices, a DDQN-based decision making for the retailer could be suitable, based on the amount of surplus/shortage power in each individual MG. Finally, the amount of surplus/shortage power is updated after performing internal trading among MGs, as shown in (4), (5).

$$P_{surplus,t} = \sum_{n=1}^{N} P_{surplus,n,t} - \sum_{n=1}^{N} P^{int}_{sell,n,t} \quad \forall t \in T \tag{4}$$

$$P_{shortage,t} = \sum_{n=1}^{N} P_{shortage,n,t} - \sum_{n=1}^{N} P^{int}_{buy,n,t} \quad \forall t \in T \tag{5}$$

After solving the internal trading problem, the remaining surplus/shortage power can be traded with the CBESS and the utility grid. Thus, the problem becomes the same as the single-MG problem and can be solved by using the proposed method. The optimal model for dynamic internal trading prices in an MMG system is a suitable future extension of this study. In the next section, a detailed mathematical model for the MG system is presented.

G. Mathematical Model for Operation of MG

In this section, a mixed integer linear programming (MILP)-based formulation of the MG system is presented for both grid-connected and islanded modes. In grid-connected mode, the objective function (6) is to minimize the total operation cost associated with the fuel cost, the start-up/shut-down cost of CDGs, and the cost/benefit of purchasing/selling power from/to the utility grid.

$$\min \sum_{i \in I}\sum_{t \in T}\Big(C^{CDG}_{i}\cdot P^{CDG}_{i,t} + y_{i,t}\cdot C^{SU}_{i} + z_{i,t}\cdot C^{SD}_{i}\Big) + \sum_{t \in T}\Big(PR^{Buy}_{t}\cdot P^{Buy}_{t} - PR^{Sell}_{t}\cdot P^{Sell}_{t}\Big) \tag{6}$$

The constraints associated with CDGs include (7)-(12). Constraint (7) shows the upper and lower operation bounds of CDGs. Equation (8) gives the on/off status of CDGs. The start-up and shut-down modes are determined by using constraints (9), (10) based on the on/off status of CDGs. The bounds for ramp-up/ramp-down rates of CDGs are enforced by (11), (12), respectively.

$$u_{i,t}\cdot P^{min}_{i} \le P^{CDG}_{i,t} \le u_{i,t}\cdot P^{max}_{i} \quad \forall i \in I, t \in T \tag{7}$$

$$u_{i,t} = \begin{cases} 1 & \text{CDG is on} \\ 0 & \text{CDG is off} \end{cases} \quad \forall i \in I, t \in T \tag{8}$$

$$y_{i,t} = \max\big\{u_{i,t} - u_{i,t-1},\, 0\big\} \quad \forall i \in I, t \in T \tag{9}$$

$$z_{i,t} = \max\big\{u_{i,t-1} - u_{i,t},\, 0\big\} \quad \forall i \in I, t \in T \tag{10}$$

$$P^{CDG}_{i,t} - P^{CDG}_{i,t-1} \le RU_{i}\cdot(1 - y_{i,t}) + \max\{RU_{i}, P^{min}_{i}\}\cdot y_{i,t} \quad \forall i \in I, t \in T \tag{11}$$

$$P^{CDG}_{i,t-1} - P^{CDG}_{i,t} \le RD_{i}\cdot(1 - z_{i,t}) + \max\{RD_{i}, P^{CDG}_{i,t-1}\}\cdot z_{i,t} \quad \forall i \in I, t \in T \tag{12}$$

The power balance between the power sources and power demands is given by (13). The buying/selling power is the amount of power traded with the external systems, which is divided into trading with the utility grid or the CBESS, as given in (14), (15), respectively.

$$P^{PV}_{t} + P^{WT}_{t} + \sum_{i \in I} P^{CDG}_{i,t} + P^{Buy}_{t} + P^{B-}_{t} = \sum_{l \in L} P^{Load}_{l,t} + P^{Sell}_{t} + P^{B+}_{t} \quad \forall t \in T \tag{13}$$

$$P^{Sell}_{t} = P^{Sell\_Grid}_{t} + P^{Sell\_CBESS}_{t} \quad \forall t \in T \tag{14}$$

$$P^{Buy}_{t} = P^{Buy\_Grid}_{t} + P^{Buy\_CBESS}_{t} \quad \forall t \in T \tag{15}$$

The constraints related to the BESS are given by (16)-(20). Constraints (16), (17) show the maximum charging/discharging power of the BESS considering the ramp rates. The SOC is updated by (18) after charging/discharging power at each interval of time. Equation (19) shows that the value of SOC is set to the initial SOC at the first interval of time (t = 1). The operation bounds of the BESS are enforced by (20).

$$0 \le P^{B+}_{t} \le \min\Big\{P^{Cap}_{B}\cdot\big(SOC^{B}_{max} - SOC^{B}_{t-1}\big)\cdot\frac{1}{1 - L^{B+}},\; P^{B+}_{Ramp,max}\Big\} \quad \forall t \in T \tag{16}$$

$$0 \le P^{B-}_{t} \le \min\Big\{P^{Cap}_{B}\cdot\big(SOC^{B}_{t-1} - SOC^{B}_{min}\big)\cdot\big(1 - L^{B-}\big),\; P^{B-}_{Ramp,max}\Big\} \quad \forall t \in T \tag{17}$$

$$SOC^{B}_{t} = SOC^{B}_{t-1} - \frac{1}{P^{Cap}_{B}}\cdot\Big(\frac{1}{1 - L^{B-}}\cdot P^{B-}_{t} - P^{B+}_{t}\cdot\big(1 - L^{B+}\big)\Big) \quad \forall t \in T \tag{18}$$

$$SOC^{B}_{t-1} = SOC^{B}_{ini} \quad \text{if } t = 1 \tag{19}$$

$$SOC^{B}_{min} \le SOC^{B}_{t} \le SOC^{B}_{max} \quad \forall t \in T \tag{20}$$

In the centralized approach for the CBESS, a battery energy management system (BEMS) is developed for maximizing its profit by optimal charging/discharging decisions. The objective function is expressed by (21).

$$\max \sum_{t \in T}\Big(P^{CB-}_{t}\cdot PR^{Sell}_{t} - P^{CB+}_{t}\cdot PR^{Buy}_{t}\Big) \tag{21}$$
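
As a quick numerical illustration of constraints (16)-(18), the following stand-alone Python sketch computes the feasible charging/discharging bounds and the SOC update for one interval. The paper's actual model is an MILP solved with CPLEX; this sketch is only our own rendering, and the parameter values are arbitrary placeholders rather than values from Table I.

```python
def bess_bounds_and_soc(soc_prev, p_charge, p_discharge,
                        cap=1000.0, soc_min=0.1, soc_max=0.9,
                        loss_ch=0.02, loss_dis=0.02, ramp_max=250.0):
    """Feasible bounds (16)-(17) and SOC update (18) for one interval."""
    # Maximum charging power, limited by the remaining headroom and the ramp rate.
    p_charge_max = min(cap * (soc_max - soc_prev) / (1.0 - loss_ch), ramp_max)
    # Maximum discharging power, limited by the stored energy and the ramp rate.
    p_discharge_max = min(cap * (soc_prev - soc_min) * (1.0 - loss_dis), ramp_max)

    if not (0.0 <= p_charge <= p_charge_max and 0.0 <= p_discharge <= p_discharge_max):
        raise ValueError("requested power violates constraints (16)-(17)")

    # SOC update (18): discharging drains more than delivered,
    # charging stores less than drawn, because of the loss terms.
    soc = soc_prev - (p_discharge / (1.0 - loss_dis) - p_charge * (1.0 - loss_ch)) / cap
    return p_charge_max, p_discharge_max, soc

# Example: charge 200 kW into a 1000-kWh BESS starting at 50% SOC.
print(bess_bounds_and_soc(soc_prev=0.5, p_charge=200.0, p_discharge=0.0))
```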


In islanded mode, the system is disconnected from the utility grid. The MG system can only trade its surplus/shortage power with the CBESS. In some peak load intervals, the MG and the CBESS might not fulfill the power demand in the system. Therefore, load shedding could be performed to keep the power balance in the system. In order to reduce the load shedding amount, the MG-EMS performs optimization for minimizing both the total operation cost and the load shedding amount. The cost objective function is changed to (22), with the generation cost and a penalty cost for load shedding. The power balance of power sources and power demands is given by (23) for islanded mode. Additionally, the objective function (22) is also constrained by (7)-(12) and (16)-(20). Finally, the maximum surplus power is determined by (24) for each interval.

$$\min \sum_{i \in I}\sum_{t \in T}\Big(C^{CDG}_{i}\cdot P^{CDG}_{i,t} + y_{i,t}\cdot C^{SU}_{i} + z_{i,t}\cdot C^{SD}_{i}\Big) + \sum_{t \in T} C^{pen}_{t}\cdot P^{Short}_{t} \tag{22}$$

$$P^{PV}_{t} + P^{WT}_{t} + \sum_{i \in I} P^{DG}_{i,t} + P^{B-}_{t} = \sum_{l \in L} P^{Load}_{l,t} + P^{B+}_{t} + P^{Sur}_{t} - P^{Short}_{t} \quad \forall t \in T \tag{23}$$

$$P^{Sur}_{max,t} = P^{PV}_{t} + P^{WT}_{t} + \sum_{i \in I} P^{DG}_{i,max} + P^{B-}_{t} - \sum_{l \in L} P^{Load}_{l,t} - P^{Short}_{t} - P^{B+}_{t} \quad \forall t \in T \tag{24}$$

Similarly, in the centralized approach, the objective of the BEMS is to minimize the amount of load shedding in the MG system. The objective function is presented by (25).

$$\min \sum_{t \in T} P^{Short}_{t}\cdot pen^{Short}_{t} \tag{25}$$

In this paper, a distributed operation strategy using DDQN is proposed for the optimal operation of the CBESS. In both operation modes, the objective of the CBESS is to maximize its profit by trading power with the MG and the utility grid. However, the charging/discharging bounds of the CBESS are different in each operation mode. In grid-connected mode, the amount of charging/discharging for the CBESS is totally dependent on the market price signals. The maximum charging/discharging power of the CBESS is given by (26), (27) considering the ramp rate constraints.

$$0 \le P^{CB+}_{t} \le \min\Big\{P^{Cap}_{CB}\cdot\big(SOC^{CB}_{max} - SOC^{CB}_{t-1}\big)\cdot\frac{1}{1 - L^{CB+}},\; P^{CB+}_{Ramp,max}\Big\} \quad \forall t \in T \tag{26}$$

$$0 \le P^{CB-}_{t} \le \min\Big\{P^{Cap}_{CB}\cdot\big(SOC^{CB}_{t-1} - SOC^{CB}_{min}\big)\cdot\big(1 - L^{CB-}\big),\; P^{CB-}_{Ramp,max}\Big\} \quad \forall t \in T \tag{27}$$

On the other hand, in islanded mode, the CBESS can only charge the surplus power from the MG and discharge to the MG during intervals having shortage power. Constraints (28), (29) show the maximum bounds for the charging/discharging amount at each interval of time considering the ramp rates. These constraints also ensure that charging is only possible when the MG has surplus power, while discharging is only possible when the MG has shortage power.

$$0 \le P^{CB+}_{t} \le \min\Big\{P^{Cap}_{CB}\cdot\big(SOC^{CB}_{max} - SOC^{CB}_{t-1}\big)\cdot\frac{1}{1 - L^{CB+}},\; P^{Sur}_{t},\; P^{CB+}_{Ramp,max}\Big\} \quad \forall t \in T \tag{28}$$

$$0 \le P^{CB-}_{t} \le \min\Big\{P^{Cap}_{CB}\cdot\big(SOC^{CB}_{t-1} - SOC^{CB}_{min}\big)\cdot\big(1 - L^{CB-}\big),\; P^{Short}_{t},\; P^{CB-}_{Ramp,max}\Big\} \quad \forall t \in T \tag{29}$$

H. Scheduling and Optimal Power Flow

The operation of a power system is generally carried out in two steps, i.e., optimal scheduling and optimal power flow. In the first step, the commitment statuses of the components are determined, including the power exchange with the main grid. In the second step, the status of the components is shared with the optimal power flow problem and is allowed to deviate minutely from the set values. In this step, the network losses are determined and are sent back to the first-step problem. The scheduling problem is then solved again including the network losses. This process is repeated in an iterative way until convergence is achieved.

However, an MG is a localized and small-scale power system and the length of its lines is short. Therefore, the resistance of the MG lines between two buses is very small and the power line losses are usually negligible [29], [30]. Therefore, power losses are not considered in the power flow in this study, and the network constraints, e.g., constraints on voltages and currents, are assumed to be fulfilled [31], [32]. A detailed analysis of the operation of the CBESS considering the optimal power flow and network constraints could be a suitable extension of this study for a larger system, e.g., a multi-microgrid system.
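
Before moving to the numerical results, the mode-dependent bounds of constraints (26)-(29) above can be summarized in a small stand-alone Python sketch; the parameter values are placeholders of our own choosing, not values from Table I.

```python
def cbess_bounds(soc_prev, mode, surplus=0.0, shortage=0.0,
                 cap=2000.0, soc_min=0.1, soc_max=0.9,
                 loss_ch=0.02, loss_dis=0.02, ramp_max=500.0):
    """Charging/discharging bounds of the CBESS, cf. (26)-(27) and (28)-(29)."""
    charge_max = min(cap * (soc_max - soc_prev) / (1.0 - loss_ch), ramp_max)
    discharge_max = min(cap * (soc_prev - soc_min) * (1.0 - loss_dis), ramp_max)
    if mode == "islanded":
        # In islanded mode the CBESS can only absorb the MG's surplus power
        # and only cover the MG's shortage power.
        charge_max = min(charge_max, surplus)
        discharge_max = min(discharge_max, shortage)
    return charge_max, discharge_max

# Grid-connected: bounds depend only on the SOC and the ramp rate.
print(cbess_bounds(0.5, "grid_connected"))
# Islanded: bounds are additionally capped by the MG's surplus/shortage power.
print(cbess_bounds(0.5, "islanded", surplus=120.0, shortage=300.0))
```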


III. NUMERICAL RESULTS

A. Input Data

Fig. 5. Test MG system: based on the CERTS MG test-bed [33].

Fig. 5 shows a test MG system based on the CERTS MG test-bed [33], considering the uncertainty of the environment, e.g., the uncertainty in the power of renewable resources, load demand profiles, and market price signals.

The test MG system is comprised of three converter-based sources and five load banks. In this study, the three converter-based sources are a controllable distributed generator (CDG), a battery energy storage system (BESS), and renewable distributed generators (RDGs). The detailed parameters of the CDG, BESS, and CBESS are tabulated in Table I.

TABLE I. The detailed parameters of BESS, CBESS, and CDG.

The MG is interconnected with a CBESS and the utility grid. The system can be operated in both grid-connected and islanded modes. In this study, the operation bounds of the BESS and CBESS were chosen as [10%, 90%] to elongate the lifespan of the energy storage elements, the same as in [34]. The analysis is conducted for a 24-hour scheduling horizon with a time interval of 1 hour. The DDQN-based model is coded in Python, using the TensorFlow library [35]. The MILP-based model is also coded in Python and solved by using the MILP solver CPLEX 12.6 [36]. The market price signals, the total load amount, and the total output of the RDGs are shown in Figs. 6(a) [37], 6(b) [38], and 6(c) [37], respectively. The detailed numerical results are shown in the following sections for both grid-connected and islanded modes.

Fig. 6. Input data: (a) market price signals considering 5% of uncertainty; (b) load profile; (c) the output power of RDG.

B. Operation of the System in Grid-Connected Mode

This section presents the operation of the MG and the CBESS in grid-connected mode. The MG-EMS performs optimization to minimize the total operation cost of the MG system. The operation of all components in the MG is shown in Fig. 7. The amount of trading power is determined based on the amount of surplus/shortage power in the MG system. It can be observed that the MG always buys power at low market prices and sells power at high market prices. The BESS also charges power during low price intervals and discharges power to fulfill the load or to sell to the external system during peak price intervals.

Fig. 7. Operation of MG in grid-connected mode.

On the other side, the CBESS is operated based on the DDQN method. The CBESS is trained with 80000 episodes in grid-connected mode considering two cases: without uncertainty (case 1) and with ±5% uncertainty of market price signals (case 2). In order to show the performance of the proposed method, the CBESS model is also trained with the actor-critic RL method considering the same training time and uncertainty of market price signals. After training, the optimal weight parameters of the DNNs are used to estimate the actions of the CBESS and to evaluate the quality of those actions. The convergence of the total reward for the CBESS using the proposed method is shown in Fig. 8(a). It can be observed that the actions of the CBESS become better as the number of episodes increases. The CBESS keeps on choosing random actions with a small probability ε; therefore, the value of the total reward also fluctuates. In order to show the total reward more clearly, the total rewards during the last 100 episodes of training are shown in Fig. 8(b) for both the actor-critic RL and DDQN methods. It can be observed that the total reward of the CBESS converges to optimal points for both learning methods. However, the CBESS operation is easily trapped in a local optimum by using actor-critic RL because the optimization of the actor is a gradient-based method. Therefore, in this study, a detailed operation of the CBESS is only analyzed by using the DDQN method. It can be observed that the total rewards of the CBESS after 80000 episodes converge to an optimal value in the case without uncertainty. With ±5% uncertainty of price signals, the total rewards of the CBESS also converge to optimal values based on the real values of the market price signals between the upper and lower bounds. In this case, the total rewards could be lower or higher than the total rewards in case 1 due to the change in market price signals (±5%).
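
The ±5% price-uncertainty case can be reproduced conceptually as below: at every step the environment perturbs the forecasted price within the stated band. This is only a hedged sketch of how such an environment could be set up, not the authors' code; the uniform perturbation is our assumption about how "randomly changed between the lower and upper bounds" is realized.

```python
import numpy as np

rng = np.random.default_rng(42)

def perturbed_price(forecast_price, uncertainty=0.05):
    """Sample a realized price within +/-5% of the day-ahead forecast (case 2)."""
    low, high = forecast_price * (1.0 - uncertainty), forecast_price * (1.0 + uncertainty)
    return rng.uniform(low, high)

# Example: realized prices for a 24-hour forecast profile (arbitrary numbers).
forecast = np.linspace(80.0, 120.0, 24)            # KRW/kWh, illustrative only
realized = np.array([perturbed_price(p) for p in forecast])
print(np.round(realized, 1))
```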


Fig. 8. (a) Total reward for CBESS during 80000 episodes; (b) total rewards for CBESS during the last 100 episodes.

In order to analyze the accuracy of the proposed method, the accuracies in both cases are summarized in Fig. 9 after training. It can be observed that the accuracies in both cases are acceptable (≈1) for operation of the CBESS. In case 2, since the price signals are randomly changed at every step in an episode, the state space is much bigger than in case 1. Therefore, the training accuracy in case 1 converges faster than in case 2. In order to guarantee an acceptable accuracy, the CBESS could be trained for a few hours. However, the training process is off-line training based on forecasted data for a day-ahead scheduling of the CBESS. The CBESS can use the Target-network to estimate the Q-values while the Q-network can be continuously trained to get more accurate decisions.

Fig. 9. Accuracy of CBESS after 80000 episodes.

By using the optimal DNN, the optimal operation of the CBESS is shown in Fig. 10 for the centralized-based approach and the proposed method. With the proposed method, the optimal weight parameters of the DNN are saved at the end of the training process. Then the optimal DNN is used to estimate the optimal actions of the CBESS. It can be observed that the CBESS can optimize its charging/discharging decisions with the proposed method, similar to the centralized approach. The CBESS is charged during the off-peak price intervals (intervals 2, 4, 5, and 19) and is discharged during the peak price intervals (intervals 12-15, and 20). Based on the optimal operation of the MG's components and the optimal charging/discharging decisions of the CBESS, the total operation cost of the MG and the total profit of the CBESS are summarized in Table II for grid-connected mode. A positive value represents profit, while a negative value represents operation cost.

Fig. 10. The operation of CBESS after 80000 episodes.

TABLE II. Total operation cost of MG and total profit of CBESS in grid-connected mode.

Moreover, in order to show the performance of the proposed method for handling the uncertainty, a comparison between a conventional EMS (planning without uncertainty) and the proposed method considering the uncertainty of market price signals in grid-connected mode is presented. We assumed the real-time data for market price signals as shown in Fig. 11. If the CBESS is operated following the day-ahead scheduling, the total profit could be reduced because of the error between the forecasted and the real-time market price signals. For instance, the CBESS could decide to charge power in a cheap interval (interval 5) based on the forecasted data. However, the price of this interval might not actually be cheap. By using the proposed method, the CBESS can adjust its schedule based on the real-time data, as shown in Fig. 12. It can be observed that the CBESS changes its charging/discharging decisions. Therefore, the total profit is kept at a higher value. The total profit for the CBESS by following the day-ahead schedule is 15280 KRW. On the other hand, the total profit is 15720 KRW by using the DDQN.

Fig. 11. Market price signals: a test case.

Fig. 12. Optimal operation of CBESS with the uncertainty of market price signals.

C. Operation of the System in Islanded Mode

In islanded mode, the MG cannot trade power with the utility grid. The CBESS plays an important role in the operation of the MG for reducing the load shedding amount. The CBESS is used to shift the surplus power to other intervals having shortage power; that is, the CBESS charges the surplus power and discharges to fulfill the shortage power. The operation of all components in the MG system is shown in Fig. 13. The MG-EMS controls these components to fulfill the load demand and reduces the load shedding amount during some peak load intervals. The detailed amount of surplus/shortage power of the MG is shown in Fig. 14 for each interval of time. The trading price signals are shown in Fig. 15; the price is decided based on the percentage of critical load in each interval. In islanded mode, we assume that the uncertainties of the trading price signals and the surplus/shortage amount are ±5% and ±10%, respectively.

Fig. 13. Optimal operation of MG in islanded mode.

Fig. 14. Surplus/shortage power of MG considering ±10% uncertainty.

Fig. 15. Trading price considering ±5% of uncertainty.

Due to the increase in the number of uncertainty parameters, the number of training episodes is also increased to improve the accuracy of the training model. In this study, the model is trained for 100000 episodes for two cases: without uncertainty (case 1) and with uncertainty (case 2). Fig. 16 shows the total reward of the CBESS during the last 100 episodes of the training time. The total reward of the CBESS after 100000 episodes converges to an optimal value in case 1. On the other hand, in case 2, the total reward of the CBESS can also converge to optimal values based on the real values of the trading price signals, surplus, and shortage amount. In this case, the total reward could be lower or higher than the total reward in case 1 due to the changes in market price signals (±5%) and surplus/shortage power (±10%). The accuracies in both cases are also summarized in Fig. 17 after training. It can be observed that the accuracies in both cases are acceptable (≈1) for operation of the CBESS. In case 2, since the trading price signals and the surplus/shortage amount are randomly changed at each step between the upper and the lower bounds, the state space is much bigger than in case 1. Therefore, the training accuracy in case 1 converges faster than in case 2.

Fig. 16. Total reward for CBESS after 100000 episodes.

Fig. 17. Accuracy for CBESS after 100000 episodes.

By using the optimal DNNs, the operation of the CBESS is given in Fig. 18 for the proposed method and the centralized approach. It can be observed that the CBESS is operated in an optimal way by both methods. The CBESS decides to charge the surplus power in lower price intervals and discharges to fulfill the shortage power in high price intervals. The load shedding amount of the MG system can also be significantly reduced by using the CBESS, as shown in Fig. 19. In the case without the CBESS, load shedding is performed during almost all peak load intervals (intervals 7-10 and 15-19). With the CBESS, whose operation is determined by using the optimal DNN, the amount of load shedding is only high in intervals 7, 15, and 19.


Fig. 18. The operation of CBESS after 100000 episodes.

Fig. 20. (a) Market price signals; (b) Optimal operation of CBESS during
48-hour scheduling horizon.

Fig. 19. Load shedding amount in MG system. the CBESS is trained to maximize its total profit during that
scheduling horizon, as shown in Fig. 20(b). It can be observed
TABLE III
that the CBESS charges power at the end of the first day with
T OTAL O PERATION C OST OF MG AND T OTAL
P ROFIT OF CBESS IN I SLANDED M ODE lower price and uses for the next day to reduce the operation
cost.
Similarly, in islanded mode, the CBESS is trained for
minimizing the load shedding during the 48-hour scheduling
horizon. By using the proposed method, the amount of surplus
power during the last intervals of the first day can be charged
to the CBESS to use for fulfilling the shortage power in the
next day. Therefore, the load shedding amount could also be
reduced with a longer time scheduling.
The results for longer scheduling horizon (48 hours) are
better than those of the shorter scheduling horizon (24 hours).
However, the results are also dependent on the forecasted data
using the optimal DNN, the amount of load shedding is only and the uncertainties in the system. The accuracy of forecasted
high in intervals 7, 15, and 19. data could be reduced for considering a longer time period.
Similar to the grid-connected mode, thanks to the optimal It leads to increase in the uncertainties in the system. That
DDN, the CBESS can also adjust its decisions in this opera- could make worse situation for the system reliability and oper-
tion mode based on the real-time data of the amount of surplus ation cost, as mentioned in the previous sections. Moreover,
/shortage power. It leads to reduction in the load shedding the complexity of the system also increases in terms of training
amount in MG compared with other conventional methods process whenever any event occurs in the system. Therefore,
(i.e., MILP, Q-learning) and improve the system reliability. a day-ahead scheduling (24-hour time span) is proposed to
Finally, the total operation cost of MG and total profit of preserve the higher reliability of the system.
CBESS are also summarized in Table III for islanded mode.
IV. C ONCLUSION
In this paper, a distributed operation strategy has been developed for managing the operation of a CBESS in an MG system using the double deep Q-learning method. The system comprises an MG and a CBESS, where an MG-EMS was used for optimizing the operation of the MG. In contrast to Q-learning, the proposed operation strategy was capable of dealing with uncertainties in the system in both grid-connected and islanded modes. Moreover, by decoupling the selection and evaluation of an action, the proposed method reduced the overestimation of Q-values. Finally, a comparison between the proposed strategy and other methods has been presented to show the effectiveness of the proposed method. With a sufficient number of training episodes, the CBESS operated optimally under the proposed strategy and accurately determined the same optimal operation as the centralized method in both grid-connected and islanded modes.
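As a recap of the decoupling mentioned above, the sketch below shows how a double-DQN target separates action selection (online network) from action evaluation (target network); the function names, hyperparameters, and usage values are illustrative and not the authors' settings.

import numpy as np

def ddqn_target(reward, next_state, online_q, target_q, gamma=0.99, done=False):
    # Double DQN: the online network selects the next action, the target network
    # evaluates it, which mitigates the Q-value overestimation of standard DQN.
    if done:
        return reward
    a_star = int(np.argmax(online_q(next_state)))                  # selection (online network)
    return reward + gamma * float(target_q(next_state)[a_star])    # evaluation (target network)

# A standard DQN target would instead be reward + gamma * max(target_q(next_state)),
# letting the same network both select and evaluate the action.
online = lambda s: np.array([0.1, 0.5, 0.2])
target = lambda s: np.array([0.0, 0.4, 1.0])
print(ddqn_target(1.0, None, online, target))   # 1.0 + 0.99 * 0.4 = 1.396

The resulting target is used as the regression label for the online network on each sampled transition, while the target network is only refreshed periodically.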


Van-Hai Bui (S'16) received the B.S. degree in electrical engineering from the Hanoi University of Science and Technology, Vietnam, in 2013. He is currently pursuing the combined master's and Ph.D. degrees with the Department of Electrical Engineering, Incheon National University, South Korea. His research interests include microgrid operation and energy management systems.

Akhtar Hussain (S'14) received the B.E. degree in telecommunications from the National University of Sciences and Technology, Pakistan, in 2011, and the M.S. degree in electrical engineering from Myongji University, Yongin, South Korea, in 2014. He is currently pursuing the Ph.D. degree with Incheon National University, South Korea. He was an Associate Engineer with SANION, an IED development company in South Korea, from 2014 to 2015. His research interests are distribution automation and protection, smart grids, and microgrid optimization.

Hak-Man Kim (SM'15) received the first Ph.D. degree in electrical engineering from Sungkyunkwan University, South Korea, in 1998, and the second Ph.D. degree in information sciences from Tohoku University, Japan, in 2011. He was with the Korea Electrotechnology Research Institute, South Korea, from 1996 to 2008. He is currently a Professor with the Department of Electrical Engineering, Incheon National University, South Korea. His research interests include microgrid operation and control and dc power systems.

