

Deep Reinforcement Learning Method for Demand Response Management of Interruptible Load

Biao Wang, Yan Li, Weiyu Ming, and Shaorong Wang

Manuscript received July 31, 2019. This work was supported by the National Key R&D Program of China under Grant 2017YFB0902800. B. Wang, Y. Li, W. Ming, and S. Wang are with the State Key Laboratory of Advanced Electromagnetic Engineering and Technology, Huazhong University of Science and Technology, Wuhan 430074, China (e-mail: biaowang96@foxmail.com; liyanhust@hust.edu.cn; 743892015@qq.com; wsrwy96@vip.sina.com).

Abstract—As an important part of incentive demand response (DR), interruptible load (IL) can achieve a rapid response and improve demand side resilience. Yet, model-based optimization algorithms involving IL require an explicit physical or mathematical model of the system, which makes it difficult to adapt to realistic operation conditions. In this paper, a model-free deep reinforcement learning (DRL) method with a dueling deep Q network (DDQN) structure is designed to optimize the DR management of IL under the time of use (TOU) tariff and variable electricity consumption patterns. The DDQN-based automatic demand response (ADR) architecture is first constructed, which provides the possibility of real-time application of DR. To obtain the maximum long-term profit, the DR management problem of IL is formulated as a Markov decision process (MDP), in which the state, action, and reward function are defined, respectively. The DDQN-based DRL algorithm is applied to solve this MDP for the DR strategy with maximum cumulative reward. The simulation results validate that the proposed algorithm with DDQN overcomes the noise and instability of the traditional DQN and realizes the goal of reducing both the peak load demand and the operation costs on the premise of regulating the voltage to the safe limit.

Index Terms—Deep reinforcement learning, dueling deep Q network, demand response, interruptible load, Markov decision process.

I. INTRODUCTION

With the development of the smart grid and the power market, the concept of demand side management (DSM) has attracted widespread attention recently [1]-[3]. DSM is one of the important functions in a smart grid that allows customers to make informed decisions regarding their energy consumption and helps the energy providers reduce the peak load demand and reshape the load profile [4]. As a typical form of DSM, demand response (DR) is considered the most cost-effective and reliable solution for smoothing the demand curve when the system is under stress [5].

Various DR programs have been developed and implemented, which can be mainly categorized into price-based and incentive-based methods [6]. Price-based DR represents the behavior of users in response to changes in electricity prices. As the most widely used price-based DR, the time of use (TOU) tariff, where demand-dependent prices apply during different predefined intervals of the day, has been adopted as the default rate structure by many utility companies [7]. Incentive-based DR refers to behavior driven by economic incentives. As an important part of incentive-based DR, interruptible load (IL) is the interruptible part of the user load during peak load periods or emergencies, which can achieve a rapid response and improve demand side resilience. Usually, IL is realized by signing economic contracts between utility companies and users [8]. Users are obliged to interrupt their electricity consumption on time as stipulated in the contract in case of system peak or emergency; meanwhile, utility companies need to pay users certain economic compensation [9]. Appropriate economic compensation can stimulate various users to reduce power consumption at the system peak, and the distribution system operators (DSO) can avoid market risks and reduce operation costs at the same time. In this paper, the compensation scheme for IL adopts high compensation after outage [10], which is incurred only after the interruption of IL in case of emergency and is closely related to the probability of emergency.

For the DSO, on the one hand, they should guarantee the normal power supply of the users in the system. On the other hand, they have to purchase power from utility companies because of the lack of large-capacity generators. Considering the two types of DR programs at the same time, this paper integrates the optimization effect of TOU on load curves and the improvement of system security and economy by IL into the same model, reflecting the complementarity of the two. Under the current TOU tariff, if DR is reasonably implemented for IL in the system, it will play a good role in reducing the peak load and achieving economic benefits. The participation of IL in system operation has been studied in some previous papers. In [11], the effect of changing the incentive and penalty factors in an interruptible/curtailable (I/C) DR program in microgrid operation is analyzed; the problem is modeled as a mixed-integer nonlinear program (MINLP) and solved by the CPLEX solver. In [12], the minimum-cost IL commitment to ensure primary frequency response (PFR) adequacy is investigated, which is modeled as a binary integer programming problem and solved by an iterative solution algorithm.
In [13], the joint optimization of economic dispatch (ED) and IL management is cast as a stochastic optimization problem to compensate for the forecast errors of wind farm generation. Literature [14] sets out a framework for decision making in emergency operations for power transformers, which relies on interruption contracts as decision variables. In general, optimization problems with IL can be modeled as a multiperiod MINLP in which the state of IL is considered an integer decision variable. However, traditional optimization algorithms that employ dynamic programming techniques only take into account a specific system operation pattern, which makes it difficult to adapt to realistic operation conditions such as variable electricity consumption patterns.

Unlike traditional model-based methods that require an explicit physical or mathematical model of the system, deep reinforcement learning (DRL), a combination of reinforcement learning (RL) and deep learning [15], is a model-free algorithm for solving complex control problems [16]. DRL has shown successful performance in playing Atari [17] and Go games [18] in recent years, and its success in AlphaGo demonstrates that DRL has become one of the most fascinating research areas in machine learning. So far, there have been some attempts to apply the DRL method to the DR control problem [19]-[23]. Paper [19] comprehensively reviews the use of reinforcement learning for DR applications in the smart grid. In detail, paper [20] concerns the joint bidding and pricing problem of a load serving entity, which is tackled by the deep deterministic policy gradient (DDPG) algorithm in [21]. Paper [22] proposes a dynamic pricing DR method with Q-learning for energy management in a hierarchical electricity market. Paper [23] designs an actor-critic-based DRL algorithm to determine the optimal energy management policy for industrial facilities. All of these papers formulate the DR control as a Markov decision process (MDP) and use the respective DRL algorithm to make complex DR decisions adapting to specific constraints.

In general, DRL algorithms can be divided into value-based and policy-based algorithms, wherein the deep Q network (DQN) is the most classical value-based algorithm. The dueling deep Q network (DDQN) structure [24] is an improvement of DQN [25], which effectively solves the problem of overestimation of the DQN value function and enhances the generalization ability of the model. Hence, this paper aims to propose a value-based DRL algorithm with the DDQN structure to map the real-time state of the grid to the DR strategy for IL, with the aim of reducing the peak load demand of the system as well as the operation costs of the DSO. Considering the presented literature review, the main contributions of this paper are as follows:

1. The DR management problem for IL is formulated as an MDP, which allows the consideration of the accumulative profit of the DSO in the long term;
2. The proposed DRL algorithm realizes the direct mapping of the real-time state of the grid to the DR management strategy, thus achieving the goals of both regulating voltage and reducing the total operation costs of the DSO under variable electricity consumption patterns;
3. Compared with the traditional DQN, the DDQN structure overcomes the noise and instability of the value function during the iteration and enhances the convergence stability of the model.

The remainder of the paper is organized as follows. In Section II, the DDQN-based automatic demand response architecture is constructed and analyzed. Section III introduces the MDP framework for DR management of IL, in which the state, action, and reward function are formulated, respectively. Section IV defines the state-action estimation function of the DDQN and describes the DRL-based algorithm, based on which the optimal DR strategy of IL is obtained. Section V applies the algorithm to the IEEE 33-node extension system and confirms its validity through simulation. At last, the conclusions are drawn in Section VI.

II. DDQN-BASED AUTOMATIC DEMAND RESPONSE ARCHITECTURE

Traditional DR requires the staff to operate the equipment or set the operating conditions manually, which cannot guarantee the response speed of users or the reliability of DR implementation. Automatic demand response (ADR), as one of the key technologies of the smart grid, refers to the implementation of demand response programs through automatic systems without manual intervention, which can truly achieve load scheduling and thereby improve energy efficiency. In this sense, it is necessary to introduce automation technology into DR.

By establishing a bidirectional communication network on the grid side and the user side, the ADR system realizes the automation of DR combined with advanced measurement technology and control methodology. The system architecture of the DDQN-based ADR is shown in Fig. 1. The automatic demand response terminal (ADRT) integrates the input information of both the grid side and the user side to obtain state action transitions $(s_t, a_t, r_{t+1}, s_{t+1})$ and transmits them through the communication network to the database located in the computing center of the DSO for storage, wherein the grid-side information includes real-time operation information of the grid (such as node voltages) collected by the measurement devices and the TOU electricity price tariff, and the user-side information contains the interruption information and compensation scheme of each IL from the controller. As there may be many ILs widely distributed in the system, it is hard for the DSO to communicate with each IL individually, which would make the communication and control network more complicated. Therefore, a load aggregation controller is added on the user side of the independent system, which aggregates all ILs within the system and performs remote reading and interruption control for each IL.
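To make the data the ADRT exchanges with the computing center concrete, the following is a minimal Python sketch of how one state action transition record might be represented before being written to the database; the dataclass layout, field names, and example values are illustrative assumptions rather than part of the paper's implementation.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Transition:
    """One (s_t, a_t, r_{t+1}, s_{t+1}) sample logged by the ADRT."""
    state: List[float]        # normalized end-node voltages, root-node power, IL powers
    action: List[int]         # 0/1 interruption flag per IL at time t+1
    reward: float             # immediate reward r_{t+1}
    next_state: List[float]   # observation after the power flow settles

# Hypothetical record for a system with four end nodes and four ILs
sample = Transition(
    state=[0.42, 0.45, 0.47, 0.44, 0.63, 0.30, 0.22, 0.35, 0.28],
    action=[1, 0, 1, 1],
    reward=0.57,
    next_state=[0.48, 0.49, 0.51, 0.47, 0.58, 0.30, 0.0, 0.35, 0.28],
)
```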

Moreover, the action execution and the network training are separated in the system in order to meet the real-time requirement of ADR. As shown in Fig. 1, the computing center and the ADRT each contain a DDQN, and the two play different roles in the DR management of IL, as described in the following procedure. The computing center obtains the state action transitions from the ADRT and stores them in the database, and then the DDQN at the computing center is trained based on the DRL algorithm. Furthermore, the DDQN of the ADRT is periodically updated by copying parameters from that of the computing center after the training. Finally, combined with the external input of the grid, the DDQN with updated network parameters can obtain the optimal DR strategy of IL (that is, whether each IL is cut off during the peak load period) by performing a forward calculation, and the controller deployed on the user side executes the DR strategy on each IL and monitors the response results, thereby achieving the automation of DR for IL.

Fig. 1. DDQN-based automatic demand response architecture for IL.

In summary, if the TOU tariff of the grid and the compensation scheme for the interruption of each IL are given, the total operation costs of the DSO in one day include two parts: the electricity bill paid to the grid and the compensation costs for IL. It is assumed that the daily load curve of each node in the system is independent, and there may be a low voltage problem at the end nodes of the system under the initial operation conditions (e.g., heavy load or long distribution lines). The research focus of this paper is to design the DDQN to map the real-time state of the grid to the DR strategy for IL, with the aim of regulating the system voltage to the safe limit and reducing the operation costs of the DSO.

III. MARKOV DECISION PROCESS FOR DEMAND RESPONSE CONTROL OF INTERRUPTIBLE LOAD

According to the previous analysis, the training of the parameters in the DDQN is based on the state action transitions $(s_t, a_t, r_{t+1}, s_{t+1})$, which are defined by the Markov decision process (MDP). In the following part of this section, we formulate the MDP framework for DR management of IL, which presents the composition of the state, action, and reward function in detail.

A. Formulation of Markov Decision Process

1) State Formulation

The composite state of the environment ($S$) includes the operation state of the system ($S^{op}$) and the state of IL ($S^{il}$) as in (1). In our case, the state of IL is defined as the power consumption of each IL, in order to track its interruption status and determine the compensation costs paid to the IL in real time. The operation state is denoted by the voltage at the end nodes and the total power consumption at the root node of the system, which are necessary for regulating the voltage and shaving the peak load of the system, respectively. Accordingly, the observable state ($s_t$) of the system at time t is defined as (2), including the voltage of node i ($U_t^i$), the total power ($P_t^{total}$), and the power consumption of node j ($P_t^j$) at time t. $N_{end}$ and $N_{IL}$ represent the set of end nodes and the set of IL access nodes of the system, respectively.

$$S = S^{op} \cup S^{il} \qquad (1)$$

$$s_t = [U_t^i,\ P_t^{total},\ P_t^j], \quad i \in N_{end},\ j \in N_{IL} \qquad (2)$$

However, there are significant differences in the range of numerical variation between voltage and power. In order to facilitate the learning process, we normalize them to the range [0, 1] as in (3) before feeding them to the neural network (taking $U_t^i$ as an example):

$$U_t^{i\,\prime} = \frac{U_t^i - \min(U_t^i)}{\max(U_t^i) - \min(U_t^i)} \qquad (3)$$

where the minimum and maximum of $U_t^i$ can be obtained from historical observations.

2) Action Formulation

The decision-making on the interruption of IL is the core problem in the proposed DR strategy, for which we choose the action combination of the ILs as the control action in this study. Due to the limitation of current communication technology and automation levels, it is difficult to achieve continuous and precise control of the load curtailment for IL. Thus, the action of each IL is restricted to two states in terms of whether it is interrupted, and the corresponding action ($a_t$) at time t is set as (4):

$$a_t = [u_{t+1}^j], \quad j \in N_{IL} \qquad (4)$$

where $u_{t+1}^j$ denotes the state variable of the IL at node j at time t+1 and takes only the two values 0 or 1: the value 0 indicates the interruption of the power supply at that time, whereas the value 1 indicates non-interruption.

3) Reward Formulation

Under the current TOU tariff and compensation scheme of IL, the ADRT needs to adopt the DR strategy of IL according to the current state so as to regulate the voltage at the end nodes to a reasonable range and reduce the total operation costs of the DSO. In addition, IL cannot be interrupted frequently during the day, which would greatly increase the complexity of operations for the DSO. Considering the above operational objectives comprehensively, the immediate reward ($r_{t+1}$) of DR management at time t+1 is established as (5):

$$r_{t+1} = \sum_{i \in N_{end}} w_{vol}\, r_{t+1}^{vol(i)} + \sum_{j \in N_{IL}} w_{il}\, r_{t+1}^{il(j)} + w_{eoc}\, r_{t+1}^{eoc} \qquad (5)$$

where $r_{t+1}^{vol(i)}$ is the voltage reward at end node i, $r_{t+1}^{il(j)}$ is the interruption reward at IL access node j, and $r_{t+1}^{eoc}$ is the economic reward. $w_{vol}$, $w_{il}$, and $w_{eoc}$ are their corresponding weights, which can be selected according to the specific operational target of the DSO.

For $r_{t+1}^{vol(i)}$, this paper selects ±5% of the rated voltage as the safe limit to retain a sufficient voltage margin, and measures the voltage reward with the reward coefficient $F_{reward}$ and the penalty coefficient $F_{penalty}$. Specifically, if the action of IL regulates the voltage from the over-limit status to the safe status, then $r_{t+1}^{vol(i)}$ is taken as $F_{reward}$. On the contrary, if the action causes the originally safe voltage to go over the limit, $r_{t+1}^{vol(i)}$ is taken as $F_{penalty}$; otherwise, it is 0.

For $r_{t+1}^{il(j)}$, frequent disconnections of IL may harm user satisfaction and consequently discourage participation in ADR programs. In order to minimize the number of interruptions of IL in a control cycle and improve user satisfaction, this paper uses the state change of IL between two adjacent sampling times to define the interruption reward as (6). If the state changes, $r_{t+1}^{il(j)}$ takes the negative value -1; otherwise the value is 0.

$$r_{t+1}^{il(j)} = -\left| u_{t+1}^j - u_t^j \right| \qquad (6)$$

The operation costs of the DSO consist of two parts: the electricity bill paid to the grid and the compensation costs for IL. When the operation costs with DR are less than those without DR, it indicates that the control strategy of IL obtains economic benefits for the DSO at that moment. The economic reward is defined as (7):

$$r_{t+1}^{eoc} = P_t^{total} \cdot p_t^{tou} - \Big[ P_t^{total\_DR} \cdot p_t^{tou} + \sum_{j \in N_{IL}} P_t^j \cdot c^j \cdot (1 - u_t^j) \Big] \qquad (7)$$

where $P_t^{total\_DR}$ is the total power at the root node of the system with DR, whereas $P_t^{total}$ is the one without DR, $p_t^{tou}$ is the TOU electricity price at time t, and $c^j$ is the unit compensation cost paid to the IL at node j. What needs to be emphasized is that the above-mentioned rewards also need to be normalized due to their different dimensions and orders of magnitude. Therefore, the immediate rewards defined in this paper all lie within [0, 1] after the same normalization as in (3).
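To illustrate how the immediate reward in (5) could be assembled from (6), (7), and the voltage rule above, the following is a minimal Python sketch; the function signatures are assumptions introduced for the example, the default weights are the case-study values given later in Section V, and the normalization of the three terms to [0, 1] is omitted for brevity.

```python
def voltage_reward(v_prev, v_now, f_reward=1.0, f_penalty=-1.0, v_rated=1.0, margin=0.05):
    """Voltage reward for one end node: F_reward if the action brings an over-limit
    voltage back inside +/-5% of rated, F_penalty if it pushes a safe voltage over
    the limit, and 0 otherwise."""
    safe = lambda v: abs(v - v_rated) <= margin * v_rated
    if not safe(v_prev) and safe(v_now):
        return f_reward
    if safe(v_prev) and not safe(v_now):
        return f_penalty
    return 0.0

def immediate_reward(v_prev, v_now, u_prev, u_now, p_total, p_total_dr,
                     p_il, c_il, p_tou, w_vol=0.5, w_il=0.2, w_eoc=0.3):
    """Weighted immediate reward of (5), built from (6), (7) and the voltage rule.
    Inputs are one step of measurements; term-wise normalization is omitted here."""
    r_vol = sum(voltage_reward(vp, vn) for vp, vn in zip(v_prev, v_now))
    r_il = sum(-abs(un - up) for up, un in zip(u_prev, u_now))            # eq. (6)
    r_eoc = p_total * p_tou - (p_total_dr * p_tou
            + sum(p * c * (1 - u) for p, c, u in zip(p_il, c_il, u_prev)))  # eq. (7)
    return w_vol * r_vol + w_il * r_il + w_eoc * r_eoc
```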

B. Markov Decision Process Framework

With the state, action, and reward function defined, the MDP framework of DR management for IL is shown in Fig. 2. During each step, depending on the current state of the independent distribution system, the ADRT chooses an action (i.e., the DR strategy of IL) based on its current DDQN structure. The system feeds back a reward and then enters the next state due to the variation of the system power flow. Through the interaction between the ADRT and the system, the DDQN gradually updates its parameters during each learning episode, which finally yields the DR strategy of IL with the maximum cumulative reward.

Fig. 2. MDP framework of DR for IL.

IV. DEEP REINFORCEMENT LEARNING-BASED ALGORITHM FOR DR MANAGEMENT

A. Action Estimation Function

Formally, the goal of the DR management for IL is to find the optimal control strategy that maps the observation state $s_t$ to the control action $a_t$. In this paper, $Q^*(s_t, a_t)$ (i.e., the optimal action estimation function) is used to represent the maximum accumulative reward of action $a_t$ in state $s_t$, which is calculated by the Bellman equation as follows:

$$Q^*(s_t, a_t) = E\Big[ r_{t+1} + \gamma \max_{a_{t+1} \in A_{t+1}} Q^*(s_{t+1}, a_{t+1}) \,\Big|\, s_t, a_t \Big] \qquad (8)$$

where $E$ denotes the expectation, $\gamma \in [0,1]$ is a discount factor which indicates the importance of future reward relative to immediate reward, and $A_{t+1}$ represents the set of actions that can be performed at time t+1. The Q-learning method is used to update the value estimation based on the state action transitions $(s_t, a_t, r_{t+1}, s_{t+1})$, as shown in (9):

$$Q^*(s_t, a_t) \leftarrow (1-\alpha)\, Q(s_t, a_t) + \alpha \Big[ r_{t+1} + \gamma \max_{a_{t+1} \in A_{t+1}} Q(s_{t+1}, a_{t+1}) \Big] \qquad (9)$$

where $Q(s_t, a_t)$ is the current estimate of the optimal action estimation function, and $\alpha \in [0,1]$ is the learning rate, which represents how much of the result of previous training is retained. Such a value iteration algorithm converges to the optimal action value function given a sufficient number of iterations.

B. Value Function Approximation Based on DDQN

The absolute values of the state and action dimensions of IL control differ considerably, which causes noise and instability in the Q-learning algorithm [24]. In this section, a dueling deep Q network (DDQN) is therefore taken to approximate the optimal Q value in (8). Fig. 3 shows the value function approximation process based on DDQN.

Fig. 3. Value function approximation process based on DDQN.

As depicted in Fig. 3, the significant difference between DDQN and DQN is the structure of the estimation neural network. Different from the traditional DQN, which contains only one state action estimation network, the dueling DQN architecture represents both the state value network $V(s_t)$ and the action advantage network $A(s_t, a_t)$ within a single deep model, whose output combines the two to produce the state-action value $Q(s_t, a_t)$. The Q function based on the DDQN structure is defined as (10):

$$Q(s_t, a_t) = V(s_t) + \Big[ A(s_t, a_t) - \frac{1}{|\mathcal{A}|} \sum_{a_t'} A(s_t, a_t') \Big] \qquad (10)$$

where $\mathcal{A}$ represents the set containing all executable actions and $|\mathcal{A}|$ represents the number of all executable actions. The action advantage is set to the individual action advantage minus the average of all action advantages in a given state, so as to remove redundant degrees of freedom and improve the stability of the algorithm.
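To make the dueling structure of (10) concrete, the following is a possible TensorFlow/Keras sketch of such a network, using the layer sizes reported later in the case study (9 state inputs, hidden layers of 6 and 8 ReLU neurons, and 16 action combinations); it is an illustrative implementation of (10), not code released with the paper.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

def build_dueling_q_network(state_dim=9, num_actions=16):
    """Dueling Q network: shared input, separate V and A streams, combined as in (10)."""
    state_in = layers.Input(shape=(state_dim,))

    # State value stream V(s_t): one hidden layer of 6 ReLU units, scalar output
    v_hidden = layers.Dense(6, activation="relu")(state_in)
    v = layers.Dense(1)(v_hidden)

    # Action advantage stream A(s_t, a_t): one hidden layer of 8 ReLU units, one output per action
    a_hidden = layers.Dense(8, activation="relu")(state_in)
    a = layers.Dense(num_actions)(a_hidden)

    # Q(s_t, a_t) = V(s_t) + (A(s_t, a_t) - mean over a' of A(s_t, a'))
    q = layers.Lambda(
        lambda va: va[0] + (va[1] - tf.reduce_mean(va[1], axis=1, keepdims=True))
    )([v, a])

    return Model(inputs=state_in, outputs=q)

ddqn = build_dueling_q_network()
```

A plain DQN baseline, as compared against in the simulations, would collapse the two streams into a single fully connected output of 16 Q values.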

In the following, the key concepts of the DRL-based algorithm are formulated.

(1) In order to balance exploration and exploitation, the ε-greedy policy is used for action selection as in (11):

$$a_t = \begin{cases} \text{random action}, & \lambda \le (T-t)\,\epsilon / T \\ \arg\max_{a_t'} Q(s_t, a_t'), & \lambda > (T-t)\,\epsilon / T \end{cases} \qquad (11)$$

where $\epsilon$ is a fixed constant, T and t are the total number of learning steps and the current iteration step, respectively, and $\lambda$ ($0 < \lambda \le 1$) is a random number generated by the computer, so that the exploration probability decays linearly over the training.

(2) The Q value estimates for all control actions can be calculated by performing a forward pass through the DDQN. The mean squared error between the target Q value and the output of the neural network is defined as the loss function in (12):

$$Loss(w) = \frac{1}{n} \sum_{i=1}^{n} \big[ Q^*(s_t, a_t) - Q'(s_t, a_t, w) \big]^2 \qquad (12)$$

where $Q^*(s_t, a_t)$ (i.e., the target Q value) is calculated as in (8), $Q'(s_t, a_t, w)$ is the output of the DDQN with parameters $w$, and n denotes the mini-batch size.

(3) In order to eliminate the strong correlations between samples collected within a short period, experience replay is adopted to store the state action transitions $(s_t, a_t, r_{t+1}, s_{t+1})$ at each time step in an experience pool with capacity N.

Based on the above introductions, the DRL-based DR management algorithm is provided in Algorithm 1. In general, the algorithm can be decomposed into three stages: initialization (lines 1 to 3), accumulation of experience (lines 5 to 10), and learning from experience (lines 11 to 15). During initialization, the hyper parameters of the DRL algorithm are set. Then, the DDQN is initialized with random parameters w and the experience pool is initialized as an empty set. Commencing on line 4, the algorithm enters episodic iteration. At the beginning of each episode, the initial state is reset randomly to eliminate the coupling between samples and time during learning. The algorithm engages in experience accumulation from lines 5 to 10. In detail, as the step counter t increases, action at is selected according to the ε-greedy policy, and the state action transition tuple is stored in the experience pool in succession. Once the number of samples in the pool exceeds the replay-start size M, the learning from experience is conducted from lines 11 to 15. In detail, a batch of n samples is drawn randomly from the pool in line 12. Then, the target Q values and the predicted Q values of the samples are calculated respectively in line 13; based on these, the loss function is calculated as (12) in line 14. Finally, the batch gradient descent (BGD) method is used to update the weights of the DDQN in line 15.

Algorithm 1: DRL Algorithm for DR management of IL
1. Set hyper parameters: α, γ, ε, n, T, N, M
2. Initialize DDQN with random parameters w
3. Initialize the experience pool as an empty set
4. For t = 1, T, do:
5.   Reset state st randomly
6.   Select action at with the ε-greedy policy
7.   Receive new state st+1 and reward rt+1
8.   If the number of samples > N:
9.     Remove the oldest observation sample
10.  Store (st, at, rt+1, st+1) into the experience pool
11.  If the number of samples > the replay-start size M:
12.    Sample a random mini-batch of (st, at, rt+1, st+1) with number n from the experience pool
13.    Obtain the target Q values and the predicted Q values respectively
14.    Calculate the loss function as (12)
15.    Update the parameters w of DDQN by performing the BGD method
16.  t = t + 1
17. End for

C. The DR Strategy of IL Based on DDQN

On the one hand, the action execution and the network training of the DDQN should be separated in the ADR system due to the limited computing ability of the ADRT. On the other hand, the observation state varies with the user load in the independent distribution system, which requires the learning process of the DDQN at the computing center to be carried out continually. Therefore, in order to achieve real-time DR management, the DDQN of the ADRT should be periodically updated by copying parameters from that of the computing center over a constant time frame, such as the control cycle of DR.

The DR control flow of IL based on DDQN is shown in Fig. 4, where Tmax denotes the maximum control cycle of DR and is selected as 24 hours in this paper. As shown in Fig. 4, the ADRT periodically reads the parameters of the DDQN at the computing center and updates its own parameters through the bidirectional communication network. Therefore, the optimal DR strategy of IL in each observation state is selected by just performing a forward calculation and comparing the Q values of the different actions, which greatly reduces the calculation time of the ADRT and makes real-time application of ADR possible.

Fig. 4. DR control flow of IL based on DDQN.

V. CASE STUDY AND RESULT ANALYSIS

A. Case Scenario

In this paper, the IEEE 33-node extension system is selected as a typical model of a medium voltage distribution system, as shown in Fig. 5, where nodes #6, #13, #23, and #29 are connected to IL, whereas the remaining nodes are connected to general load. To stay consistent with the original load level, the realistic daily load curve at each node in the extension system aggregates low voltage load profiles selected from the IEEE European Low Voltage Test Feeder [26] to fit the IEEE 33-node system. The system parameter setting is consistent with the literature.

Fig. 5. The IEEE 33-node extension system with IL.

It is assumed that there are measurement devices at the IL access nodes (#6, #13, #23, #29), the root node (#0), and the end nodes (#17, #21, #24, #32) of the system, making the power of the root node and the IL access nodes, and the voltage of the end nodes, observable, respectively. The measurement data are sampled every 15 minutes (forming a 33-node distribution system with a total of 96 consumption patterns a day), which makes the system operation flexible and changeable. The current TOU tariff of the power grid and the compensation scheme [27] for each IL are shown in Table I and Table II, respectively.

Table I
TOU TARIFF OF GRID
Periods | Time (h) | Price (yuan/kWh)
Peak | 08:00-12:00, 16:00-23:00 | 1.25
Flat | 06:00-08:00, 12:00-16:00, 23:00-24:00 | 0.8
Valley | 00:00-06:00 | 0.4

Table II
COMPENSATION SCHEME FOR IL
IL | Compensation cost (yuan/kWh)
6 | 1.2
13 | 0.84
23 | 0.96
29 | 1.4

B. Simulation Parameter Settings

According to the above case scenario, the sample data of the observation state, control action, and immediate reward function in the MDP framework of Section III are formulated in Table III, where the weights of the immediate reward are selected according to the specific operational target in practice. In particular, the target of the DR management is to reduce the total operation costs of the DSO on the premise of regulating the voltage to the safe limit, so the weights of the voltage reward, interruption reward, and economic reward in our case study are taken as 0.5, 0.2, and 0.3, respectively. Meanwhile, the reward coefficient and penalty coefficient are taken as 1 and -1, respectively, after normalization.

Table III
FORMULATION OF THE SAMPLE DATA
Function | Corresponding sample data
Observation state st | $[(U_t^{17}, U_t^{21}, U_t^{24}, U_t^{32}),\ P_t^{total},\ (P_t^{6}, P_t^{13}, P_t^{23}, P_t^{29})]$
Control action at | $[u_{t+1}^{6}, u_{t+1}^{13}, u_{t+1}^{23}, u_{t+1}^{29}]$
Immediate reward rt+1 | Obtained by (5), where $w_{vol}$ = 0.5, $w_{il}$ = 0.2, $w_{eoc}$ = 0.3, $F_{reward}$ = +1, and $F_{penalty}$ = -1

Then, the DDQN structure in the simulation is set as follows: the input layer of both the state value network V(st) and the action advantage network A(st, at) has 9 neurons (i.e., st). The hidden layers of V(st) and A(st, at) have 6 and 8 neurons, respectively, and both use the rectified linear unit (ReLU) as the activation function. The output layers of V(st) and A(st, at) have 1 and 16 neurons (representing the number of action combinations of IL), respectively. All these layers are fully connected. The hyper parameters of Algorithm 1 used in the simulation are summarized in Table IV. Specifically, the value of the discount factor in the case is 0.7, and the other hyper parameters related to the neural network in Table IV are determined based on common practice recommended by the deep learning community [28].

Table IV
HYPER PARAMETERS OF DRL ALGORITHM
Hyper Parameters | Value
discount factor γ | 0.7
learning rate α | 0.01
fixed constant ε | 1
mini-batch size n | 30
total learning steps T | 3000
replay memory capacity N | 1000
replay-start size M | 100
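Because the advantage stream outputs one value per action combination, each of the 16 output indices has to be mapped to a 0/1 interruption vector as in (4). The following Python sketch shows one possible encoding, assuming the index is read as a 4-bit pattern over the IL nodes; this ordering is an illustrative assumption, not one specified in the paper.

```python
IL_NODES = [6, 13, 23, 29]

def index_to_action(index, num_il=4):
    """Decode an action index in [0, 2**num_il) into the 0/1 vector of (4),
    where 0 means the IL is interrupted and 1 means it keeps consuming."""
    return [(index >> k) & 1 for k in range(num_il)]

def action_to_index(action):
    """Inverse mapping, e.g., when storing transitions."""
    return sum(bit << k for k, bit in enumerate(action))

# Example: index 11 -> [1, 1, 0, 1], i.e., only the IL at node 23 is interrupted
print(dict(zip(IL_NODES, index_to_action(11))))
```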

C. Results and Analysis

Applying the algorithm of Section IV to the case study, the learning process of the DRL algorithm is programmed with the aid of TensorFlow, an open-source deep learning framework developed by Google. Firstly, the learning performance of the neural network is evaluated. The comparison of the loss function between the DDQN and the traditional DQN structure during iterative training is shown in Fig. 6. Compared with DQN, the loss function using the DDQN structure declines faster and reaches a smaller value at the end. Meanwhile, the fluctuation of the loss function is smaller, indicating that the algorithm with DDQN is more stable.

Fig. 6. Comparison of loss function between DDQN and DQN during the learning process.

In order to demonstrate the convergence of the proposed algorithm, Fig. 7 shows the comparison of the cumulative reward at time 19:00 (when the voltage is the lowest of the day) between DDQN and DQN during iterative training. It is obvious from Fig. 7 that as experience is gained via episodic iteration, the cumulative reward gradually increases and finally converges to the maximum. Compared with DQN, the maximum cumulative reward with DDQN at about episode 2,000 is much larger, yielding the optimal policy, which means the optimal action with the maximum cumulative reward is selected. Even though there are some fluctuations caused by the random actions of the ε-greedy policy during the training process, the overall trend of the curve proves the convergence of the algorithm.

Fig. 7. Comparison of cumulative reward at time 19:00 between DDQN and DQN during the learning process.

After the learning process, the DR management of IL based on DDQN is performed according to the flow in Fig. 4. For the purpose of voltage regulation, Fig. 8 shows the comparison of the voltage regulation of all end nodes in the system between DDQN and DQN, where the dotted lines represent the initial nodal voltages without DR and the solid lines denote the optimized voltages with DR. Under the initial load level without DR, the voltage of nodes #17 and #32 is low or even over the limit, indicating that the existing load level causes the low voltage problem at the end nodes of the system. Fig. 8(a) shows that the regulation effect with DQN is limited, as the voltage of node #17 is still over the limit after DR. In contrast to DQN, the voltage of nodes #17 and #32 with the DR management of DDQN rises significantly from the over-limit status to the safe status (i.e., within ±5% of the rated voltage), which verifies the effectiveness of the voltage reward. Therefore, the voltage regulation effect with DDQN is obvious, ensuring that the voltage of all the nodes in the system stays within the safe limit and thus effectively solving the low voltage problem of the distribution network.

Fig. 8. Comparison of voltage regulation of all end nodes (#17, #21, #24, #32) between DDQN and DQN: (a) DQN structure; (b) DDQN structure.

Regarding the role of DR management in reducing the peak load demand and the operation costs, the comparison of the total load curve of the system between DDQN and DQN is shown in Fig. 9. Compared with the initial daily load without DR, the daily peak load with DQN, which appears at 18:00, is shaved by 13.6%, and the one with DDQN, which appears at 19:00, is shaved further by 20.5%, which verifies the effectiveness of the proposed algorithm with DDQN in shaving the peak load of the distribution network. Fig. 10 compares the total operation costs of the DSO in one day between DDQN and DQN. The original cost without DR is 36,181.1 yuan, all of which is the electricity bill paid to the grid. The cost with DR incorporates the compensation costs as well as the electricity bill. Specifically, the operation cost with DQN is 30,921.24 yuan in all, whereas the one with DDQN is 30,062.00 yuan, which verifies the effectiveness of the economic reward. Thus, the DR strategy of IL obtained by the proposed DDQN method reduces the total costs of the DSO by about 16.9% in one day, greatly improving the economic benefits under the premise of ensuring normal power supply to the other general users.

Fig. 9. Comparison of the total load curve between DDQN and DQN.

Fig. 10. Comparison of total operation costs in one day for DSO between DDQN and DQN.

It can be concluded from Fig. 6 to Fig. 10 that the DR strategy with DDQN is apparently superior to that with DQN. With regard to the action combinations of IL under the proposed DR strategy with DDQN, the optimized load curve of each IL in one day is shown in Fig. 11, where the blue line represents the initial load curve without DR whereas the red line represents the one with DDQN-based DR. A power of "0" indicates that the power supply to the IL is interrupted at that moment and that the interruption compensation fee is paid to the IL by the DSO at the same time. It can be seen from Fig. 11 that all ILs in the case are interrupted mainly in the intervals 08:00-13:00 and 17:00-23:00, which belong to the peak periods with heavy load and expensive electricity prices. Specifically, IL #6 is interrupted during 7:45-13:15; IL #13 is interrupted during 19:30-24:00; IL #23 is interrupted during 17:00-22:30; and IL #29 is interrupted during 16:45-23:30. Moreover, the interruption time of each IL is concentrated in one certain interval, which improves user satisfaction and verifies the effectiveness of the interruption reward.

Fig. 11. The optimized load curve of each IL with DDQN.

Taking the compensation costs for IL under the proposed DR strategy with the DDQN structure for further analysis, the proportion of each IL in the compensation costs is obtained as shown in Fig. 12, where the proportion of IL #23 is the largest whereas the proportion of IL #13 is the smallest. This is because the larger the IL capacity, the more compensation cost is required to interrupt its power consumption. Therefore, the interruptible price in practice needs to be officially confirmed by the electricity regulatory department, and there is a minimum capacity requirement for participating users.

Fig. 12. The proportion of each IL in compensation costs with DDQN.

As discussed earlier, the discount factor is used for the consideration of long-term behavior. To verify its adaptability to the DR management of IL, Fig. 13 shows the boxplot of the total cumulative rewards, i.e., the sum of the immediate rewards during a day, under different discount factor values. The DR management with γ = 0.7, which has the largest mean cumulative reward, outperforms those with the other γ values. This indicates that too small a γ may make the decision "short-sighted", while too large a value may also cause inaccurate action selection at some moments. Thus, the value of the discount factor should be selected via comparison for the specific control problem in practice.

Fig. 13. Impacts of discount factor on total cumulative rewards.

VI. CONCLUSION AND FUTURE WORK

This paper fully explored the optimized scheduling space of IL on the demand side and presented a value-based DRL method with DDQN to optimize the DR management of IL in the case of realistic daily user loads. Through the simulation analysis, the following conclusions can be drawn:

(1) The maximum long-term profit of the DR management for IL is obtained by formulating this problem as an MDP through the analysis of the cumulative reward.

(2) The proposed DRL-based algorithm realizes the direct mapping of the real-time state of the grid to the ADR management strategy, thus achieving the goals of both regulating voltage and reducing the total operation costs of the DSO.

(3) Based on the DDQN structure, this paper overcomes the noise and instability of the traditional DQN algorithm and illustrates the convergence stability of DDQN by comparing the decline of the loss function.

Further research will focus on taking more realistic constraints of the interruption contracts, such as the load curtailment, into consideration, which will better adapt to the production demand of IL and further improve user satisfaction.

REFERENCES
[1] P. Palensky and D. Dietrich, "Demand side management: demand response, intelligent energy systems, and smart loads," IEEE Trans. Ind. Inf., vol. 7, no. 3, pp. 381-388, Aug. 2011.
[2] A. Mohsenian-Rad, V. W. S. Wong, J. Jatskevich, R. Schober and A. Leon-Garcia, "Autonomous demand-side management based on game-theoretic energy consumption scheduling for the future smart grid," IEEE Trans. Smart Grid, vol. 1, no. 3, pp. 320-331, Dec. 2010.
[3] C. Li, X. Yu, W. Yu, G. Chen and J. Wang, "Efficient computation for sparse load shifting in demand side management," IEEE Trans. Smart Grid, vol. 8, no. 1, pp. 250-261, Jan. 2017.
[4] T. Logenthiran, D. Srinivasan and T. Z. Shun, "Demand side management in smart grid using heuristic optimization," IEEE Trans. Smart Grid, vol. 3, no. 3, pp. 1244-1252, Sept. 2012.
[5] J. S. Vardakas, N. Zorba and C. V. Verikoukis, "A survey on demand response programs in smart grids: pricing methods and optimization algorithms," IEEE Commun. Surv. Tutor., vol. 17, no. 1, pp. 152-178, Firstquarter 2015.
[6] H. Karimi, S. Jadid and H. Saboori, "Multi-objective bi-level optimisation to design real-time pricing for demand response programs in retail markets," IET Gener. Transm. Distrib., vol. 13, no. 8, pp. 1287-1296, May 2019.
[7] L. Zhao, Z. Yang and W. Lee, "The impact of time-of-use (TOU) rate structure on consumption patterns of the residential customers," IEEE Trans. Ind. Appl., vol. 53, no. 6, pp. 5130-5138, Nov. 2017.
[8] T. W. Gedra and P. P. Varaiya, "Markets and pricing for interruptible electric power," IEEE Trans. Power Syst., vol. 8, no. 1, pp. 122-128, Feb. 1993.
[9] K. Bhattacharya, M. H. J. Bollen and J. E. Daalder, "Real time optimal interruptible tariff mechanism incorporating utility-customer interactions," IEEE Trans. Power Syst., vol. 15, no. 2, pp. 700-706, May 2000.
[10] J. Wang, X. Wang and X. Ding, "The forward contract model of interruptible load in power market," in Proc. IEEE PES Transm. Distrib. Conf. Exhib. Asia Pacific, Dalian, China, 2005, pp. 1-5.
[11] M. H. Imani, K. Yousefpour, M. T. Andani and M. Jabbari Ghadi, "Effect of changes in incentives and penalties on interruptible/curtailable demand response program in microgrid operation," in Proc. IEEE Texas Power Energy Conf. (TPEC), Texas, USA, 2019, pp. 1-6.
[12] R. Bhana and T. J. Overbye, "The commitment of interruptible load to ensure adequate system primary frequency response," IEEE Trans. Power Syst., vol. 31, no. 3, pp. 2055-2063, May 2016.
[13] L. Yang, M. He, V. Vittal and J. Zhang, "Stochastic optimization-based economic dispatch and interruptible load management with increased wind penetration," IEEE Trans. Smart Grid, vol. 7, no. 2, pp. 730-739, Mar. 2016.
[14] J. C. S. Sousa, O. R. Saavedra and S. L. Lima, "Decision making in emergency operation for power transformers with regard to risks and interruptible load contracts," IEEE Trans. Power Deliv., vol. 33, no. 4, pp. 1556-1564, Aug. 2018.
[15] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. A. Riedmiller. (2013). Playing Atari with deep reinforcement learning. [Online]. Available: https://arxiv.org/abs/1312.5602.
[16] Y. Hu, W. Li, K. Xu, T. Zahid, F. Qin, and C. Li, "Energy management strategy for a hybrid electric vehicle based on deep reinforcement learning," Appl. Sci., vol. 8, no. 2, Jan. 2018.
[17] V. Mnih et al., "Human-level control through deep reinforcement learning," Nature, vol. 518, pp. 529-533, Feb. 2015.
[18] D. Silver et al., "Mastering the game of go with deep neural networks and tree search," Nature, vol. 529, no. 7587, pp. 484-489, 2016.
[19] J. R. Vázquez-Canteli and Z. Nagy, "Reinforcement learning for demand response: A review of algorithms and modeling techniques," Appl. Energy, vol. 235, pp. 1072-1089, Feb. 2019.
[20] H. Xu, K. Zhang, and J. Zhang, "Optimal joint bidding and pricing of profit-seeking load serving entity," IEEE Trans. Power Syst., vol. 33, no. 5, pp. 5427-5436, Sep. 2018.
[21] H. Xu, H. Sun, D. Nikovski, S. Kitamura, K. Mori and H. Hashimoto, "Deep reinforcement learning for joint bidding and pricing of load serving entity," IEEE Trans. Smart Grid, vol. 10, no. 6, pp. 6366-6375, Nov. 2019.
[22] R. Lu, S. H. Hong, and X. Zhang, "A dynamic pricing demand response algorithm for smart grid: Reinforcement learning approach," Appl. Energy, vol. 220, pp. 220-230, Jun. 2018.
[23] X. Huang, S. H. Hong, M. Yu, Y. Ding and J. Jiang, "Demand response management for industrial facilities: a deep reinforcement learning approach," IEEE Access, vol. 7, pp. 82194-82205, 2019.
[24] Z. Wang, T. Schaul, M. Hessel, H. van Hasselt, M. Lanctot, and N. de Freitas. (2015). Dueling network architectures for deep reinforcement learning. [Online]. Available: https://arxiv.org/abs/1511.06581.
[25] H. van Hasselt, A. Guez, and D. Silver. (2015). Deep reinforcement learning with double Q-learning. [Online]. Available: https://arxiv.org/abs/1509.06461v1.
[26] (2016). IEEE PES AMPS DSAS Test Feeder Working Group. [Online]. Available: http://sites.ieee.org/pes-testfeeders/resources/.
[27] L. Zhu, X. Zhou, L. Tang, and C. Lao, "Multi-objective optimal operation for microgrid considering interruptible loads," Power System Technology, vol. 41, no. 6, pp. 1847-1854, 2017.
[28] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. Cambridge, MA, USA: MIT Press, 2016.

BIOGRAPHIES

Biao Wang received the B.S. degree in electrical engineering from Wuhan University of Technology (WHUT), Wuhan, China, in 2018. He is currently pursuing the M.S. degree in electrical engineering at Huazhong University of Science and Technology (HUST). His current research interests include demand response, electric vehicles, and artificial intelligence.

Yan Li received the M.S. and Ph.D. degrees in electrical engineering from Huazhong University of Science and Technology (HUST), Wuhan, China, in 1999 and 2005, respectively. Currently, she is an Associate Professor with the School of Electrical and Electronics, HUST. Her current research interests include power system operation and control, distribution network planning, energy internet, smart/micro grid, distributed generation and so on.

Weiyu Ming received the B.S. degree from the Department of Electrical Engineering, Hefei University of Technology, Hefei, China, in 2019.
He is currently pursuing the M.S. degree in electrical engineering at Huazhong University of Science and Technology. His research interest includes distribution system operation optimization.

Shaorong Wang received the B.S. degree from Zhejiang University, Hangzhou, China, in 1984, the M.S. degree from North China Electric Power University, Baoding, China, in 1990, and the Ph.D. degree from the Huazhong University of Science and Technology (HUST), Wuhan, China, in 2004, all in electrical engineering. He is currently a Professor with the School of Electrical and Electronics, HUST. His current research interests include power system operation and control, smart grid, big data and power system planning.

