
The 4th International Conference on Smart Grid and Smart Cities

Reinforcement Learning Methods on Optimization Problems of Natural Gas Pipeline Networks

Dong Yang, Siyun Yan, Dengji Zhou*
The Key Lab of Power Machinery and Engineering of Education Ministry
Shanghai Jiao Tong University, Shanghai, China
e-mail: ydqmkkx@gmail.com, yansiyunay@sjtu.edu.cn, ZhouDJ@sjtu.edu.cn

Tiemin Shao, Lin Zhang, Tongsheng Xing
PetroChina Beijing Oil & Gas Pipeline Control Center, Beijing, China
e-mail: stm@petrochina.com.cn, linzhang@petrochina.com.cn, xingtongsheng@petrochina.com.cn

Abstract—Traditional methods for optimizing the transport and distribution of natural gas pipeline networks are still widely used, but they suffer from problems in efficiency, cost, and flexibility that are hard to solve within the traditional framework. In order to find the optimal solution under the constraints of each target of this optimization problem, this paper establishes a simulation model based on part of a natural gas pipeline network and applies a reinforcement learning (RL) algorithm to the model. The challenge of sparse rewards is also addressed. The optimal strategy for transporting and distributing gas in this model is then obtained under different demands and initial conditions. The operating parameters produced by the strategy can be displayed in the simulation model, where its advantages are fully reflected. Therefore, the scheme proposed in this paper can be applied directly or indirectly to the practical natural gas transport process.

Keywords—natural gas pipeline networks, simulation model, optimization problem, reinforcement learning, DDPG, sparse rewards, HER

I. INTRODUCTION

As a kind of clean energy, natural gas has seen its exploitation and utilization develop rapidly in recent decades, with a trend toward supplanting other traditional fossil energy sources. In addition to conventional natural gas, some developed countries such as the United States have begun to focus on shale gas and have made several breakthroughs. It can be predicted that in the coming decades, various kinds of natural gas will be applied more widely in production and daily life. However, compared with the development of the natural gas industry, the transport and distribution strategy of natural gas, especially in the field of long-distance natural gas pipeline networks, has not changed in essence. Although it has been used for a long time, the traditional optimization strategy does not perform well in efficiency and cost, since consumption differs across regions and across different times of each day. In order to meet user demand, the traditional optimization strategy can only fit users' habits roughly in certain areas, and oversupply and undersupply often occur in the process.

In recent years, with the continuous upgrading of computer hardware, reinforcement learning (RL) has stepped into the limelight with the prominent performance of AlphaGo. As a powerful and effective class of algorithms for solving complex optimization problems, RL has been widely applied in different fields. For the transport and distribution of natural gas pipeline networks, it is therefore theoretically feasible and promising to choose an appropriate RL method to obtain an ideal optimization strategy. Considering that the input parameters of the natural gas pipeline network are high-dimensional and continuous, and given the robustness required of the algorithm, the RL method based on Deep Deterministic Policy Gradient (DDPG) is the best choice. DDPG is a model-free, off-policy actor-critic algorithm based on the deterministic policy gradient that can learn policies in high-dimensional, continuous action spaces. Sparse rewards, an intractable and common challenge in RL, will also be met in this paper. To solve this problem, we use several techniques, the most powerful of which is Hindsight Experience Replay (HER). It allows the RL algorithm to gain more feedback from a sparse reward signal and helps the algorithm converge. Different kinds of rewards will also be discussed.

This paper presents a mathematical simulation model of a natural gas pipeline network based on Runge-Kutta methods. The DDPG-based algorithm is co-simulated with the model to obtain the specific policy, and the effect of the different methods is reflected through the parameters of the model.

Figure 1. Gas flow in a pipeline

TABLE I. NOMENCLATURE

P    pressure                        Pa
M    flow rate                       kg/s
c    sound velocity                  m/s
D    pipe diameter                   m
f    friction coefficient
L    pipe length                     m
s    state
a    action
μ    actor network
Q    critic network
θ    weights of network
T    total time of each episode
N    noise
Ё    environment
r    reward
g    goal
k    sampling frequency
W    compressor power                W
m    polytropic coefficient
R    gas constant of natural gas     J/(kg·K)
T1   temperature of input gas        K
η    power generation efficiency
ϵ    compression ratio

II. SIMULATION MODEL

For the gas flow in a one-dimensional pipeline, as shown in Fig. 1, the dynamic parameters should satisfy the law of mass conservation and Newton's second law as follows [1][2].

Conservation of mass:

\frac{A}{c^2}\frac{\partial P}{\partial t} = -\frac{\partial M}{\partial x}    (1)

Conservation of momentum:

\frac{\partial P}{\partial x} = -\frac{1}{A}\frac{\partial M}{\partial t} - \frac{f c^2 M^2}{2 A^2 D P}    (2)

Almost all of the notation used in this paper can be found in Table I; A denotes the cross-sectional area of the pipe. For the state parameters of each pipe, we use P_in and M_out to represent the input pressure and the output flow rate, and likewise P_out and M_in for the output pressure and the input flow rate. Based on finite element theory, over a short section of the pipeline we can write

\frac{\partial P}{\partial x} = \frac{P_{out} - P_{in}}{L}    (3)

\frac{\partial M}{\partial x} = \frac{M_{out} - M_{in}}{L}    (4)

The following equations are derived from (1), (2), (3) and (4); the partial differential equations are thereby transformed into ordinary differential equations:

\frac{dP_{out}}{dt} = \frac{c^2}{A}\frac{M_{in} - M_{out}}{L}    (5)

\frac{dM_{in}}{dt} = \frac{A}{L}\left(P_{in} - P_{out}\right) - \frac{f c^2 M_{in}^2}{2 D A P_{out}}    (6)

Figure 2. Model of the natural gas pipeline network

The model of the natural gas pipeline network [3] that we will simulate is shown in Fig. 2. The network consists of 15 nodes connected through 12 pipelines. Com 1-6 are motor-driven compressors driven by electricity. Natural gas is compressed by Com 1 and Com 2, injected into the network through node 1 and node 2, and enters another gas network through node 15. The gas flowing out of nodes 3-14 is supplied to users; the arrows in red represent output gas, whose pressure and flow rate demands must be met.

To solve ordinary differential equations such as (5) and (6), the fourth-order Runge-Kutta method (RK4) is used, whose algorithm is as follows:

\frac{dx}{dt} = f(t, x, y), \quad \frac{dy}{dt} = g(t, x, y)

f_1 = f(t_k, x_k, y_k)
g_1 = g(t_k, x_k, y_k)

f_2 = f(t_k + \tfrac{h}{2},\; x_k + \tfrac{h}{2}f_1,\; y_k + \tfrac{h}{2}g_1)
g_2 = g(t_k + \tfrac{h}{2},\; x_k + \tfrac{h}{2}f_1,\; y_k + \tfrac{h}{2}g_1)

f_3 = f(t_k + \tfrac{h}{2},\; x_k + \tfrac{h}{2}f_2,\; y_k + \tfrac{h}{2}g_2)
g_3 = g(t_k + \tfrac{h}{2},\; x_k + \tfrac{h}{2}f_2,\; y_k + \tfrac{h}{2}g_2)

f_4 = f(t_k + h,\; x_k + h f_3,\; y_k + h g_3)
g_4 = g(t_k + h,\; x_k + h f_3,\; y_k + h g_3)

x_{k+1} = x_k + \tfrac{h}{6}(f_1 + 2f_2 + 2f_3 + f_4)
y_{k+1} = y_k + \tfrac{h}{6}(g_1 + 2g_2 + 2g_3 + g_4)
t_{k+1} = t_k + h    (7)
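
To make the scheme above concrete, the following sketch (our own illustrative Python, not the authors' code) applies one RK4 step to the coupled ODEs (5) and (6) of a single pipe, taking x = P_out and y = M_in while P_in and M_out are held fixed as boundary values. The pipe properties are those of pipe 1 in TABLE IV, and A is taken as the pipe's cross-sectional area; the step size h is an assumption.

```python
import numpy as np

# Properties of pipe 1 (TABLE IV); A is the cross-sectional area.
c, fric, L, D = 300.0, 0.005, 200000.0, 1.0
A = np.pi * D ** 2 / 4.0

def dP_out_dt(P_in, P_out, M_in, M_out):
    # Equation (5): dP_out/dt = c^2 (M_in - M_out) / (A L)
    return c ** 2 * (M_in - M_out) / (A * L)

def dM_in_dt(P_in, P_out, M_in, M_out):
    # Equation (6): dM_in/dt = A (P_in - P_out)/L - f c^2 M_in^2 / (2 D A P_out)
    return A * (P_in - P_out) / L - fric * c ** 2 * M_in ** 2 / (2 * D * A * P_out)

def rk4_step(P_in, P_out, M_in, M_out, h):
    """One RK4 step for (x, y) = (P_out, M_in) with P_in and M_out held fixed."""
    f1 = dP_out_dt(P_in, P_out, M_in, M_out)
    g1 = dM_in_dt(P_in, P_out, M_in, M_out)
    f2 = dP_out_dt(P_in, P_out + h / 2 * f1, M_in + h / 2 * g1, M_out)
    g2 = dM_in_dt(P_in, P_out + h / 2 * f1, M_in + h / 2 * g1, M_out)
    f3 = dP_out_dt(P_in, P_out + h / 2 * f2, M_in + h / 2 * g2, M_out)
    g3 = dM_in_dt(P_in, P_out + h / 2 * f2, M_in + h / 2 * g2, M_out)
    f4 = dP_out_dt(P_in, P_out + h * f3, M_in + h * g3, M_out)
    g4 = dM_in_dt(P_in, P_out + h * f3, M_in + h * g3, M_out)
    return (P_out + h / 6 * (f1 + 2 * f2 + 2 * f3 + f4),
            M_in + h / 6 * (g1 + 2 * g2 + 2 * g3 + g4))

# Example: advance pipe 1 from its initial condition in TABLE II.
P_out_next, M_in_next = rk4_step(8273880.0, 7559378.0, 272.1, 272.1, h=1.0)
```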
TABLE II. INITIAL CONDITION OF PRESSURE AND GAS FLOW RATE

Pipeline   P_in (Pa)   P_out (Pa)   M_in (kg/s)   M_out (kg/s)
1          8273880     7559378      272.1         272.1
2          8956846     8253444      282.1         282.1
3          7559378     7503000      240.8         240.8
4          8253444     8200000      245.1         245.1
5          9974835     9952000      176.5         176.5
6          9968193     9944000      181.6         181.6
7          10218930    10211786     100           100
8          10218930    10211786     100           100
9          10211786    10210000     50            50
10         10211786    10210000     50            50

TABLE III. DEMAND OF PRESSURE AND GAS FLOW RATE

Node   P (Pa)      M (kg/s)   Node   P (Pa)      M (kg/s)
3      7559378     33.3       4      8253444     38.7
5      7503000     36.1       6      8200000     32.2
7      9974835     36.9       8      9968193     30.1
9      9952000     38.0       10     9944000     38.5
11     10218930    40.9       12     10211786    40.7
13     10211786    50         14     10210000    50

Figure 3. Calculation process of simulation model

For each iteration of the computation, among the pipelines in the network, P_in and P_out of pipe 11 and pipe 12 are fixed, so M_in and M_out can be calculated. For the other pipes, P_in and M_out are fixed and P_out and M_in can be calculated. Then all of the dynamic parameters in the model are updated through parameter passing and the next iteration begins. The entire calculation process of the simulation model is shown in Fig. 3.
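
The per-iteration procedure can be outlined roughly as below. This is only a schematic of the process in Fig. 3 under our own data layout (a dict per pipe); it reuses rk4_step and dM_in_dt from the previous sketch, treats pipe 11 and pipe 12 by integrating (6) with both pressures held constant and setting M_out = M_in (our simplifying assumption, not spelled out in the paper), and leaves the node-level parameter passing as a comment since the paper does not detail it.

```python
def network_step(pipes, h=1.0):
    """One iteration of the calculation process in Fig. 3 (illustrative sketch only).

    `pipes` maps a pipe id to a dict with keys 'P_in', 'P_out', 'M_in', 'M_out'.
    """
    for pid, s in pipes.items():
        if pid in (11, 12):
            # Pipe 11 / pipe 12: P_in and P_out are fixed, so the flow rates are computed.
            # Assumption: integrate (6) with both pressures constant and take M_out = M_in.
            s["M_in"] += h * dM_in_dt(s["P_in"], s["P_out"], s["M_in"], s["M_out"])
            s["M_out"] = s["M_in"]
        else:
            # Other pipes: P_in and M_out are fixed, P_out and M_in are advanced by RK4.
            s["P_out"], s["M_in"] = rk4_step(s["P_in"], s["P_out"],
                                             s["M_in"], s["M_out"], h)
    # Parameter passing: pressures and flow rates at shared nodes would now be copied
    # to the neighbouring pipes before the next iteration (topology omitted here).
    return pipes
```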
III. ALGORITHM

In this section, we briefly introduce the RL algorithms used in this paper; see Lillicrap et al. (2015) [4] and Andrychowicz et al. (2017) [5] for more details.

A. Deep Deterministic Policy Gradients (DDPG)

Deep Deterministic Policy Gradients (DDPG) [4] can be seen as a combination of the Actor-Critic [6] approach and the Deep Q-Network (DQN) [7]. There are two neural networks in DDPG, the actor network μ(s|θ^μ) and the critic network Q(s, a|θ^Q), with weights θ^μ and θ^Q. To enable exploration, DDPG adds noise sampled from a noise process N to the actor policy:

\mu'(s_t) = \mu(s_t|\theta_t^{\mu}) + N    (8)

After sampling a random mini-batch from the replay buffer, the critic network is trained in the same way as a Q-network, by minimizing the loss

L(\theta^Q) = \mathbb{E}_{\mu}\left[\left(Q(s_t, a_t|\theta^Q) - y_t\right)^2\right]    (9)

where

y_t = r(s_t, a_t) + \gamma Q(s_{t+1}, \mu'(s_{t+1})|\theta^Q)    (10)

The actor network is trained with the sampled gradient of the loss

L(\theta^{\mu}) = -\mathbb{E}_s\left[Q(s, \mu'(s_t)|\theta^Q)\right]    (11)
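
As an illustration of how losses (9)-(11) are typically implemented, the following PyTorch-style sketch performs one DDPG update on a sampled mini-batch. It is a generic textbook rendering under our own naming, not the authors' code; the target actor/critic networks and the soft-update rate tau are standard DDPG ingredients and are assumptions here, and the exploration noise of (8) is added only when the agent acts, not in this update.

```python
import torch
import torch.nn.functional as F

def ddpg_update(actor, critic, target_actor, target_critic,
                actor_opt, critic_opt, batch, gamma=0.99, tau=0.005):
    """One DDPG update on a mini-batch of tensors (s, a, r, s_next)."""
    s, a, r, s_next = batch

    # Critic: minimize L(theta_Q) = E[(Q(s_t, a_t) - y_t)^2], eqs. (9) and (10).
    with torch.no_grad():
        y = r + gamma * target_critic(s_next, target_actor(s_next))
    critic_loss = F.mse_loss(critic(s, a), y)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Actor: minimize L(theta_mu) = -E[Q(s, mu(s))], eq. (11).
    actor_loss = -critic(s, actor(s)).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()

    # Soft update of the target networks (standard DDPG practice).
    with torch.no_grad():
        for p, tp in zip(critic.parameters(), target_critic.parameters()):
            tp.mul_(1.0 - tau).add_(tau * p)
        for p, tp in zip(actor.parameters(), target_actor.parameters()):
            tp.mul_(1.0 - tau).add_(tau * p)
```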
TABLE IV. PROPERTY PARAMETERS OF PIPELINES

Pipeline   c (m/s)   f        L (m)     D (m)
1          300       0.005    200000    1
2          300       0.005    200000    1
3          300       0.0005   200000    1
4          300       0.0005   200000    1
5          300       0.0005   200000    1
6          300       0.0005   200000    1
7          300       0.0005   200000    1
8          300       0.0005   200000    1
9          300       0.0005   200000    1
10         300       0.0005   200000    1
11         300       0.0005   100000    1
12         300       0.0005   100000    1

B. Hindsight Experience Replay (HER)

Hindsight Experience Replay (HER) (Andrychowicz et al., 2017) [5] was created to address the problem of sparse rewards in RL. It is suitable for cases where rewards are binary and so sparse that the RL algorithm can hardly get enough feedback. Suppose an action is obtained from the policy π, a_t = π(s_t), and the reward value r_t = r(s_t, a_t) is computed; the transition (s_t, a_t, r_t, s_{t+1}) is then stored in the replay buffer.

HER introduces another variable, the goal, into the transition; that is to say, (s_t||g, a_t, r_t, s_{t+1}||g) is stored in the replay buffer, where 'g' denotes the goal. Accordingly, the action and reward value become a_t = π(s_t, g_t) and r_t = r(s_t, a_t, g_t).

Andrychowicz et al. (2017) [5] propose several strategies for setting the value of the goal; the one used in this paper is called 'future': for each transition being replayed, we replay with k random states that come from the same episode as that transition and were observed after it. Here k is a hyperparameter.
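
A minimal sketch of the 'future' strategy is given below (our own illustration, not the authors' code): for every transition, up to k later states from the same episode are sampled as substitute goals, the reward is recomputed against them, and the relabelled transition with the goal attached to the states is stored.

```python
import random

def her_future_relabel(episode, reward_fn, k=1):
    """Augment an episode with HER 'future' goals.

    `episode` is a list of (s, a, r, s_next) tuples in time order;
    `reward_fn(s_next, a, g)` recomputes the reward with respect to goal g.
    Returns relabelled transitions ((s, g), a, r_new, (s_next, g)),
    i.e. the s||g concatenation of the paper represented as a pair.
    """
    relabelled = []
    for t, (s, a, r, s_next) in enumerate(episode):
        future = episode[t + 1:]
        for _ in range(min(k, len(future))):
            # Pick a state observed later in the same episode and treat it as the goal.
            g = random.choice(future)[3]
            r_new = reward_fn(s_next, a, g)
            relabelled.append(((s, g), a, r_new, (s_next, g)))
    return relabelled
```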
IV. ASSIGNMENT

A. Description of Assignment

The natural gas pipeline network in Fig. 2 runs stably without pipe 11 and pipe 12, which are shown in blue. To ease the transport pressure and improve stability, we need to add pipe 11 and pipe 12 to the pipeline network under the condition that users' demands are not affected. Namely, after the redesign, the gas flow rate demand shown in red in Fig. 2 should be the same as before, and the pressure of pipes 3-14 should be greater than or equal to that in the previous situation.

In order to ensure safety, for each pipeline the pressure should be less than 16 MPa and the gas flow rate should be no more than 500 kg/s. This assignment is a typical multifactor optimization decision problem; the variables that we need to optimize and control are the gas flow rates of the two inlets and the power of the six compressors.

The initial condition of pressure and flow rate is shown in TABLE II. As can be seen, once the flow of natural gas in a pipeline reaches steady state, the input flow rate and the output flow rate are the same. The demand for pressure and gas flow rate is shown in TABLE III, and the property parameters of the pipelines, including pipe 11 and pipe 12, are shown in TABLE IV.
B. Environment

States: s = {s_P, s_M}. The pressure of each node and the gas flow rate of each pipeline constitute the elements of the state. Normalization of the state is necessary.

Actions: a = {a_Com, a_M}. Actions are made up of the compression ratios of the compressors and the flow rates of the inlets (pipe 1 and pipe 2).

Observations: s' = Ё(a). After the environment Ё receives an action, the resulting state is the observation.

Goals: g = {g_P, g_M}, g_t ← s_k. As discussed in Section III, we take the 'future' method: in the HER process, for each episode whose total time is T, a time k is chosen in the range (t, T), and s_k is used as g_t. In normal loops, of course, g is the optimization target of pressure and gas flow rate.
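
The environment interface above can be pictured as a small gym-style wrapper around the simulation model. The sketch below is purely illustrative: the `simulate` callable, the scaling by the safety limits of Section IV-A, and the default episode horizon are our own assumptions.

```python
import numpy as np

P_MAX, M_MAX = 16.0e6, 500.0     # safety limits of Sec. IV-A, reused here as scales

class PipelineEnv:
    """Illustrative wrapper: s = {s_P, s_M}, a = {a_Com, a_M}, g = {g_P, g_M}."""

    def __init__(self, simulate, goal_p, goal_m, horizon=5):
        self.simulate = simulate                     # callable: action -> (pressures, flows)
        self.goal = self._normalize(goal_p, goal_m)  # goals live in the normalized space
        self.horizon = horizon
        self.t = 0
        self.state = None

    def reset(self, pressures, flows):
        self.t = 0
        self.state = self._normalize(pressures, flows)
        return np.concatenate([self.state, self.goal])    # s || g, as stored with HER

    def step(self, action):
        pressures, flows = self.simulate(action)          # observation s' = Ё(a)
        self.state = self._normalize(pressures, flows)
        self.t += 1
        done = self.t >= self.horizon
        return np.concatenate([self.state, self.goal]), done

    @staticmethod
    def _normalize(pressures, flows):
        # Scale by the safety limits so every component lies roughly in [0, 1].
        return np.concatenate([np.asarray(pressures, dtype=float) / P_MAX,
                               np.asarray(flows, dtype=float) / M_MAX])
```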
C. Reward Function (R)

The reward function of our algorithm is determined by the targets of the optimization problem. Our first goal is to meet the users' demand for the pressure and flow rate of the natural gas without causing damage to the pipelines. The second goal is to reduce the cost as much as possible, that is, to reduce the total power of the compressors.

1) Fundamental Reward (R1)

The fundamental reward determines whether or not the target state has been reached; it is the penalty term of the reward function and can be designed with binary rewards or shaped rewards. Andrychowicz et al. (2017) [5] also discussed this question and reported that binary rewards perform better. We do not know whether this conclusion applies to our assignment, so we will compare the effects of these two reward designs.

a) Binary rewards:
R1 = -1 if any of the following holds:
  s_P < g_P
  s_M > g_M
  s_P > 16 MPa
  s_M > 500 kg/s
  s_P < 0
  s_M < 0
R1 = 0 otherwise.

b) Shaped rewards:
R1 = 0, then:
  if s_P < g_P:       R1 -= normalize((s_P - g_P)^2)
  if s_M > g_M:       R1 -= normalize((s_M - g_M)^2)
  if s_P > 16 MPa:    R1 -= normalize((s_P - 16000000)^2)
  if s_M > 500 kg/s:  R1 -= normalize((s_M - 500)^2)
  if s_P < 0:         R1 -= normalize(s_P^2)
  if s_M < 0:         R1 -= normalize(s_M^2)

Through normalization, the absolute value of each term of the penalty is kept between 0 and 1.

2) Consumption Reward (R2)

The power of each compressor is given by

W = \frac{m}{m-1}\frac{R T_1}{\eta}\left[\epsilon^{\frac{m-1}{m}} - 1\right] M    (12)

where m is the polytropic coefficient, R is the gas constant of natural gas, T_1 is the temperature of the input gas, η is the power generation efficiency of the compressor, ϵ is the compression ratio, and M is the gas flow rate. The compressor parameters used in this paper are shown in TABLE V. After normalization, the total power is W = ∑W_i, i ∈ {1, 2, 3, 4, 5, 6}. Then

R2 = 0 if R1 < 0
R2 = sigmoid(-W) if R1 = 0

Therefore, with the fundamental demands fulfilled, the higher the total power of the compressors, the smaller the value of R2.

Finally, R = R1 + R2.

TABLE V. PARAMETERS OF COMPRESSORS

Com   m     R (J/kg·K)   T (K)   η
1     1.3   518.75       289     0.84
2     1.3   518.75       289     0.82
3     1.3   518.75       289     0.78
4     1.3   518.75       289     0.8
5     1.3   518.75       289     0.83
6     1.3   518.75       289     0.8
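
To make the reward definition concrete, a minimal sketch of R1 (binary and shaped), the compressor power of (12), and R2 is given below. The per-component treatment and the `normalize` scaling are our own assumptions; the paper only states that each penalty term is kept between 0 and 1.

```python
import math

P_MAX, M_MAX = 16.0e6, 500.0          # limits: 16 MPa and 500 kg/s

def _norm(term, scale):
    """Keep a squared penalty term within [0, 1] (scaling is our assumption)."""
    return min(term / scale ** 2, 1.0)

def r1_shaped(s_p, s_m, g_p, g_m):
    """Shaped fundamental reward R1 for one pressure/flow component."""
    r1 = 0.0
    if s_p < g_p:    r1 -= _norm((s_p - g_p) ** 2, P_MAX)
    if s_m > g_m:    r1 -= _norm((s_m - g_m) ** 2, M_MAX)
    if s_p > P_MAX:  r1 -= _norm((s_p - P_MAX) ** 2, P_MAX)
    if s_m > M_MAX:  r1 -= _norm((s_m - M_MAX) ** 2, M_MAX)
    if s_p < 0:      r1 -= _norm(s_p ** 2, P_MAX)
    if s_m < 0:      r1 -= _norm(s_m ** 2, M_MAX)
    return r1

def r1_binary(s_p, s_m, g_p, g_m):
    """Binary fundamental reward R1: -1 if any condition is violated, otherwise 0."""
    return -1.0 if r1_shaped(s_p, s_m, g_p, g_m) < 0 else 0.0

def compressor_power(m, R, T1, eta, eps, M):
    """Equation (12): W = m/(m-1) * R*T1/eta * (eps^((m-1)/m) - 1) * M."""
    return m / (m - 1) * R * T1 / eta * (eps ** ((m - 1) / m) - 1) * M

def r2_consumption(r1, total_power):
    """Consumption reward R2; `total_power` is the normalized total W = sum(W_i)."""
    if r1 < 0:
        return 0.0
    return 1.0 / (1.0 + math.exp(total_power))   # sigmoid(-W): lower power, higher R2

# Total reward: R = R1 + R2
```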

V. RESULTS AND ANALYSIS

In order to ensure the stability, convergence, and speed of the RL algorithm, the hyperparameter T of each DDPG episode is set to 5, considering the range of the compression ratios of the compressors. To prevent overfitting, the hyperparameter k in HER is set to 1, and the HER sampling process is only carried out when the time step t is in the domain {0, 1, 2}.

States and actions are all normalized before being input to the neural networks. Because of the high dimension of the states, convolutional neural networks are used for acceleration. Owing to the influence of random processes (including initialization), we use the same random seeds to make sure that the comparison of results from different methods is not disturbed.

Because we use the data from episode 1 to episode 300000, it is impractical to show the reward value of every episode directly. For Fig. 4, we count the number of episodes that reach the target in every 1000 episodes and calculate the success rate. To make the trend clearer, we also give the success rate over every 10000 episodes in Fig. 5.

Figure 4. Success rate I
Figure 5. Success rate II
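
The success-rate curves in Fig. 4 and Fig. 5 follow directly from a per-episode success flag; the small helper below (our own, shown only to make the binning explicit) computes them.

```python
import numpy as np

def success_rate(success_flags, bin_size=1000):
    """Fraction of successful episodes in consecutive bins of `bin_size` episodes."""
    flags = np.asarray(success_flags, dtype=float)
    n = len(flags) // bin_size * bin_size          # drop an incomplete trailing bin
    return flags[:n].reshape(-1, bin_size).mean(axis=1)

# rate_1k = success_rate(flags, 1000)    # Fig. 4
# rate_10k = success_rate(flags, 10000)  # Fig. 5
```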
We can see that DDPG with shaped rewards starts faster than DDPG with binary rewards, which means that in the early stage of learning, shaped rewards give DDPG more explicit feedback. That is why DDPG with only binary rewards could not converge and find the solution within 300000 episodes.

After 150000 episodes, the curve of Binary + HER runs more stably than the others, which shows that this method has higher learning efficiency and makes the agent more stable.

To further reflect the character of the results of these methods, we show the positive reward values in a scatter diagram in Fig. 6 and their statistical character in TABLE VI.

Figure 6. Positive value of rewards

TABLE VI. STATISTICAL CHARACTER

Method         Max      Min      Mean     Std
Binary + HER   0.4618   0.4345   0.4521   0.0025
Shaped + HER   0.4628   0.4331   0.4509   0.0030
Shaped         0.4652   0.4267   0.4531   0.0045

It can be observed that HER can improve the robustness of DDPG; however, high stability means less exploration, and low stability always brings more possibility.

For this assignment, we choose the best result from the shaped rewards method. The corresponding compressor powers are listed in TABLE VII, and the condition of pressure and gas flow rate of the pipeline network after applying the policy and the redesign is shown in TABLE VIII.

TABLE VII. POWER OF COMPRESSORS IN THE CASE OF MAX REWARD (UNITS: MW)

Com     Binary + HER   Shaped + HER   Shaped
1       39.98          47.27          39.78
2       30.12          20.71          27.62
3       8.14           11.75          9.41
4       11.02          9.08           6.02
5       1.16           0.17           0.15
6       1.41           2.25           0.65
Total   91.84          91.23          83.63

TABLE VIII. CONDITION OF PRESSURE AND GAS FLOW RATE AFTER REDESIGN

Pipeline   P_in (Pa)   P_out (Pa)   M_in (kg/s)   M_out (kg/s)
1          9405972     8483059      327.6         327.6
2          8981856     8482939      240.9         240.9
3          8483059     8416280      277.6         277.6
4          8482939     8441536      218.9         218.9
5          10260098    10230253     204.6         204.6
6          9994554     9976635      156.5         156.5
7          10280876    10269663     125.6         125.6
8          10273725    10269477     77.3          77.3
9          10269663    10267686     52.8          52.8
10         10269477    10267686     50.2          50.2
11         8482939     8483059      16.7          16.7
12         10269663    10269477     22.9          22.9

VI. CONCLUSIONS

In this paper, we built a simulation model of a natural gas pipeline network and then used the RL algorithm DDPG to solve its design and optimization problem. The influence of binary rewards versus shaped rewards and the effect of HER were also discussed. The following conclusions can be drawn:

1) RL methods such as the DDPG algorithm can be used in the design and optimization problems of natural gas pipeline networks and achieve good performance.

2) When only binary rewards can be set, HER helps the algorithm converge and improves its robustness, sometimes with a powerful effect. Besides, if the agent obtained after training needs to be reused, for example in transfer learning, the Binary + HER method will give the agent better robustness.

3) For optimization problems where the optimal policy is required, the shaped rewards method may work surprisingly well.

ACKNOWLEDGMENT

This work is supported by the National Natural Science Foundation of China (Grant No. 51706132).

REFERENCES

[1] Y. Zhang, H. Dai, N. Jenkins, "Studies on the analytical method of combined gas and electricity networks," CEPRI, 2005.
[2] X. Shen, Y. Li, "Studies on Transient Simulation and Optimization Technology for Trunk Gas Network," China University of Petroleum, 2010.
[3] D. Zhou, S. Ma, D. Huang, H. Zhang, S. Weng, "An Operating State Estimation Model for Integrated Energy Systems Based on Distributed Solution," unpublished.
[4] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, D. Wierstra, "Continuous control with deep reinforcement learning," arXiv preprint arXiv:1509.02971, 2015.
[5] M. Andrychowicz, F. Wolski, A. Ray, J. Schneider, R. Fong, P. Welinder, B. McGrew, J. Tobin, P. Abbeel, W. Zaremba, "Hindsight Experience Replay," arXiv preprint arXiv:1707.01495, 2017.
[6] R. S. Sutton, D. McAllester, S. Singh, Y. Mansour, "Policy gradient methods for reinforcement learning with function approximation," NIPS, vol. 99, pp. 1057-1063, November 1999.
[7] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, M. Riedmiller, "Playing Atari with Deep Reinforcement Learning," arXiv preprint arXiv:1312.5602, 2013.
[8] D. Silver, G. Lever, N. Heess, T. Degris, D. Wierstra, M. Riedmiller, "Deterministic Policy Gradient Algorithms," in Proc. ICML, 2014.
[9] S. Gu, T. Lillicrap, I. Sutskever, S. Levine, "Continuous deep Q-learning with model-based acceleration," arXiv preprint arXiv:1603.00748, 2016.
[10] V. Mnih, K. Kavukcuoglu, D. Silver, et al., "Human-level control through deep reinforcement learning," Nature, vol. 518, pp. 529-533, 2015.
