


Reinforcement Learning Approach to Dynamic
Activation of Base Station Resources in Wireless
Networks
Peng-Yong Kong (a) and Dorin Panaitopol (b)
(a) Khalifa University of Science, Technology & Research (KUSTAR), Abu Dhabi, United Arab Emirates.
(b) NEC Technologies (UK), FMDC Lab., ComTech Department, Nanterre, France.

E-mail: pengyong.kong@kustar.ac.ae, dorin.panaitopol@nectech.fr

Abstract—Recently, the issue of energy efficiency in wireless networks has attracted much research attention due to the growing concern over global warming and the operator's profitability. We focus on the energy efficiency of base stations because they account for 80% of the total energy consumed in a wireless network. In this paper, we intend to reduce the energy consumption of a base station by dynamically activating and deactivating the modular resources at the base station depending on the instantaneous network traffic. We propose an online reinforcement learning algorithm that will continuously adapt to the changing network traffic in deciding which action to take to maximize energy saving. As an online algorithm, the proposed scheme does not require a separate training phase and can be deployed immediately. Simulation results have confirmed that the proposed algorithm can achieve more than 50% energy saving without compromising network service quality, which is measured in terms of user blocking probability.

Index Terms—Green wireless networks, energy efficient base station, reinforcement learning, online Q-Learning.

I. INTRODUCTION

The increasingly popular term "green wireless networks" refers to technologically advanced wireless networks with improved energy efficiency [1]. We are interested in improving energy efficiency to reduce carbon emissions and to lower the energy cost of the network operator ([2], [3]). Carbon emission is an important environmental issue because the increasing release of carbon directly into the atmosphere is perceived as the cause of the global warming crisis. In addition to this environmental aspect, escalating energy cost has eaten into the profitability of wireless network operators.

In wireless cellular networks, there are various existing efforts to improve energy efficiency at the base stations because they collectively account for about 80% of the total energy consumption. These efforts can be broadly classified into two categories, namely (a) improving the energy efficiency of the base station itself, and (b) reducing the required number of base stations for each telecommunication network. In the first category, the efforts involve controlling the transmission power more optimally through parameter optimization after taking into account coverage and capacity requirements [4], or re-designing the base stations by using equipment and components that are more energy efficient [5]. According to [5], the power amplifier is the most critical component in a base station as it accounts for almost 40% of total energy consumption, and significant improvement in energy efficiency can be achieved by using an advanced power amplifier. In the second category, the efforts include minimizing the number of base stations deployed by using a higher density of low power micro and pico base stations [6]. These works in both categories consider the telecommunication network to carry a specific fixed volume of traffic, which is generally a representation of the peak-hours scenario. In reality, network traffic is dynamic as it changes temporally from time to time. For instance, network traffic is usually high during office hours when many business activities are carried out, and the network traffic drops to a minimal level at night when most people are asleep.

It is not energy efficient to keep base stations as fully functional as at the peak hours while the actual network traffic is at its minimum. In [7], dynamic network traffic characteristics are exploited in reducing energy consumption by shutting down some base stations during a low traffic period, or by controlling the cell size depending on the traffic load. However, these existing works assume each base station is a single entity which must be controlled as a whole unit. We envisage that future green base stations will have their resources organized as a collection of modular units, each with its own energy consumption profile. These modular resource units can be radios, baseband processors, feeders, power amplifiers, air-conditioners, etc. This modular model is similar to the system model adopted by [8].

With the modular base station, this paper intends to exploit the dynamic nature of network traffic in reducing energy consumption by dynamically activating and deactivating the resource units. We propose a reinforcement learning algorithm for the base station such that it can continuously adapt to the ever-changing network traffic in deciding either to turn on an additional module, to turn off an already activated module, or to maintain the status quo. Reinforcement learning is a machine learning technique that helps an agent decide which action to take in a given environment so as to maximize some notion of cumulative reward [9]. In our context, the actions are what to do with the activation of modular resource units, the environment is the time-varying network traffic, and
the reward is related to the amount of energy saved. To realize reinforcement learning, we have designed an online Q-Learning algorithm, which does not require a separate training phase before deployment.

The rest of this paper is organized as follows. We describe the system model in Section II. In Section III, we propose the online Q-Learning algorithm. Section IV presents and discusses the evaluation results. This paper ends with concluding remarks in Section V.

II. SYSTEM MODEL

We assume a discrete time model for the wireless network where the time domain is divided into repetitive time slots of a fixed duration T, as illustrated in Fig. 1. In the model, system parameters change their values only at the beginning of a time slot.

Fig. 1. System model for dynamic activation of modular resource units at the base station.

We consider a base station with a pool of modular resources, such as transmitters for GSM, carriers for 3G, HSDPA and LTE, relays for IEEE 802.16j, or pico-cells for a very dense network. These modular resources can be activated and deactivated separately from time to time, similar to the assumption adopted in [8]. Let m[n] be the number of activated resource modules at time slot n. These activated resource modules contribute to the total energy consumption of the base station. The base station energy consumption Pbs[n] at time slot n is determined as follows:

Pbs[n] = Pcnt + m[n] × Pr,    (1)

where Pcnt is the base station's fundamental energy consumption within a time slot, determined by the load-independent factors of all equipment. In the equation, Pr is the energy consumption of each module.

Traffic load at a base station is measured in terms of the number of active users. A user becomes active when it is making an outgoing call or receiving an incoming call. When the call ends, the user becomes inactive and is considered departed from the system. Let u[n] be the number of active users in a given time slot n. The value of u[n] depends on u[n − 1], on λ[n], which is the number of newly arrived users in time slot n, and on µ[n], which is the number of newly departed users in time slot n. As such,

u[n] = u[n − 1] + λ[n] − µ[n].    (2)

In our system model, λ[n] and µ[n] are random variables with time-varying average values. For example, as illustrated in Fig. 2, the average value of λ[n] is much higher at 2 PM compared to that at 2 AM. This is a clear representation of different network traffic conditions at different times of a day.

Fig. 2. Average number of new user arrivals in a time slot as a function of time in a day. The horizontal axis indicates the hour with a 24-hour time notation.

Service quality of the wireless cellular network is measured in terms of user blocking probability. Let U be the maximum number of users that can be supported by each resource module. Then, a blocking occurs when a new user becomes active but there are already U × m[n] active users at the base station. It is desirable to have a very low user blocking probability, say 0.01 or lower, at all times for a good network service quality.
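To make the system model concrete, the short Python sketch below implements (1) and (2) together with the blocking condition. The numerical parameter values and the helper names are illustrative assumptions, not values prescribed in this section.

```python
import random

# Illustrative placeholder parameters (not the paper's nominal values).
P_CNT = 100.0   # load-independent consumption per time slot, Pcnt
P_R = 10.0      # consumption of one activated module per time slot, Pr
M = 30          # number of available resource modules
U = 50          # users supported by each activated module

def base_station_energy(m):
    """Eq. (1): Pbs[n] = Pcnt + m[n] * Pr."""
    return P_CNT + m * P_R

def step_users(u_prev, arrivals, departures):
    """Eq. (2): u[n] = u[n-1] + lambda[n] - mu[n], floored at zero."""
    return max(0, u_prev + arrivals - departures)

def is_blocked(u, m):
    """Blocking: more active users than the U * m[n] the activated modules support."""
    return u > U * m

# Toy run with made-up random traffic and all modules activated.
u, m = 0, M
for n in range(5):
    u = step_users(u, arrivals=random.randint(15, 25), departures=random.randint(10, 20))
    print(f"slot {n}: u={u}, Pbs={base_station_energy(m):.1f}, blocked={is_blocked(u, m)}")
```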
III. REINFORCEMENT LEARNING ALGORITHM FOR DYNAMIC RESOURCE ACTIVATION

In this section, we propose a reinforcement learning algorithm to dynamically control the number of activated resource modules to reduce energy consumption without compromising service quality. We first define a baseline scheme as a performance benchmark before describing the proposed algorithm.

A. Baseline scheme

The baseline scheme does not dynamically control the number of activated resource modules, regardless of the actual number of active users in the system. As such, m[n] = M ∀ n, where M is the number of all available resource modules. While the number of activated resource modules is fixed, following (1), the energy consumption at the base station is given below:

Pbs[n] = Pcnt + M × Pr.    (3)

Notice that the energy consumption of the baseline scheme is a constant that does not change from time to time, even in the presence of dynamic load.
B. Reinforcement Learning Algorithm

Recall that a reinforcement learning algorithm helps the base station in controlling the number of activated resource modules given the current system state. Here, the system state is defined by two parameters, namely the number of active users and the number of activated resource modules. Therefore, at time slot n, the system state is given as S[n] = (u[n], m[n]).

At a given system state, there is a set of actions from which the base station may choose one to take. These actions include activating some additional resource modules, turning off some currently activated modules, or simply doing nothing. Let an action a[n] taken in a time slot n be an integer number such that m[n + 1] = m[n] + a[n]. Notice that the action can be either a positive number or a negative number. A negative a[n] gives the number of resource modules to be deactivated. A positive a[n] indicates the number of additional resource modules to be activated. When a[n] = 0, there will be no change in the number of activated resource modules. For a state S[n] = (u[n], m[n]), the set of feasible actions is given as follows: A[n] = {−m[n], −(m[n] − 1), · · · , −1, 0, +1, +2, · · · , +(M − m[n])}, where a[n] ∈ A[n]. This set of feasible actions is formed after observing that the base station cannot turn off more resource modules than are already activated, and it also cannot turn on more resource modules than are currently deactivated. There are always M + 1 feasible actions in A[n]. After taking an action a[n] in state S[n] = (u[n], m[n]), the system transits immediately to state S[n + 1] = (u[n + 1], m[n + 1] = m[n] + a[n]).
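The feasible action set can be written down directly. The following is a minimal sketch, where feasible_actions is a helper name assumed for illustration rather than anything defined in the paper.

```python
def feasible_actions(m, M):
    """Feasible actions in a state with m[n] activated modules out of M in total.

    The base station cannot deactivate more modules than are active, nor
    activate more than are currently off, so A[n] = {-m[n], ..., +(M - m[n])}
    and there are always M + 1 feasible actions.
    """
    return list(range(-m, M - m + 1))

# Example: M = 5 modules with m[n] = 2 activated gives the M + 1 = 6 actions below.
print(feasible_actions(2, 5))   # [-2, -1, 0, 1, 2, 3]
```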
The reinforcement learning algorithm aims to map, for each state, the best action such that the energy saving is maximized. Here, the energy saving is calculated in relation to the baseline scheme. As such, following (1) and (3), the energy saving is given as follows:

Psaved[n] = (M − m[n]) × Pr.    (4)

Reasonably, activating more resource modules will lead to a smaller energy saving. Following (4), our goal is to maximize the average energy saved per time slot as given below:

P̄saved = lim_{k→∞} (1/k) Σ_{n=1}^{k} Psaved[n].    (5)

We further define a numerical reward function to assist the base station in identifying the best action. In time slot n, the reward of taking an action a[n] when the state is S[n] is given as follows:

r(S[n], a[n]) =
    M − m[n] − a[n]   if u[n + 1] ≤ (m[n] + a[n]) × U,
    P                 otherwise.    (6)

As a result of taking the action, the system transitions to state S[n + 1] and the number of activated resource modules becomes m[n + 1] = m[n] + a[n], supporting not more than m[n + 1] × U active users. In this new state, the number of active users is u[n + 1], which is a random variable. We do not assume or require u[n + 1] to follow any distribution function that is known a priori. The base station discovers the value of u[n] by simply looking into the system at time slot n. If the number of active users does not exceed the supported limit, the system is rewarded with the amount of energy saving, as much as M − m[n + 1]. If the number of active users is more than what the activated resource modules can support, user blocking occurs and the system is penalized with a large negative reward, which is P in (6). The penalty is large so that there is no ambiguity in avoiding user blocking. In defining the reward function, the assumption is that by always looking for the highest reward, the base station can achieve the goal of maximizing the average energy saving per time slot while keeping user blocking at a low level.
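The reward (6) translates directly into a small function. The sketch below is an illustration; the default penalty value mirrors the P = −1000 used later in the evaluation.

```python
def reward(m, a, u_next, M, U, P=-1000):
    """Eq. (6): energy-saving reward, or the penalty P when blocking occurs.

    m      : activated modules in the current slot, m[n]
    a      : action taken, a[n]
    u_next : active users observed in the next slot, u[n+1]
    """
    if u_next <= (m + a) * U:
        return M - m - a          # modules left off, i.e. M - m[n+1]
    return P                      # user blocking: large negative reward

# Example with M = 30, U = 50: turning off 2 of 10 modules still covers 380 users.
print(reward(m=10, a=-2, u_next=380, M=30, U=50))   # 22
print(reward(m=10, a=-2, u_next=450, M=30, U=50))   # -1000 (blocking)
```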
The reinforcement learning algorithm must discover which action will yield the most reward by trying the actions and observing their outcomes. In practice, the action taken in a time slot may affect not only the immediate reward but also the next state and, through that, all the subsequent rewards. To implement these processes of trying, observing and looking ahead, there are various existing reinforcement learning techniques, such as SMART, R-SMART and Q-Learning. We adopt Q-Learning for its simplicity [10]. Compared to existing Q-Learning algorithms that require a separate training phase before deployment, we propose an online Q-Learning algorithm where the base station does not have the benefit of a separate training phase, and every action that is taken will incur a real reward. Online Q-Learning is closer to a practical scenario where the user traffic pattern is ever-changing and there is no prior training data.

In online Q-Learning, a quantity called the Q-factor is used to track the implied value of each state-action pair such that, for a given state, the state-action pair with the highest Q-factor indicates the best action. Fig. 3 shows the flow chart we use in updating Q-factors and selecting actions. In the figure, there are four steps which are carried out at the beginning of a time slot. Notice that there is no dedicated training phase and there is an infinite loop after initialization.

Fig. 3. Online Q-Learning for Dynamic Resource Activation.

Initialize Q-factors: In the initialization step, all Q-factors are set to some initial values. Let Q(S, a) be the Q-factor for state S and action a. In the literature, all Q(S, a) are usually initialized to zero. We find that some state-action pairs will lead to user blocking with very high probability. For example, taking an action to deactivate more resource modules is not reasonable when the existing number of active users has already exceeded the supported limit. In other words, taking these actions is likely to yield a negative reward. If we initialize all Q-factors to zero as suggested in the literature, the Q-Learning algorithm will learn about this eventually, but the learning process can take too long for our system with rapidly changing traffic conditions. As such, we propose to initialize the Q-factors as follows:

Q(S = (u, m), a) =
    −1000   if u > (m + a) × U and a < 0,
    0       otherwise.    (7)
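A possible realization of the initialization (7) is sketched below, using a Python dictionary keyed by (state, action) pairs; the data structure is an implementation choice made for this sketch, not something prescribed by the algorithm.

```python
def init_q_factors(M, U, penalty=-1000):
    """Eq. (7): start known-bad deactivations at a large negative Q-factor."""
    Q = {}
    for m in range(M + 1):                      # activated modules, 0..M
        for u in range(M * U + 1):              # active users, 0..M*U
            for a in range(-m, M - m + 1):      # feasible actions in this state
                bad = (u > (m + a) * U) and (a < 0)
                Q[((u, m), a)] = penalty if bad else 0.0
    return Q

# Tiny example: with M = 3, U = 2 and 5 active users, deactivating a module
# from state (5, 3) is pre-marked as bad, while doing nothing is neutral.
Q = init_q_factors(M=3, U=2)
print(Q[((5, 3), -1)], Q[((5, 3), 0)])          # -1000 0.0
```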
Update State: At the beginning of a time slot n, the system advances its epoch and updates its current state. This is done such that the new state resulting from the execution of the selected action in the previous time slot n − 1 becomes the current state.

Select and Take Action: In this step, the ϵ-greedy approach is used in selecting an action. Let ϵ be a very small positive real number with ϵ ≤ 1. Then, the algorithm will select the action with the highest Q-factor value only with probability 1 − ϵ. With probability ϵ, the algorithm chooses, with equal probability, an action from the remaining feasible actions. This is a trade-off between exploration and exploitation, where a higher ϵ will encourage more aggressive exploration for potentially better but yet-to-be-known actions for a given state. Once an action a[n] is selected, it is taken immediately and the system state also changes immediately. After executing the selected action a[n] in time slot n, the algorithm identifies the new system state S[n + 1] = (u[n + 1], m[n + 1]) by looking into the base station for the actual number of active users and the number of activated resource modules. With the information of the new state, the reward r(S[n], a[n]) can be determined by (6).
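A sketch of the ϵ-greedy selection step, assuming Q is a dictionary of Q-factors as in the earlier sketch and actions is the feasible action set A[n] of the current state:

```python
import random

def select_action(Q, state, actions, epsilon=0.05):
    """Epsilon-greedy selection over the feasible actions of the current state."""
    best = max(actions, key=lambda a: Q.get((state, a), 0.0))
    if random.random() < epsilon and len(actions) > 1:
        # Explore: pick uniformly among the remaining feasible actions.
        return random.choice([a for a in actions if a != best])
    # Exploit: take the action with the highest Q-factor (probability 1 - epsilon).
    return best

# Example usage (names from the earlier sketches):
# a = select_action(Q, (5, 3), feasible_actions(3, 3), epsilon=0.05)
```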
Update Q-factors: With the current state S[n] and the new state S[n + 1] identified, the Q-factors are updated as follows:

Q(S[n], a[n]) ← (1 − α) Q(S[n], a[n]) + α ( r(S[n], a[n]) + γ max_{a ∈ A[n+1]} Q(S[n + 1], a) ),    (8)

where α is the learning rate and γ is the discount factor. Both α and γ are positive real numbers not larger than 1. The learning rate affects how aggressively the algorithm adopts a new reward value into its Q-factor. A higher learning rate means the algorithm will adapt to a new environment faster. In the literature, off-line Q-Learning in the training phase requires a diminishing α with increasing epoch. On the other hand, online Q-Learning needs a non-decreasing α bounded away from zero. For simplicity, we choose a fixed α with a non-zero value. The discount factor affects the presence of the sum of all future rewards in the current time slot. When γ = 0.9, only 90% of all the future reward earned through the next state is considered in the current Q-factor value. A very small γ implies that the algorithm is not interested in the future rewards because they may be irrelevant.
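The update (8) then becomes a one-line assignment once the reward and the new state are known; the sketch below assumes the same dictionary representation of the Q-factors used above.

```python
def update_q_factor(Q, state, action, r, next_state, next_actions,
                    alpha=0.5, gamma=0.1):
    """Eq. (8): Q(S[n],a[n]) <- (1-alpha)*Q(S[n],a[n]) + alpha*(r + gamma*max_a Q(S[n+1],a))."""
    best_next = max(Q.get((next_state, a), 0.0) for a in next_actions)
    Q[(state, action)] = ((1.0 - alpha) * Q.get((state, action), 0.0)
                          + alpha * (r + gamma * best_next))
```

One such call per time slot, made after the reward and the new state have been observed, is all the learning the scheme requires.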
The proposed online Q-Learning algorithm has a large state space, given by M × U². A large state space leads to slow learning and convergence in the Q-factors. As such, we further revise the state definition using quantization such that S[n] = (⌈u[n]/N⌉, m[n]), where N is a positive integer indicating how many active users are grouped together to reduce the number of all possible states.
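A small helper for the quantized state, where math.ceil plays the role of ⌈·⌉; the function name is an assumption for illustration.

```python
import math

def quantized_state(u, m, N=25):
    """Quantized state S[n] = (ceil(u[n]/N), m[n]), grouping users in blocks of N."""
    return (math.ceil(u / N), m)

print(quantized_state(u=380, m=10))   # (16, 10)
```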
IV. PERFORMANCE EVALUATION

We have performed extensive Monte Carlo simulations to evaluate the proposed online Q-Learning algorithm. In the simulation, each Q-Learning epoch is aligned with a time slot and the time slot duration is T = 60 seconds. The number of new user arrivals in a time slot n, i.e., λ[n], is a random integer number uniformly distributed within the range [β[n] − 5, β[n] + 5], where β[n] is updated once every hour in the first second of the hour. The updated value of β[n] is determined as follows:

β[n] = 20 + 8 sin( 2π(960 + n) / 1440 ),    (9)

where this expression tries to simulate the time-varying network traffic as illustrated in Fig. 2. The number of user departures in a time slot is a random integer number uniformly distributed within the range [20, 30]. The following nominal values of the various simulation parameters are assumed unless specified otherwise: M = 30, U = 50, N = 25, P = −1000, ϵ = 0.05, γ = 0.1, and α = 0.5. Each simulation is executed for 20,000 epochs and each simulation setting is repeated 5 times. For a setting, the average value of the 5 simulation results is plotted as a single data point in the graphs.
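The simulated traffic can be approximated as follows. Reading the setup as one-minute slots (so 1440 slots per day and an hourly update of β[n]) is our interpretation, and the rounding of the non-integer range endpoints is an assumption not spelled out in the paper.

```python
import math
import random

def beta(n):
    """Eq. (9), held constant within each hour (assuming one-minute slots)."""
    n_hour = (n // 60) * 60
    return 20 + 8 * math.sin(2 * math.pi * (960 + n_hour) / 1440)

def arrivals(n):
    """lambda[n]: uniform integer in [beta[n] - 5, beta[n] + 5] (endpoints rounded)."""
    b = beta(n)
    return random.randint(round(b - 5), round(b + 5))

def departures():
    """mu[n]: uniform integer in [20, 30]."""
    return random.randint(20, 30)

# One simulated day of 1440 one-minute slots.
print(sum(arrivals(n) for n in range(1440)))
```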
Fig. 4. Average energy saved (KJoule) per time slot with different learning rates α, and limits of action space, x.

Fig. 4 shows the average energy saved per time slot at different learning rates and limits of action space, assuming each resource module consumes 1 unit of energy per time slot. The limit of action space x is a constraint imposed on the feasible action set A[n], such that A[n] ← A[n] ∩ {−x, −(x − 1), · · · , 0, +1, · · · , +(x − 1), +x}. The limit of action space is introduced to reduce the number of feasible actions to speed up the online Q-Learning algorithm's convergence. We notice that a higher learning rate leads to a higher energy saving. This is because our network traffic changes rapidly from time to time, and a small learning rate cannot catch up with the change, leading to less ideal action selections. At a sufficiently high learning rate, say α = 0.7, a larger action space results in a lower energy saving. This is because a larger action space implies a higher chance of selecting a poor action in Q-Learning's exploration, and more occurrences of penalties when a bad action is selected. With a high learning rate and a small action space, the proposed Q-Learning algorithm can achieve a very high average energy saving. With M = 30, the energy saving is about 84% (25.2/30 × 100) for α = 0.9 and x = 1.

Fig. 5. Average energy saved (KJoule) per time slot with different learning rates α, and discount factors, γ.

Fig. 5 shows the average energy saved per time slot at different learning rates and discount factors. The results confirm that the energy saving can reach as high as 80% with proper settings. It is desirable for the Q-Learning to have a high learning rate and a low discount factor, implying learning fast while looking not far into the future.

Fig. 6. User blocking probability with different learning rates α, and limits of action space, x.

Fig. 7. User blocking probability with different learning rates α, and discount factors, γ.

Fig. 6 and Fig. 7 show that the significant energy saving can be achieved without compromising the network service quality measured in terms of user blocking probability. In general, the blocking probability can be kept low, at around 0.025. Although this is not as low as the 0.01 that we desire, the slight increase in blocking probability is probably the price to pay for the very significant energy saving.

V. CONCLUSION

We envisage the future green wireless base station to have modular resource units that can be separately and dynamically activated or deactivated depending on the network traffic load. We have proposed an online Q-Learning (reinforcement learning) algorithm to perform the dynamic resource activation in the face of ever-changing network traffic. Simulation results confirm that the proposed scheme can achieve significant energy saving, as much as 80%, without compromising network service quality, since the user blocking probability can be kept as low as 0.025.

REFERENCES

[1] Ziaul Hasan, Hamidreza Boostanimehr and Vijay K. Bhargava, "Green Cellular Networks: A Survey, Some Research Issues and Challenges", IEEE Communications Surveys & Tutorials, Fourth Quarter 2011.
[2] "Energy Aware Radio and NeTwork TecHnologies (EARTH)", http://www.ict-earth.eu/.
[3] "Towards Real Energy-efficient Network Design (TREND)", http://www.fp7-trend.eu/.
[4] Holger Claussen, Lester T. W. Ho and Florian Pivit, "Effects of Joint Macrocell and Residential Picocell Deployment on The Network Energy Efficiency", IEEE PIMRC, September 2008.
[5] Jyrki T. Louhi, "Energy Efficiency of Modern Cellular Base Stations", International Conference on Telecommunications Energy, pp. 475-476, October 2007.
[6] Fred Richter, Albrecht J. Fehske and Gerhard P. Fettweis, "Energy Efficiency Aspects of Base Station Deployment Strategies for Cellular Networks", IEEE Vehicular Technology Conference, September 2009.
[7] Zhisheng Niu, Yiqun Wu, Jie Gong and Zexi Yang, "Cell Zooming for Cost-Efficient Green Cellular Networks", IEEE Communications Magazine, Vol. 48, No. 11, pp. 74-79, November 2010.
[8] Salah-Eddine Elayoubi, Louai Saker and Tijani Chahed, "Optimal Control for Base Station Sleep Mode in Energy Efficient Radio Access Networks", IEEE INFOCOM, pp. 106-110, April 2011.
[9] Richard S. Sutton and Andrew G. Barto, "Reinforcement Learning: An Introduction", The MIT Press, Cambridge, Massachusetts, United States of America, 1998.
[10] Christopher J. C. H. Watkins and Peter Dayan, "Technical Note: Q-Learning", Machine Learning, Vol. 8, pp. 279-292, 1992.
