Low Complexity Online Radio Access

You might also like

Download as pdf or txt
Download as pdf or txt
You are on page 1of 14

376 IEEE TRANSACTIONS ON MOBILE COMPUTING, VOL. 19, NO.

2, FEBRUARY 2020

Low Complexity Online Radio Access


Technology Selection Algorithm
in LTE-WiFi HetNet
Arghyadip Roy , Student Member, IEEE, Vivek Borkar , Fellow, IEEE,
Prasanna Chaporkar , Member, IEEE, and Abhay Karandikar ,, Member, IEEE

Abstract—In an offload-capable Long Term Evolution (LTE)- Wireless Fidelity (WiFi) Heterogeneous Network (HetNet), we consider
the problem of maximization of the total system throughput under voice user blocking probability constraint. The optimal policy is
threshold in nature. However, computation of optimal policy requires the knowledge of the statistics of system dynamics, viz., arrival
processes of voice and data users, which may be difficult to obtain in reality. Motivated by the Post-Decision State (PDS) framework to
learn the optimal policy under unknown statistics of system dynamics, we propose, in this paper, an online Radio Access Technology
(RAT) selection algorithm using Relative Value Iteration Algorithm (RVIA). However, the convergence speed of this algorithm can be
further improved if the underlying threshold structure of the optimal policy can be exploited. To this end, we propose a novel structure-
aware online RAT selection algorithm which reduces the feasible policy space, thereby offering lesser storage and computational
complexity and faster convergence. This algorithm provides a novel framework for designing online learning algorithms for other
problems and hence is of independent interest. We prove that both the algorithms converge to the optimal policy. Simulation results
demonstrate that the proposed algorithms converge faster than a traditional scheme. Also, the proposed schemes perform better than
other benchmark algorithms under realistic network scenarios.

Index Terms—User association, LTE-WiFi offloading, constrained MDP, threshold policy, stochastic approximation

1 INTRODUCTION

R ECENT developments in wireless communications have


witnessed a proliferation of end-user equipment, such
as tablets and smartphones, with advanced capabilities.
another in a seamless manner. Recent advances in Soft-
ware Defined Networking (SDN) [4], [5] paradigm also
facilitates these mechanisms by bringing more flexibility
The ever-increasing demand for Quality of Service (QoS) in the integrated control and management of various
requirement of users gives rise to the standardization of var- RATs.
ious Radio Access Technologies (RATs) [2], such as Univer- In a Third Generation Partnership Project (3GPP) LTE
sal Mobile Telecommunications System (UMTS), Long HetNet, interworking with IEEE 802.11 [6] Wireless Local
Term Evolution (LTE) etc. According to [3], monthly Area Network (WLAN) (popularly known as Wireless
global mobile data traffic is expected to exceed 49 exa- Fidelity (WiFi)) operating in unlicensed band provides a
bytes by 2021. One of the basic limitations of the present potential solution due to the complementary nature of these
RATs is the lack of support for the co-existence of multi- RATs. The LTE Base Stations (BSs) can be deployed to pro-
ple RATs. Therefore, to address this exceptional growth vide ubiquitous coverage, whereas the WiFi Access Points
of data traffic, network providers have expressed their (APs) offer high bit rate solution in hot-spot regions. A user
interest towards efficient interworking among various can be associated with either LTE or WiFi in an area where
RATs in the context of upcoming Fifth Generation (5G) their coverage overlaps. Furthermore, according to 3GPP
[2] network. This gives rise to the notion of Heteroge- Release 12 specifications [7], data users can be steered from
neous Network (HetNet), where different RATs coexist one RAT to another. This mechanism is known as mobile
and interwork with each other. Users can be associated data offloading.
with any candidate RAT and moved from one RAT to Intelligent RAT selection1 and offloading decisions [8],
[9], [10], [11], [12], [13], [14], [15], [16], [17], [18] may lead to
efficient resource utilization and can be triggered either by
 A. Roy, V. Borkar, and P. Chaporkar are with the Department of Electrical the user or by the network. User-initiated decisions are
Engineering, Indian Institute of Technology Bombay, Mumbai, Mahara-
shtra 400076, India. E-mail: {arghyadip, borkar, chaporkar}@ee.iitb.ac.in. taken with an objective of optimizing individual utilities
 A. Karandikar is with the Department of Electrical Engineering, Indian and hence, may not converge to the global optimality.
Institute of Technology Bombay, Mumbai, Maharastra 400076, India, and Therefore, we focus on network-initiated association and
is the director, Indian Institute of Technology Kanpur, Kanpur, Uttar Pra- offloading schemes which target to optimize different net-
desh 208016, India. E-mail: karandi@ee.iitb.ac.in, karandi@iitk.ac.in. work level system parameters.
Manuscript received 24 Aug. 2018; revised 17 Dec. 2018; accepted 7 Jan.
2019. Date of publication 14 Jan. 2019; date of current version 6 Jan. 2020.
(Corresponding author: Arghyadip Roy.) 1. The terminologies “association” and “RAT selection” are used
Digital Object Identifier no. 10.1109/TMC.2019.2892983 interchangeably throughout this paper.

1536-1233 ß 2019 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See ht_tp://www.ieee.org/publications_standards/publications/rights/index.html for more information.

Authorized licensed use limited to: UNIVERSITY PUTRA MALAYSIA. Downloaded on March 10,2020 at 05:36:03 UTC from IEEE Xplore. Restrictions apply.
ROY ET AL.: LOW COMPLEXITY ONLINE RADIO ACCESS TECHNOLOGY SELECTION ALGORITHM IN LTE-WIFI HETNET 377

1.1 Related Work


RAT selection and offloading techniques in HetNets can be
either user-initiated [8], [9], [10], [11], [12], [13] or network-
initiated [14], [15], [16], [17], [18], [19], [21], [22] in nature. In
[8], authors propose a user-initiated RAT selection algo-
rithm in an LTE-WiFi HetNet. The proposed algorithm
addresses the trade-off between resource utilization and
QoS. The problem where each user selfishly chooses a RAT
with an objective of maximizing its individual throughput
Fig. 1. LTE-WiFi HetNet architecture.
is considered in [10] and formulated as a non-cooperative
game. Lee et al. [11] proposes a heuristic “on-the-spot off-
loading” scheme where a data user is always steered to
WiFi, whenever it is inside the coverage area of WiFi APs.
In this paper, we propose online learning algorithms to
Among the network-initiated approaches, in [15], opti-
obtain the optimal RAT selection policy for an LTE-WiFi Het-
mal client-AP association problem in a WLAN is considered
Net. The model considered in this paper is similar to that of
within the framework of continuous-time MDP, and user
[19]. As illustrated in Fig. 1, the centralized controller (for
arrival/departure is taken as a feasible decision epoch. In
example, an SDN controller) possesses an overall view of
[22], authors propose an optimal RAT selection algorithm
both the networks and is responsible for taking network-initi-
aiming at maximizing the generated revenue. However, the
ated RAT selection and offloading decisions. We consider
proposed algorithm scales exponentially with the system
voice and data users inside the LTE-WiFi HetNet. Voice users
size. Authors in [16] propose distributed association algo-
are always associated with LTE since voice users require QoS
rithms by formulating the association problem as a non-
guarantee which may not be provided by WiFi. We assume
cooperative game and compare the performance with the
that the data users can be associated with either LTE or WiFi.
centralized globally optimal scheme. A context-aware RAT
Additionally, we assume that data users can be offloaded
selection algorithm is proposed in [17]. The proposed algo-
from one RAT to another, whenever a voice user is associ- rithm, which can be implemented on the user side, albeit
ated, or an existing voice/data user departs. with network assistance, minimizes signaling overhead as
Total system throughput is an important system parameter well as base station computations.
from a network operator’s perspective since the throughput While the above solutions provide significant insight into
experienced by the data users may have a significant impact RAT selection and offloading strategies, none of them spe-
on the profit and customer base of the operator. From the per- cifically focus on computational efficiency. Therefore, prac-
spective of maximization of total system throughput, WiFi tical implementations of the proposed algorithms become
generally provides more throughput than that of LTE when infeasible. There are two main driving factors behind this
WiFi load is less. However, the total throughput in WiFi issue. First, standard dynamic programming techniques
decreases as the load increases [20]. Therefore, although in to solve optimal RAT selection and offloading problems
low load, WiFi may be preferable to data users for association, become computationally inefficient and thus difficult to
as the load increases, LTE may be preferred. LTE resources implement when the state space is large. Furthermore, we
are shared between voice and data users from a common pool need to know the statistics of arrival processes of data
of resources. Throughput requirement of the LTE data users and voice users which govern the transition probabilities
is usually more than that of the voice users. Therefore, LTE between different states, in order to determine the optimal
may reserve some of the resources, which otherwise would policy. In practice, this may be difficult to obtain. Recent
not be efficiently utilized by low throughput voice users, for studies [23], [24], [25] on the characteristics of cellular traffic
data users. Thus, total system throughput maximization may reveal that although the voice traffic can be predicted accu-
result in excessive blocking of voice users. Roy et al. [19] rately, the state-of-art prediction schemes for data traffic are
addresses this inherent trade-off between the total system not very satisfactory. Therefore, obtaining real-time traffic
throughput and blocking probability of voice users by formu- statistics in today’s cellular network is very difficult.
lating this problem as a constrained average reward continu- In the case of unknown system statistics, we may resort to
ous time Markov Decision Process (MDP), which maximizes Reinforcement Learning (RL) [26] algorithms which deter-
the total system throughput subject to a constraint on the voice mine the optimal policy using a trial-and-error method.
user blocking probability. In this paper, we propose online Refs. [1], [21], [27], [28] adopt RL based schemes which can be
learning algorithms which can be implemented without the implemented online. Q-learning [1], [21], [27] is a traditional
knowledge of statistics of arrival processes of voice and data RL algorithm which learns the optimal policy under
users to obtain the optimal RAT selection policy for an LTE- unknown system dynamics. Our preliminary work [1] under-
WiFi HetNet. takes a Q-learning based approach to determine the optimal
One of the main advantage of learning is that it avoids policy. Authors in [21] aim to improve the network perfor-
explicit estimation which may have high variance or may be mance and user experience jointly and formulate the RAT
computationally prohibitive. It replaces the conditional selection problem as a Semi-Markov Decision Process
averaging in an iterative scheme for solving the dynamic (SMDP) problem. A Q-learning based approach is also
programming equation by an actual evaluation at an adopted since network parameters may be difficult to obtain
observed transition and an incremental update which does in reality. Based on locally available information at users and
the conditional averaging implicitly. Also, our scheme has following a Q-learning approach, [27] undertakes distributed
the advantage that it exploits the known threshold structure traffic offloading decisions. Although the convergence to opti-
of the optimal policy to further reduce the computational mality is guaranteed, these learning schemes need to itera-
complexity, and is one of the first to do so. tively learn the value functions for all state-action pairs, thus

Authorized licensed use limited to: UNIVERSITY PUTRA MALAYSIA. Downloaded on March 10,2020 at 05:36:03 UTC from IEEE Xplore. Restrictions apply.
378 IEEE TRANSACTIONS ON MOBILE COMPUTING, VOL. 19, NO. 2, FEBRUARY 2020

possessing large memory requirement. Additionally, due to  We further exploit the threshold nature of optimal
associated exploration mechanism, their convergence rate is policies [19] and propose a novel structure-aware
slow, especially under large state space. online association algorithm. Theoretical and simula-
tion results indicate that the knowledge of threshold
1.2 Our Contribution property helps in achieving reductions in storage
In this paper, our primary contribution is to propose online and computational complexity as well as in conver-
learning algorithms to maximize the total system through- gence time. We also prove that the proposed scheme
put subject to a constraint on the voice user blocking proba- converges to the true value function.
bility without knowing the statistics of arrival processes of  The proposed structure-aware algorithm provides a
voice/data users in an LTE-WiFi HetNet. To address the novel framework that can be applied for designing
issue of slow convergence of existing learning schemes in online learning algorithms for other problems and
the literature, in this paper, we propose a Post-Decision hence is of independent interest.
State (PDS) learning algorithm which speeds up the learn-  Performances of the proposed algorithms are com-
ing process by removing the action exploration. This pared with other online algorithms in the literature [1].
approach is based on reformulation of the Relative Value  Performances of the proposed algorithms are com-
Iteration Algorithm (RVIA) equation and can be imple- pared with other benchmark RAT selection algo-
mented online in the Stochastic Approximation (SA) frame- rithms under realistic scenarios.
work. Furthermore, the PDS learning algorithm has a lower The rest of the paper is organized as follows. Section 2
space complexity than that of Q-learning [1] because instead describes the system model. In Section 3, the problem for-
of the state-action pair values, we need to store the value
mulation within the framework of constrained MDP is
functions associated with states alone. We also prove the
described. The system model and formulation adopted in our
convergence of the PDS learning RAT selection algorithm to
the optimality. paper is analogous to [1], [14], [19]. The developments
We have shown in [19] that the optimal policy has a described after Section 3 is our point of departure. We intro-
threshold structure, wherein after a certain threshold on the duce the notion of PDS in Section 4. Sections 5 and 6 propose
number of WiFi data users, data users are served using PDS learning algorithm and structure-aware learning
LTE. A similar property exists for the admission of voice algorithm, respectively, for RAT selection in an LTE-WiFi
users [19], where after a certain threshold on the number of HetNet. A comparison of computational and storage com-
LTE data and voice users, voice users are blocked. In this plexities of the proposed and traditional algorithms is pro-
paper, we exploit the threshold properties in [19] and pro- vided in Section 7. Simulation results are presented in
pose a structure-aware learning algorithm which, instead of Section 8, followed by conclusions in Section 9. The proofs are
the entire policy space, searches the optimal policy only available in the supplemental materials file, which is available
from the set of threshold policies. This reduces the conver- in the IEEE Computer Society Digital Library at http://doi.
gence time as well as the computational and storage com- ieeecomputersociety.org/10.1109/TMC.2019.2892983.
plexity in comparison to that of the proposed PDS learning
algorithm. We prove that the threshold vector iterates in the 2 SYSTEM MODEL
proposed structure-aware learning algorithm indeed con-
verge to the globally optimal solution. Note that the analyti- As demonstrated in Fig. 1, we consider a system consisting
cal methodologies presented in this paper to learn the of a WiFi AP inside the coverage area of an LTE BS, both
optimal threshold policy are developed independently and connected to a centralized controller using ideal lossless
can be applied to any learning problem where the optimal links. We assume that voice and data users are present at
policy is threshold in nature. any geographical point in the coverage area of the LTE BS.
Although we make some simplifying assumptions to Data users outside the common coverage area of the LTE BS
facilitate the analysis, performance of the proposed schemes and the WiFi AP always get associated with the LTE BS.
are studied in realistic conditions without the simplifying Therefore no decision is involved in this case. Hence with-
assumptions. Extensive simulations are conducted in ns-3 out loss of generality, we take into account only those data
[29], a discrete event network simulator, to characterize the users which are present in the dual coverage area of the
performance of the proposed algorithms. It is observed LTE BS and the WiFi AP. Data users can be associated with
through simulations that the proposed structure-aware either the LTE BS or the WiFi AP. We assume that in LTE,
RAT selection online learning algorithm outperforms the voice and data users are allocated resources from a common
PDS learning algorithm, providing faster convergence to resource pool. We assume that voice and data user arrivals
optimality. Performance comparison of the proposed algo- are Poisson processes with means v and d , respectively.
rithms is made with Q-learning based RAT selection Service times for voice and data users follow exponential
algorithm [1]. Furthermore, we observe that the proposed distributions with means m1v and m1 , respectively. Assump-
d
algorithms outperform other benchmark algorithms under tions on service times follow justifications in [30]. All the
realistic network scenarios like presence of channel fading, users are assumed to be stationary.
dynamic resource scheduling and user mobility. We use
3GPP recommended parameters for the simulations. Remark 1. For brevity of notation, we have considered a
Our key contributions can be summarized as follows. single LTE BS and a single WiFi AP. However, the system
model can be generalized to multiple LTE BSs and WiFi
 Based on the PDS paradigm, we propose an online APs with small modifications. When the coverage areas
algorithm for RAT selection in an LTE-WiFi HetNet. of multiple APs/BSs do not overlap, we need to consider
The convergence proof for the proposed algorithm is the number of users in each AP/BS in the state space. In
provided. case of multiple overlapping APs/BSs, we can construct a

Authorized licensed use limited to: UNIVERSITY PUTRA MALAYSIA. Downloaded on March 10,2020 at 05:36:03 UTC from IEEE Xplore. Restrictions apply.
ROY ET AL.: LOW COMPLEXITY ONLINE RADIO ACCESS TECHNOLOGY SELECTION ALGORITHM IN LTE-WIFI HETNET 379

one-to-one mapping between a user location and an TABLE 1


AP/a BS using a simple criterion such as highest aver- Transition Probability Table
age signal strength, etc. Using this criterion for multiple
overlapping APs, the problem can be reduced to single ajEl ði0 ; j0 ; k0 Þ
BS and multiple non-overlapping AP problem. The set A1 jðE1 jj . . . jjE5 Þ ði; j; kÞ
where more than one APs have equal signal strength is A2 jE1 ði þ 1; j; kÞ
non-generic in the sense that in the associated parameter A2 jE2 ði; j þ 1; kÞ
space it has Lebesgue measure 0. A3 jE2 ði; j; k þ 1Þ
A4 jE1 ði þ 1; j  1; k þ 1Þ
A5 jðE3 jjE4 Þ ði; j þ 1; k  1Þ
2.1 State & Action Space A5 jE5 ði; j  1; k þ 1Þ
The system can be modeled as a controlled continuous
time stochastic process fXðtÞgt0 . Any state s in the state
space S is symbolically represented as s ¼ ði; j; kÞ, where
i; j represent the number of voice and data users in LTE, case of departure of voice/data users, the set of all possible
respectively. The number of data users in WiFi is denoted actions is fA1 ; A5 g. However, depending on the system
by k. The arrival and departure of voice and data users are state, one or more actions may be infeasible. Note that
taken as decision epochs. Due to the Markovian nature of blocking is considered as a feasible action for voice (data)
the system, it suffices to observe the system state at each of users, only when the system is non-empty (the capacity is
the decision epochs. Note that the system state changes only reached for both LTE and WiFi).
at these decision epochs. Therefore, it is not required to con-
sider the system state at other points in time. 2.2 State Transitions
The state of the system changes whenever there is an From each state s, under an action a, the system makes a
arrival or a departure of voice/data users, referred to as transition to a different state s0 with positive probability
events. We consider five type of events, namely, an arrival pss0 ðaÞ. In state s ¼ ði; j; kÞ, let the sum of arrival and service
of a voice user in the system (E1 ), an arrival of a data user in rates of users be denoted by vði; j; kÞ. Therefore, we have
the system (E2 ), a departure of an existing voice user from
LTE (E3 ), a departure of an existing data user from LTE (E4 ) vði; j; kÞ ¼ v þ d þ imv þ jmd þ kmd :
and a departure of an existing data user from WiFi (E5 ). At
every decision epoch, the centralized controller takes a deci- Then
sion based on the present system state. Based on the chosen 8 
>
>
v
; s0 ¼ ði0 ; j0 ; k0 Þ;
action, the system moves to different states with finite > vði0 ;j0 ;k0 Þ
>
>
>
probabilities. >
>
 d
vði0 ;j0 ;k0 Þ ; s0 ¼ ði0 ; j0 ; k0 Þ;
>
< 0
We assume that ði; j; kÞ 2 S if ði þ jÞ  C and k  W , im
where C is the total number of common resource blocks for pss0 ðaÞ ¼ vði0 ;j0v;k0 Þ ; s0 ¼ ði0  1; j0 ; k0 Þ;
>
>
voice and data users in LTE resource pool, and W denotes the >
> j00 m0d 0 ;
>
> s0 ¼ ði0 ; j0  1; k0 Þ;
maximum number of users in WiFi to guarantee that the aver- > vði ;j ;k Þ
> 0
>
: k md ;
age per-user throughput in WiFi is more than a certain thresh- vði0 ;j0 ;k0 Þ s0 ¼ ði0 ; j0 ; k0  1Þ:
old. Note that, the per-user throughput in WiFi monotonically
decreases with the number of WiFi data users [20]. The first Values of i0 ; j0 and k0 as a function of different actions a
condition signifies that each admitted user is provided a sin- (conditioned on events El ) are summarized in Table 1.
gle resource block, whenever resources are available.
Remark 2. For sake of simplicity, we have considered sin- 2.3 Rewards and Costs
gle resource block allocation to LTE users. Although the Let the reward and cost functions per unit time for state s
consideration of multiple resource blocks mimics the and action a be denoted by rðs; aÞ and cðs; aÞ, respectively.
practical scenario better, it complicates the system model We assume that before the data transfer starts, a Transmission
without bringing much difference in the formulation Control Protocol (TCP) connection is built between the data
developed in this paper. user and the LTE BS (WiFi AP). Let RL;V and RL;D denote the
bit rates of voice and data users in LTE, respectively. We
Let the action space, i.e., the set of all possible association assume that RL;D is the maximum data rate provided to the
and offloading strategies in case of arrival and departure of data users by the TCP pipe in LTE. In LTE network, generally
users, be denoted by A. We describe A below: voice user generates traffic at Constant Bit Rate (CBR). There-
8 fore, we assume RL;V to be a constant. RW;D ðkÞ denotes the
>
> A1 ; Block the arriving user or do nothing per-user data throughput of k users in WiFi, assuming the full
>
>
>
> during departure; buffer traffic WiFi model [20]. The calculation of RW;D ðkÞ takes
>
>
> A2 ;
> Accept voice/data user in LTE, into account factors like the contention-based medium access
<
A3 ; Accept data user in WiFi, of WiFi users, success and collision probabilities as well as

>
> A4 ; Accept voice user in LTE and offload one slot times for successful transmission, idle slots and busy slots
>
>
>
> data user to WiFi; corresponding to collisions.
>
>
>
> A ; Move one data user to a RAT (from which The reward rate for a state-action pair, which is a
: 5
departure has occurredÞ: monotone increasing function of the number of voice
and data users in LTE, is defined as the total throughput
In case of voice and data user arrivals, the sets of all possible of the system in the state under the corresponding
actions are fA1 ; A2 ; A4 g and fA1 ; A2 ; A3 g, respectively. In action. The entire description of reward rates in state

Authorized licensed use limited to: UNIVERSITY PUTRA MALAYSIA. Downloaded on March 10,2020 at 05:36:03 UTC from IEEE Xplore. Restrictions apply.
380 IEEE TRANSACTIONS ON MOBILE COMPUTING, VOL. 19, NO. 2, FEBRUARY 2020

TABLE 2 3.2 Equivalent Discrete-Time MDP and Lagrangian


Reward Rate Table Approach
To obtain the optimal policy using RVIA [32], we need to
ðajEl Þ rðs; aÞ employ the Lagrangian approach [31]. In this approach, for
ðA1 jE1 Þ iRL;V þ jRL;D þ kRW;D ðkÞ a fixed value of Lagrange Multiplier (LM) b, the reward
ðA1 jE2 Þ iRL;V þ jRL;D þ kRW;D ðkÞ function is given by
ðA1 jE3 Þ ði  1ÞRL;V þ jRL;D þ kRW;D ðkÞ
ðA1 jE4 Þ iRL;V þ ðj  1ÞRL;D þ kRW;D ðkÞ rðs; a; bÞ ¼ rðs; aÞ  bcðs; aÞ:
ðA1 jE5 Þ iRL;V þ jRL;D þ ðk  1ÞRW;D ðk  1Þ
ðA2 jE1 Þ ði þ 1ÞRL;V þ jRL;D þ kRW;D ðkÞ The dynamic programming equation described below pro-
ðA2 jE2 Þ iRL;V þ ðj þ 1ÞRL;D þ kRW;D ðkÞ vides the necessary condition for optimality in case of
ðA3 jE2 Þ iRL;V þ jRL;D þ ðk þ 1ÞRW;D ðk þ 1Þ SMDP 8s 2 S, where s0 2 S
ðA4 jE1 Þ ði þ 1ÞRL;V þ ðj  1ÞRL;D þ ðk þ 1ÞRW;D ðk þ 1Þ
X
ðA5 jE3 Þ ði  1ÞRL;V þ ðj þ 1ÞRL;D þ ðk  1ÞRW;D ðk  1Þ V ðsÞ ¼ max½rðs; a; bÞ þ pss0 ðaÞV ðs0 Þ  rtðs; aÞ;
ðA5 jE4 Þ iRL;V þ jRL;D þ ðk  1ÞRW;D ðkÞ a
s0
ðA5 jE5 Þ iRL;V þ ðj  1ÞRL;D þ kRW;D ðkÞ
where V ðsÞ; r; tðs; aÞ denote the value function of state
s 2 S, the optimal average reward and the mean transition
s ¼ ði; j; kÞ for different events and actions is provided in time from state s under the action a, respectively.
Table 2. Whenever the centralized controller blocks an Since the sojourn times are exponential, this is a special
incoming voice user, the cost rate is one unit, else it is case of continuous time controlled Markov chain for which
zero. Therefore we have
 X
1; if voice user is blocked; 0 ¼ max½rðs; a; bÞ  r þ qðs0 js; aÞV ðs0 Þ; (2)
cðs; aÞ ¼ a
s0
0; otherwise:
where qðs0 js; aÞ are controlled P transition rates satisfying
qðs0 js; aÞ  0 for s0 6¼ s and s0 qðs0 js; aÞ ¼ 0. If we scale all
3 PROBLEM DESCRIPTION
the transition rates by a positive scalar, it amounts to time
Association policy is a sequence of decision rules which scaling which scales the average reward accordingly for
describes actions to be chosen at different states and every policy including the optimal, but does not change the
decision epochs. Maximization of total system through- optimal policy. Thus, without loss of generality, we assume
put may result in blocking of voice users as the contribu- that qðsjs; aÞ 2 ð0; 1Þ 8a, implying in particular that
tion of voice users towards the total system throughput qðs0 js; aÞ 2 ½0; 1 for s0 6¼ s. Adding V ðsÞ to both sides of
is less than that of data users. Therefore, to address the Equation (2), we have
inherent trade-off between the total system throughput X
and the voice user blocking probability, we aim to obtain V ðsÞ ¼ max½rðs; a; bÞ  r þ pss0 ðaÞV ðs0 Þ; (3)
a
an association policy which maximizes the total system s0
throughput, subject to a constraint on the voice user
blocking probability. Since arrival and departure of users where pss0 ðaÞ ¼ qðs0 js; aÞ for s0 6¼ s and pss0 ðaÞ ¼ 1 þ qðs0 js; aÞ
can happen at any point in time, this problem can be for s0 ¼ s (recall that qðsjs; aÞ is negative). This equation is
formulated as a constrained continuous time MDP. It is the DP equation for a discrete time MDP (say fXn g) with
well-known that a randomized stationary optimal policy, controlled transition probabilities pss0 ðaÞ. Here onwards we
i.e., a mixture of pure policies with associated probabili- focus on discrete time setting as described in Equation (3).
ties, exists [31]. For a fixed value of b, the following equation describes
how RVIA can be used to solve the equivalent uncon-
strained maximization problem
3.1 Problem Formulation
Let M be the set of all memoryless policies. We assume that X
Vnþ1 ðsÞ ¼ max½rðs; a; bÞ þ pss0 ðaÞVn ðs0 Þ  Vn ðs Þ; (4)
the Markov chains constructed under such policies are uni- a
s0
chain and therefore have unique stationary distributions. Let
the average reward and cost of the system over infinite hori- where Vn ð:Þ is an estimate of the value function after nth
zon under the policy M 2 M be denoted by V M and BM , iteration and s is an arbitrary but fixed state. Next, we aim
respectively. Let RðtÞ and CðtÞ be the total reward and cost to determine the value of b (¼ b , say) which maximizes the
of the system up to time t, respectively. For the constrained average expected reward, subject to the cost constraint. The
MDP problem, our objective is described as follows: following gradient descent algorithm describes the rule to
update the value of b
1
Maximize: V M ¼ lim EM ½RðtÞ; 1
t!1 t bkþ1 ¼ bk þ ðBpbk  Bmax Þ;
(1) k
1
subject to: BM ¼ lim EM ½CðtÞ  Bmax ; where bk is the value of b in kth iteration, and Bpbk is the
t!1 t
voice user blocking probability at kth iteration. Once the
where EM denotes the expectation operator under policy M, value of b is determined, we obtain the optimal policy by
and Bmax denotes the constraint on the blocking probability employing a perturbation of b by a small amount  in both
of voice users. As we know that the optimal policies are sta- directions (policies pb  and pb þ , say) with associated
tionary, the limits in Equation (1) exist. costs Bb  and Bb þ , respectively. Finally, we have a

Authorized licensed use limited to: UNIVERSITY PUTRA MALAYSIA. Downloaded on March 10,2020 at 05:36:03 UTC from IEEE Xplore. Restrictions apply.
ROY ET AL.: LOW COMPLEXITY ONLINE RADIO ACCESS TECHNOLOGY SELECTION ALGORITHM IN LTE-WIFI HETNET 381

state s0 be denoted by pð^ s; s0 Þ. The post-decision Bellman


equation for the post-decision state s^ ¼ ð^x; y^; z^Þ 2 S is
X
0 0
V ð^
^ sÞ ¼ s; s Þ max½rðs ; a; bÞ þ V^ðs^0 Þ  r;
pð^ (5)
a
s0
Fig. 2. Transition among PDSs and pre-decision states.
where s^0 is the post-decision state when action a in chosen in
pre-decision state s0 . Using Equation (5), the RVIA based
randomized optimal policy where the the policies pb  and
update rule is as follows:
pb þ are chosen with probability p and ð1  pÞ, such that
X
V^nþ1 ð^
sÞ ¼ pð^ s; s0 Þ max½rðs0 ; a; bÞ þ V^n ðs^0 Þ  V^n ðs^ Þ;
pBb  þ ð1  pÞBb þ ¼ Bmax : s 0
a
(6)
We know [33] that the optimal stationary policy can be ran- V^nþ1 ðs^00 Þ ¼V^n ðs^00 Þ 8s^00 6¼ s^;
domized in at most one s 2 S where the optimal action is
randomized between two actions. where s^ is the PDS associated with the nth decision epoch
The gradient descent scheme for b converges to b . In and s^ is a fixed PDS. The idea is to update one component
Equation (1), instead of considering limiting values of point- at a time and keep the others unchanged. This idea is trans-
wise reward and cost, we consider limiting values of average lated into an online algorithm stated below which updates
reward and cost. Therefore, no relaxation of the constraint is the value function of the PDS associated with the current
performed, and the obtained solution is optimal. decision epoch.

4 POST-DECISION STATE FRAMEWORK 5 ONLINE RAT SELECTION ALGORITHM


As discussed in the previous section, the optimal policy can The system changes states based on different events, i.e., the
be determined using RVIA (Equation (4)), if the transition arrival/departure of users and various actions taken in dif-
probabilities pss0 ðaÞ are known beforehand. However, ferent states. Since we do not know the arrival rates of voice
knowledge of transition probabilities requires the knowl- and data users, and the max operator occurs outside the
edge of statistics of arrival processes of voice and data users. averaging operation with respect to the transition probabili-
In reality, it may be difficult to obtain these parameters [23], ties of the underlying Markov chain in Equation (4), online
[24], [25]. Therefore we aim to devise an online scheme implementation of the same is not feasible. However, in
which does not require the knowledge of statistics of arrival Equation (6), the expectation operation which resides out-
process and can still converge to the optimal solution. How- side the max operation can be replaced by averaging over
ever, before describing the online algorithm, we introduce time to estimate the optimal value function of the PDSs.
the notion of the PDS. Using the theory of SA [34], we can remove the expectation
A PDS is defined to be an imaginary state of the system operation in Equation (6) and still converge to optimality in
just after an action is chosen and before the unknown sys- policy by doing averaging over time.
tem dynamics (noise) adds into the system. The idea behind Let gðnÞ be a positive step-size sequence possessing the
PDS is to factor the transition from one state to another into following properties:
known and unknown components. The known component X
1 X
1
consists of the effect of the action taken in a state, whereas gðnÞ ¼ 1; ðgðnÞÞ2 < 1: (7)
the unknown component comprises the unknown random n¼1 n¼1
dynamics of the system (viz., the arrival and departure of Let hðnÞ be another step-size sequence having the same
voice and data users). Let us assume that the state of the sys- properties as that of Equation (7) along with the additional
tem is s ¼ ði; j; kÞ 2 S at some decision epoch. Based on the property
chosen action, the system moves to the post-decision state
s^ ¼ ði;
^ j; ^ 2 S. Based on the next event, the system moves
^ kÞ hðnÞ
lim ! 0: (8)
to the actual pre-decision state s0 ¼ ði0 ; j0 ; k0 Þ 2 S. Through- n!1 gðnÞ
out this paper, whenever we refer to a “state”, we always
refer to a pre-decision state. An example transition involv- The key idea is to update the value function associated with
ing pre-decision states and PDSs is illustrated in Fig. 2. one PDS at a time and keep the other PDS values unchanged.
Under action A3 , the system makes a transition from state Let Yn be the PDS P which is updated at nth iteration. Also,
s ¼ ði; j; kÞ to PDS s^ ¼ ði; j; k þ 1Þ. Under the next event E3 , s; nÞ ¼ nm¼0 If^
define gð^ s ¼ Yn g, i.e., number of times PDS s^
the system moves from the PDS s^ to the pre-decision state is updated till nth iteration. The scheme is as follows:
s0 ¼ ði  1; j; k þ 1Þ. In other words, the known information
regarding the transition from state s to s0 is incorporated in V^nþ1 ð^
sÞ ¼ ð1  gðgð^
s; nÞÞÞV^n ð^ s; nÞÞfmax½rðs0 ; a; bÞ
sÞ þ gðgð^
a
PDS s^. On the other hand, transition from PDS s^ to state s0 þ V^n ðs^0 Þ  V^n ðs^ Þg; (9)
consists only of the unknown system dynamics which is not
included in the PDS. Let V^ð^ sÞ be the value function associ- Vnþ1 ðs Þ ¼ V^n ðs^00 Þ 8s^00 6¼ s^:
^ ^00

ated with the PDS s^ 2 S. Thus, we have However, the scheme (9) is a primal RVIA algorithm which
solves a dynamic programming equation for a fixed value
sÞ ¼ Es0 ½V ðs0 Þ;
V^ð^ of LM b. To obtain optimality in b, b is to be iterated along
the timescale hðnÞ, as described below:
where the expectation Es0 is taken over all the pre-decision
states which are reachable from the post-decision state s^. bnþ1 ¼ L½bn þ hðnÞðBn  Bmax Þ; (10)
Let the transition probability from PDS s^ to pre-decision

Authorized licensed use limited to: UNIVERSITY PUTRA MALAYSIA. Downloaded on March 10,2020 at 05:36:03 UTC from IEEE Xplore. Restrictions apply.
382 IEEE TRANSACTIONS ON MOBILE COMPUTING, VOL. 19, NO. 2, FEBRUARY 2020

where the projection operator L projects the value of LM to utilizing the threshold nature of optimal policy, the policy
the interval ½0; L for a large L > 0. Therefore, the primal- space can be reduced significantly. To this end, we propose
dual RVIA can be described as follows. a structure-aware online learning algorithm which searches
If the system is at PDS s^ at the nth iteration, then do the the optimal policy only from the set of threshold policies,
following: providing faster convergence than PDS learning Algorithm.
Note that independent methodologies which are developed
V^nþ1 ð^
sÞ ¼ ð1  gðgð^
s; nÞÞÞV^n ð^ s; nÞÞfmax½rðs0 ; a; bÞ
sÞ þ gðgð^ in this section can be applied to any learning problem hav-
a
ing similar structural properties.
þ V^n ðs^0 Þ  V^n ðs^ Þg; (11)
V^nþ1 ðs^00 Þ ¼ V^n ðs^00 Þ 8s^00 6¼ s^;
6.1 Gradient Based Online Algorithm
bnþ1 ¼ L½bn þ hðnÞðBn  Bmax Þ: (12) Let the throughput increment in WiFi when the number
of WiFi users increases from k to ðk þ 1Þ be denoted by
The assumptions on gðnÞ and hðnÞ (Equations (7) and (8)) ~W;D ðkÞ. Therefore, R~W;D ðkÞ ¼ ðk þ 1ÞRW;D ðk þ 1Þ  kRW;D ðkÞ.
R
ensure that two quantities are updated on two different We assume the following.
timescales. The value of LM is updated on a slower time-
scale than that of the value function. From the slower LM Assumption 1. R~W;D ðkÞ is a non-increasing function of k.
timescale point of view, V^ð^
sÞ appears to be equilibrated in This assumption is in line with the full buffer traffic
accordance with the current LM value, and from the faster model [20].
timescale view, LM appears to be almost constant. This Summary of the structural properties of the optimal pol-
two-timescale scheme induces a “leader-follower” behavior. icy is as follows. Detailed proofs of the structural properties
The slow (fast) timescale iterate does not interfere in the can be found in [19].
convergence of the fast (slow) timescale iterate.
(1) Upto a threshold on the number of WiFi data users
Theorem 1. The schemes (11)-(12) converges to (V^; b ) “almost (say kth ) serve data users in WiFi (A3 ) and then serve
surely”(a.s.). them using LTE (A2 ) until LTE is full. When LTE is
Proof. Proof is provided in Section 10.1. in the supplemen- full, i.e., ði þ jÞ ¼ C, the optimal policy is to serve all
tal material file. u
t data users using WiFi until k ¼ W , where an incom-
ing data user is blocked.
Based on the analysis presented above, the two timescale (2) f8i; jjði þ jÞ < Cg and a voice user arrival, A4 ðA2 Þ is
PDS online learning algorithm is described in Algorithm 1. better than A2 ðA4 Þ if k < kth ðk  kth Þ.
As described in the algorithm, value functions associated (3) f8i; jjði þ jÞ < Cg and a voice user arrival, if the
with different states, the LM and the number of iterations optimal action in state ði; j; kÞ is blocking, then the
are initialized at the beginning. Based on a random event optimal action in state ði þ 1; j; kÞ is blocking.
(arrival or departure of voice/data user), the system state is (4) f8i; jjði þ jÞ ¼ Cg and a voice user arrival, if the opti-
initialized. When the current PDS of the system is s^, the sys- mal action in state ði; j; kÞ is blocking, then the opti-
tem chooses an action which maximizes the R.H.S expres- mal action in state ði þ 1; j  1; kÞ is blocking.
sion in Equation (9). Based on the observed reward in the Using the first two properties, we can eliminate a number of
current PDS s^0 , V^ð^
sÞ is updated along with the LM. This suboptimal actions. In the case of data user arrival (event
process is repeated for every decision epoch. E2 ) and departure of voice and data users (events E3 ; E4
and E5 ), a single decision is involved. This may provide
improved convergence because contrary to an online
Algorithm 1. PDS Learning Algorithm algorithm without any knowledge of structural property,
1: Initialize number of iterations k 1, value function vector we no longer need to learn optimal actions in some states.
V^ð^
sÞ 0; 8^s 2 S and the LM b 0. The only event where multiple decisions are involved is the
2: while TRUE do voice user arrival (event E1 ). As stated in Property 3 and 4,
3: Determine the event (arrival/departure) in the current the value of the threshold on i, where the optimal action
decision epoch. changes to blocking, is a function of j and k. Thus, if we
4: Choose action a which maximizes the R.H.S expression in have the knowledge of the values of thresholds, we can
Equation (9). characterize the policy completely. The idea is to optimize
5: Update the value function of PDS s^ using (9). over the threshold vector (say u) using an update rule, so
6: Update the LM according to Equation (10). that the value of the threshold vector u converges to the
7: Update s^ s^0 and k k þ 1. optimal value. Before proceeding further, we determine the
8: end while dimension of u using the analysis presented below.
Using Property 1 and 2, we can identify three regions.

6 STRUCTURE-AWARE ONLINE RAT SELECTION (1) 0  k < kth : Using Property 1, we have j ¼ 0. For
each value of k, we need to know the value of thresh-
ALGORITHM old which belongs to the set f0; 1; ::; Cg.
In this section, we propose a learning algorithm exploiting (2) k ¼ kth : Using Property 1, k ¼ kth ) j  0. Thus, it
the threshold properties of the optimal policy. The PDS boils down to computing a single threshold which
learning algorithm proposed in Section 5 does not take into belongs to the set f0; 1; ::C  jg (Property 3), for each
account the threshold nature of the optimal policy and value of j ð0  j < C). Also, we need to compute a
hence optimizes over the entire policy space. However, single threshold for ði þ jÞ ¼ C (Property 4).

Authorized licensed use limited to: UNIVERSITY PUTRA MALAYSIA. Downloaded on March 10,2020 at 05:36:03 UTC from IEEE Xplore. Restrictions apply.
ROY ET AL.: LOW COMPLEXITY ONLINE RADIO ACCESS TECHNOLOGY SELECTION ALGORITHM IN LTE-WIFI HETNET 383

(3) W > k > kth : Using Property 1, k > kth ) Therefore, according to the two-timescale gradient based
ði þ jÞ ¼ C. Thus, using Property 4, we need to learning framework, on the faster timescale, we have
obtain the threshold of blocking for ðW  kth  1Þ V nþ1 ðs; uÞ ¼ ð1  gðgðs; nÞÞÞV n ðs; uÞ þ gðgðs; nÞÞ½rðs; a; bÞ
values of k.
þ V n ðs0 ; uÞ  V n ðs ; uÞ; (16)
Therefore the dimension of u ¼ ðkth þ C þ W  kth Þ =
ðC þ W Þ. V nþ1 ðs ; uÞ ¼ V^n ðs00 ; uÞ; 8s00 6¼ s:
00

Remark 3. When the state space becomes too large, then it For example, if the current state is s ¼ ði; 0; 0Þ and ði < un ð0ÞÞ,
becomes cumbersome to represent a policy since this then state transition is determined by P0 ðs0 jsÞ (accept in LTE
requires tabulating actions corresponding to each state. (A2 )), i.e., s0 ¼ ði þ 1; 0; 0Þ, else, s0 is determined by P1 ðs0 jsÞ
Due to the threshold nature of the optimal policy, the (blocking (A1 )), i.e., s0 ¼ ði; 0; 0Þ. However, value functions
representation using the threshold vector becomes com- corresponding to other states are kept unchanged.
putationally efficient. Instead of storing the optimal Note that, the above scheme works for a fixed value of
action corresponding to each state, we just need to store threshold vector u and LM b. To obtain the optimal value
ðC þ W Þ individual thresholds. of u, u is to be iterated along the slower timescale hðnÞ.
Note that, although individual components of the thresh-
We consider a class of threshold policies which can be
old take discrete values, we interpolate them to the contin-
described in terms of the threshold vector u. The main idea
uous domain to be able to apply the online update rule.
behind the online algorithm is to compute the gradient of the
Since the threshold policy is a step function (governed by
system metric, i.e., the average reward of the system, with
P0 ðs0 jsÞ up to a threshold and P1 ðs0 jsÞ, thereafter) defined
respect to u and improve the policy by updating the value of
at discrete points, Assumption 2 is not satisfied at every
u in the direction of the gradient. Therefore, following [35],
point. Therefore we approximate the threshold policy in
one needs to compute the gradient of the system metric. To
state s by a randomized policy which is a function of u
express the dependence of the parameters associated with the
(fðs; uÞ, say). We define
underlying Markov chain on u explicitly, we need to redefine
the notations. Let the transition probability associated with Pss0 ðuÞ  P1 ðs0 jsÞfðs; uÞ þ P0 ðs0 jsÞð1  fðs; uÞÞ;
the Markov chain fXn g as a function of u be given by
ðiuðT Þ0:5Þ
where fðs; uðT ÞÞ ¼ 1þe
e
ðiuðT Þ0:5Þ in state s ¼ ði; j; kÞ, provides
Pss0 ðuÞ ¼ P ðXnþ1 ¼ s0 jXn ¼ s; uÞ:
a convenient approximation to the step function.
Assumption 2. We assume that for every s; s0 2 S, Pss0 ðuÞ is a Remark 4. The rationale behind the choice of this function
bounded, twice differentiable function, and the first and second is the fact that it is continuously differentiable, and the
derivative of Pss0 ðuÞ is bounded. derivative is nonzero everywhere.
Let the average reward of the Markov chain, steady state While designing an online update scheme for u, instead
stationary probability of state s, value function of state s (as of rrðun Þ (See Equation (14)), we can evaluate rPss0 ðuÞ. The
a function of u) be denoted by rðuÞ, pðs; uÞ and V ðs; uÞ, steady-state stationary probabilities inside the summation
respectively. The following proposition provides a closed- inside Equation (13) can be omitted by performing averag-
form expression for the gradient of the average reward of ing over time. We have
the system. A proof for the same can be found in [35].
Although [35] considers a generalized case where the rPss0 ðuÞ ¼ ðP1 ðs0 jsÞ  P0 ðs0 jsÞÞrfðs; uÞ: (17)
reward function depends on u, in our case the same proof
holds with the exception that the gradient of the reward In the right hand side of Equation (17), we incorporate a mul-
function is zero. tiplication factor of 12 since multiplication by a constant term
does not alter the online scheme. The physical significance of
Proposition 1. Under assumptions on Pss0 ðuÞ as stated before, this operation is that at any iteration, we have state transi-
we have tions according to P0 ð:j:Þ and P1 ð:j:Þ with equal probabilities.
X X The update of u in the slower timescale hðnÞ is as follows:
rrðuÞ ¼ pðs; uÞ rPss0 ðuÞV ðs0 ; uÞ: (13)
s0 2S
unþ1 ðT Þ ¼ DT ½un ðT Þ þ hðnÞrfðs; un ðT ÞÞð1Þan V n ðs0 ; un Þ;
s2S
(18)
Hence, we can compute the value of rrðuÞ (or rPss0 ðuÞ) unþ1 ðT 0 Þ ¼ un ðT 0 Þ 8T 0 ¼
6 T;
to construct an incremental scheme similar to a stochastic
gradient algorithm for the threshold values, of the form where an is a random variable which takes values 0 and 1
with equal probabilities. When it takes the value 0, then s0 is
unþ1 ¼ un þ hðnÞrrðun Þ; (14) determined by P1 ðs0 jsÞ, otherwise by P0 ðs0 jsÞ. The averaging
property of SA then leads to the effective drift (17). Depend-
where un represents the value of threshold vector in nth iter- ing on the state visited, the T th component of the vector u is
ation on the slower timescale hðnÞ. Given a threshold u, we updated as shown in Equation (18). For example, if the cur-
assume that the state transition in state s ¼ ði; j; kÞ is given rent state is ð1; 0; 0Þ, then un ð0Þ is updated (See Equa-
by P0 ðs0 jsÞ, if i < uðT Þ and P1 ðs0 jsÞ, otherwise, where uðT Þ tion (15)), and other components are kept unchanged. The
denotes the component of u which corresponds to state s. projection operator DT is a function which ensures that the
Specifically iterates remain bounded in the interval ½0; MðT Þ], where
 
k þ j; ði þ jÞ 6¼ C; C  ðT  kth Þ; if ðkth þ CÞ  T  kth ;
T ¼ (15) MðT Þ ¼ (19)
C þ k; ði þ jÞ ¼ C: C; else:

Authorized licensed use limited to: UNIVERSITY PUTRA MALAYSIA. Downloaded on March 10,2020 at 05:36:03 UTC from IEEE Xplore. Restrictions apply.
384 IEEE TRANSACTIONS ON MOBILE COMPUTING, VOL. 19, NO. 2, FEBRUARY 2020

TABLE 3 i.e., jSj values. While updating the PDS value function, PDS
Computational and Storage Complexities of Different Algorithms learning algorithm evaluates jAj functions, resulting in a
per-iteration complexity of jAj.
Algorithm Storage complexity Computational In the case of structure-aware learning algorithm, we no lon-
complexity
ger need to store jSj value functions. Rather, by virtue of the
Q-learning [1] OðjSj jAjÞ OðjAjÞ threshold nature of optimal policy, we consider three cases.
PDS learning OðjSjÞ ¼ OðC 2 W Þ OðjAjÞ
Structure-aware OðC 2 þ CW Þ Oð1Þ (1) 0  k < kth : Since we have j ¼ 0, for each value of k,
learning we need to store ðC þ 1Þ value functions.
(2) k ¼ kth : k ¼ kth ) j  0. Thus, we need to store
ðC þ 1  jÞ value functions, for each value of
Similar to Algorithm 1, to obtain the optimal value of b, b is to j ð0  j  C).
be iterated along the same timescale hðnÞ, as specified below: (3) W  k > kth : k > kth ) ði þ jÞ ¼ C. Therefore, we
need to store value functions of ðC þ 1Þ states for
bnþ1 ¼ L½bn þ hðnÞðBn  Bmax Þ; (20)
each value of k.
Therefore, the total number of value functions which need
Remark 5. The dynamics of LM and threshold vector are to be stored is ðC þ 1Þkth þ ðCþ1ÞðCþ2Þ
2 þ ðC þ 1ÞðW  kth Þ,
not dependent on each other directly. However, both b
which is equal to ðCþ1ÞðCþ2Þ þ ðC þ 1ÞW . Note that, this is a
and u iterates depend on value functions in the faster 2
timescale. Therefore u is updated in the same timescale as considerable reduction in storage complexity in comparison
that of b, without requiring a third timescale. to the PDS learning scheme having a storage complexity of
OðC 2 W Þ. For example, when W ¼ C, the storage complexity
Theorem 2. The schemes (16), (18) and (20) converge to opti- reduces from OðC 3 Þ to OðC 2 Þ. Furthermore, feasible actions
mality a.s. corresponding to each state need not be stored separately
Proof. Proof is provided in Section 10.2. in the supplemen- since the threshold vector completely characterizes the pol-
tal material file. u
t icy. The per-iteration computational complexity of this
scheme (see Equation (16)) is Oð1Þ. This scheme also involves
Based on the analysis described above, the structure- updating a single component of the threshold vector (Equa-
aware online learning algorithm is stated in Algorithm 2. As tion (18)) with a computational complexity of Oð1Þ.
described in the algorithm, value functions associated with
different states, the LM, the threshold vector and the number
of iterations are initialized at the beginning. When the current
8 SIMULATION RESULTS
state of the system is s, the system chooses the action which is In this section, proposed PDS learning and structure-aware
given by the current value of the threshold vector. Based on learning algorithms are simulated in ns-3 to characterize
the observed reward, V ðsÞ and u is updated along with the and compare their convergence behaviors. Convergence
LM. This process is repeated for every decision epoch. rates of the proposed algorithms are compared with that of
the Q-learning, as proposed in our earlier work [1]. Simula-
Algorithm 2. Structure-Aware Learning Algorithm tion results establish that the proposed PDS learning algo-
rithm provides improved convergence than Q-learning.
1: Initialize number of iterations k 1, value function Furthermore, it is observed that the knowledge of structural
V ðsÞ 0; 8s 2 S, the LM b 0 and the threshold vector properties indeed reduces the convergence time.
u 0.
2: while TRUE do
8.1 Simulation Model and Evaluation Methodology
3: Choose action a given by the current value of threshold
vector u. The simulation setup comprises a 3GPP LTE BS and an oper-
4: Update the value function of states s using (16). ator-deployed IEEE 802.11g WiFi AP. All users are assumed
5: Update the LM according to Equation (20). to be stationary. Data users are distributed uniformly within
6: Update threshold vector according to Equation (18). 30 m radius of the WiFi AP which is approximately 50 m
7: Update s s0 and k k þ 1. away from the LTE BS. LTE and WiFi network parameters
8: end while used in simulations are chosen based on 3GPP [36], [37]
models and saturation throughput [20] IEEE 802.11g WiFi [6]
model and described in Tables 4 and 5. We consider the gen-
7 COMPARISON OF COMPLEXITIES OF LEARNING eration of CBR uplink traffic for voice and data users in LTE.
ALGORITHMS This is implemented in ns-3 using an application (similar to
In this section, we provide a comparison of storage and ON/OFF application) developed by us.
computational complexities of traditional Q-learning [1], pro- For the update of the PDS value functions, threshold
posed PDS learning and structure-aware learning algorithms. vector and LM, we consider gðnÞ ¼ ðb n cþ2Þ
1
0:6 and hðnÞ ¼ n .
10
1000
We summarize the storage and computational complexities
of these schemes in Table 3. 8.2 Convergence Analysis
Q-learning algorithm [1] stores value functions for every Figs. 3a and 3b illustrate how the Q-learning, PDS learning
state-action pair, i.e., jSj jAj values and updates the value and structure-aware learning algorithms converge with
function of one state-action pair at a time. While updating increasing number of iterations (n). We keep v ¼ d ¼ 1. It
the value function, Q-learning evaluates jAj functions. The is evident that both the proposed algorithms outperform
PDS learning algorithm (see Equation (11)) requires storing Q-learning in terms of convergence speed. Contrary to PDS
jSj PDS value functions and feasible actions in every state, learning, even after a considerable amount of iterations,

Authorized licensed use limited to: UNIVERSITY PUTRA MALAYSIA. Downloaded on March 10,2020 at 05:36:03 UTC from IEEE Xplore. Restrictions apply.
ROY ET AL.: LOW COMPLEXITY ONLINE RADIO ACCESS TECHNOLOGY SELECTION ALGORITHM IN LTE-WIFI HETNET 385

TABLE 4 TABLE 5
LTE Network Model WiFi Network Model

Parameter Value Parameter Value


Maximum capacity 10 users Channel bit rate 54 Mbps
Voice bit rate of a single user 20 kbps UDP header 224 bits
Data bit rate of a single user 5 Mbps Packet payload 1500 bytes
Voice packet payload 50 bits Slot duration 20 ms
Data packet payload 600 bits Short inter-frame space (SIFS) 10 ms
Tx power for BS and MS 46 dBm and 23 Distributed Coordination Function IFS 50 ms
dBm (DIFS)
Noise figure for BS and MS 5 dB and 9 dB Minimum acceptable per-user 3.5 Mbps
Antenna height for BS and MS 32 m and 1.5 m throughput
Antenna parameter for BS and MS Isotropic Antenna Tx power for AP 23 dBm
Path loss (R in kms) 128:1 þ 37:6 log ðRÞ Noise figure for AP 4 dB
Antenna height for AP 2.5 m
Antenna parameter Isotropic antenna
Q-learning explores different actions with a finite probabil- Path loss (R in kms) 140:3 þ 36:7 log ðRÞ
ity, thereby reducing the convergence speed.
The knowledge of structure reduces the feasible policy
space. Therefore, the structure-aware learning algorithm and minimum values of total system throughput over this
offers faster convergence to the optimal policy. We no lon- window. We set the window size to be 100. If the ratio is
ger need to learn the optimal actions in a subset of states, more than 0.99, then we conclude that stopping criteria is
where the optimal policy is determined using structural reached, i.e., the obtained policy is in a close neighborhood
properties. As observed in Figs. 3a and 3b, the number of of the optimal policy with high probability. In Figs. 5a and
iterations before convergence reduces from 2000 to 300 and Pn
5b, total system throughput is plotted against k¼1 gðkÞ.
1000 to 300, respectively. Note that smaller number of itera- Pn
tions translates into lesser amount of time for convergence. Note that, k¼1 gðkÞ is chosen instead of n, to decouple the
Figs. 4a and 4b depict the convergence of LM as $n$ increases. It is evident that as the number of iterations increases, LM converges for both the schemes.

In Table 6, we illustrate the average time taken by a single iteration of the Q-learning, PDS learning and structure-aware learning algorithms, respectively, in simulations. The average per-iteration time of an algorithm reflects its per-iteration computational complexity. As observed from Table 6, the time taken by the structure-aware learning algorithm is the least. Also, the average per-iteration time of the PDS learning algorithm is slightly lower than that of Q-learning. These results corroborate the analysis illustrated in Table 3.

8.3 Stopping Criteria
While simulating an online learning algorithm, in practical cases, we may not want to wait till the actual convergence. Rather, we can observe the total system throughput over a moving window of the sums of step sizes till the present iteration ($\sum_{k=1}^{n} g(k)$) and calculate the ratio of the maximum and minimum values of total system throughput over this window. We set the window size to be 100. If the ratio is more than 0.99, then we conclude that the stopping criterion is reached, i.e., the obtained policy is in a close neighborhood of the optimal policy with high probability. In Figs. 5a and 5b, total system throughput is plotted against $\sum_{k=1}^{n} g(k)$. Note that $\sum_{k=1}^{n} g(k)$ is chosen instead of $n$ to decouple the effect of the diminishing step size while analyzing the convergence behavior of the proposed schemes. In other words, this parameter is selected to establish that the convergence of the algorithms is indeed due to the convergence to the optimal policy and not due to very small values of the step size $g(n)$ as $n$ becomes large. We observe in Figs. 5a and 5b that the PDS learning algorithm reaches the stopping criterion when $\sum_{k=1}^{n} g(k)$ becomes 700, which corresponds to approximately 1000 iterations. On the other hand, the structure-aware learning algorithm reaches the stopping criterion when $\sum_{k=1}^{n} g(k)$ becomes 300, which translates into almost 450 iterations.
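A minimal sketch of this stopping rule follows. It assumes the ratio is computed as the window minimum over the window maximum (so values close to 1 indicate a flat throughput curve) and, for simplicity, indexes the window by the most recent samples; this is our reading of the criterion, not the authors' exact implementation.

```python
from collections import deque

def make_stopping_check(window_size: int = 100, ratio_threshold: float = 0.99):
    """Return a callable that ingests throughput samples and reports convergence."""
    window = deque(maxlen=window_size)

    def check(throughput_sample: float) -> bool:
        window.append(throughput_sample)
        if len(window) < window_size:
            return False  # not enough samples observed yet
        peak = max(window)
        # A nearly flat window (min close to max) signals convergence.
        return peak > 0 and min(window) / peak > ratio_threshold

    return check

# Usage: feed the observed total system throughput after every iteration.
should_stop = make_stopping_check()
```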
8.4 Consideration of Realistic Scenarios
While the above simulation results provide significant insight into the convergence behavior of the proposed algorithms over traditional Q-learning, in this section, we evaluate the performances of the proposed algorithms in realistic scenarios.

Fig. 3. Plot of total system throughput versus number of iterations ($n$) for different algorithms.


Fig. 4. Plot of LM versus number of iterations ($n$) for different algorithms.

We compare the total system throughput and voice user blocking probability performance of the proposed algorithms with that of the on-the-spot offloading [38] and LTE-preferred [19] schemes.
Although in the system model (see Section 2) we consider single resource block allocation to LTE data users, in simulations we relax this assumption and consider proportional fair scheduling for the LTE BS, which dynamically assigns resources to the users based on user bandwidth demand. Users randomly generate individual bandwidth demands. However, we assume that the maximum data rate achievable for a single data user is 5 Mbps and that the bottleneck is in the access network. Furthermore, in the previous sections, there is no consideration of channel fading effects in LTE and WiFi. To address that, whenever we choose an action involving offloading of a user from one RAT to another ($A_4$ and $A_5$), the user with the worst channel is selected for offloading. For example, whenever $A_4$ is chosen and we offload a data user from LTE to WiFi, we always choose the data user with the lowest Signal-to-Noise Ratio (SNR). Since, in general, a user with a bad channel provides low throughput to the system, the user with the worst channel is chosen for offloading. We consider the Extended Pedestrian A model [39] for fading in LTE and Rayleigh fading for WiFi.
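A minimal sketch of this worst-channel selection rule, assuming each candidate user exposes an SNR estimate (the data structure below is illustrative, not the simulator's):

```python
from dataclasses import dataclass

@dataclass
class User:
    user_id: int
    snr_db: float  # illustrative per-user channel quality estimate

def pick_offload_candidate(candidates: list[User]) -> User:
    """Choose the user with the worst channel (lowest SNR) for offloading."""
    return min(candidates, key=lambda u: u.snr_db)

users = [User(1, 12.5), User(2, 3.1), User(3, 8.7)]
print(pick_offload_candidate(users).user_id)  # 2
```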
In on-the-spot offloading [38], data users always choose WiFi unless WiFi coverage is not present. Therefore, in our system model, on-the-spot offloading always associates data users with WiFi until capacity is reached in WiFi. Voice users are associated with LTE, and when LTE reaches its capacity, voice users are blocked. In the LTE-preferred scheme [19], voice and data users are associated with LTE until LTE reaches its capacity. When LTE reaches its capacity and a voice user arrives, the voice user is blocked if there is no data user in LTE. Otherwise, one existing data user is offloaded to WiFi if capacity is available in WiFi. Upon the departure of an existing voice or data user from LTE, an existing data user in WiFi, if any, is offloaded to LTE. While offloading, we always choose the user with the worst channel.
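The sketch below restates the two baseline association rules as simple admission functions. The state fields, and the fallback behaviour in cases the text above does not spell out, are illustrative assumptions rather than the authors' simulation code.

```python
from dataclasses import dataclass

@dataclass
class SystemState:
    lte_users: int
    wifi_users: int
    lte_capacity: int
    wifi_capacity: int
    lte_data_users: int   # data users currently served by LTE

def on_the_spot(state: SystemState, arrival: str) -> str:
    """On-the-spot offloading: data users go to WiFi while it has room,
    voice users go to LTE and are blocked once LTE is full."""
    if arrival == "data":
        # Behaviour once WiFi is full is not spelt out above; treating it as
        # an LTE association is an assumption of this sketch.
        return "wifi" if state.wifi_users < state.wifi_capacity else "lte"
    return "lte" if state.lte_users < state.lte_capacity else "block"

def lte_preferred(state: SystemState, arrival: str) -> str:
    """LTE-preferred: associate with LTE first; when LTE is full and a voice
    user arrives, push an existing LTE data user to WiFi if possible."""
    if state.lte_users < state.lte_capacity:
        return "lte"
    if (arrival == "voice" and state.lte_data_users > 0
            and state.wifi_users < state.wifi_capacity):
        return "lte_after_offloading_a_data_user_to_wifi"
    return "block"
```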
TABLE 6
Average Per-Iteration Time of Different Algorithms

Algorithm                   Average per-iteration time (ms)
Q-learning [1]              49.66
PDS learning                40.07
Structure-aware learning    32.119

8.4.1 Voice User Arrival Rate Variation
Fig. 6a depicts the blocking percentage of voice users for on-the-spot offloading, LTE-preferred and the proposed algorithms for varying $\lambda_v$. Since on-the-spot offloading blocks a voice user when LTE reaches its capacity, the blocking probability of voice users increases with $\lambda_v$. Since the PDS learning and structure-aware learning algorithms learn in which states blocking is to be chosen as the optimal action, the voice user blocking probabilities corresponding to these algorithms converge to the same value. Since the proposed algorithms may block voice users even when the LTE system does not reach its capacity, the blocking probability values are marginally higher than that of on-the-spot offloading. Voice users may be blocked to save LTE resources for future data user arrivals, which have a higher throughput contribution to the system. The LTE-preferred scheme blocks a voice user when the LTE system is full and there is no data user in LTE. Therefore, on-the-spot offloading and LTE-preferred schemes provide similar blocking probability performance.

Fig. 6b illustrates the total system throughput performance of different algorithms under varying $\lambda_v$. With an increase in $\lambda_v$, the average number of voice users in the system increases while the number of WiFi data users remains the same. Therefore, in the case of on-the-spot offloading, the total system throughput increases with $\lambda_v$. However, since the throughput of voice users is small compared to that of data users, the rate of increment is very small. The PDS learning and structure-aware learning algorithms learn the optimal policy, which does significant load balancing via $A_4$ and $A_5$. Also, while offloading, the proposed algorithms take the channel state of users into account. Thus, these algorithms outperform on-the-spot offloading in terms of total system throughput, with performance improvement varying from 10.72 percent (for $\lambda_v = 0.13$) to 28.72 percent (for $\lambda_v = 0.07$). With an increase in $\lambda_v$, the LTE-preferred scheme starts offloading data users to WiFi to accommodate incoming voice users. Under low WiFi load, the total throughput of the system increases. As WiFi load increases, the rate of increment decreases. However, both the proposed algorithms perform better than the LTE-preferred scheme.


Fig. 5. Plot of total system throughput versus sum of step sizes till the nth iteration ($\sum_{k=1}^{n} g(k)$) for different algorithms.

Fig. 6. Plot of voice user blocking fraction and total system throughput for different algorithms. (a) Voice user blocking percentage versus $\lambda_v$. (b) Total system throughput versus $\lambda_v$ ($\lambda_d = 1/20$, $\mu_v = 1/60$, and $\mu_d = 1/10$). (c) Voice user blocking percentage versus $\lambda_d$. (d) Total system throughput versus $\lambda_d$ ($\lambda_v = 1/6$, $\mu_v = 1/60$, and $\mu_d = 1/10$).

8.4.2 Data User Arrival Rate Variation
As observed in Fig. 6c, since in on-the-spot offloading, data and voice users are served using WiFi and LTE, respectively, changes in $\lambda_d$ do not impact the blocking probability of voice users. Performances of both the PDS learning and structure-aware learning algorithms are similar to that of on-the-spot offloading. Due to the presence of a constraint on the voice user blocking probability, most of the voice users are blocked when the LTE system reaches capacity. Therefore, the proposed algorithms do most of the blocking of voice users in the same decision epochs as on-the-spot offloading. Since the LTE-preferred scheme blocks voice users only when LTE does not have available capacity and there is no data user in LTE, the blocking probabilities of the LTE-preferred scheme and on-the-spot offloading are the same.

Since on-the-spot offloading associates data users with WiFi, with an increase in $\lambda_d$, the load in WiFi increases. As a result, as $\lambda_d$ (see Fig. 6d) increases, the effect of contention and channel fading reduces the rate of increment of throughput. Both the proposed algorithms perform better than on-the-spot offloading by virtue of optimal RAT selection and load balancing actions, which reduce the effect of contention in WiFi. Also, while offloading, the proposed algorithms take the channel state of users into account.


Fig. 7. (a) Total system throughput versus $\lambda_v$ ($\lambda_d = 1/20$, $\mu_v = 1/60$, and $\mu_d = 1/10$). (b) Total system throughput versus $\lambda_d$ ($\lambda_v = 1/6$, $\mu_v = 1/60$, and $\mu_d = 1/10$).

Therefore, the proposed algorithms outperform on-the-spot offloading, with performance improvement varying from 20 percent (for $\lambda_d = 0.1$) to 54.6 percent (for $\lambda_d = 0.5$). As $\lambda_d$ increases, the LTE-preferred scheme starts offloading more data users to WiFi. Therefore, the system throughput increases. Under high $\lambda_d$, the effect of contention is lesser than that of on-the-spot offloading, resulting in a better performance than on-the-spot offloading. However, the proposed algorithms perform better than the LTE-preferred scheme.

8.5 Consideration of User Mobility
In this section, we evaluate how the proposed algorithms perform in comparison to on-the-spot offloading and the LTE-preferred scheme in the face of user mobility. In addition to the ns-3 simulation settings described in the last section, we also consider the random waypoint model [40] for the mobility of voice and data users. As evident from Figs. 7a and 7b, although the total system throughputs provided by different algorithms change due to mobility, the comparative performance of the proposed algorithms with respect to on-the-spot offloading and the LTE-preferred scheme remains the same. Since mobility does not have any impact on the blocking probability of the voice users, the blocking probability performances of the considered algorithms are exactly the same as those described in Figs. 6a and 6c.

9 CONCLUSIONS
In this paper, we have proposed a PDS learning algorithm which can be implemented online without the knowledge of the statistics of arrival processes. It has been proved that the algorithm converges to the optimal policy. Furthermore, another online algorithm, which exploits the threshold structure of the optimal policy, has been proposed. The knowledge of the threshold structure provides improvements in computational and storage complexity and convergence time. The proposed algorithm provides a novel framework that can be applied for designing online learning algorithms for any general problem and is of independent interest. We have proved that the structure-aware learning algorithm converges to the globally optimal threshold vector. Simulation results have been presented to exhibit how the PDS paradigm and the knowledge of structural properties provide improvement in convergence time over traditional online association algorithms. Moreover, simulation results establish that the proposed schemes outperform on-the-spot offloading and LTE-preferred schemes under realistic network scenarios.

ACKNOWLEDGMENTS
This work has been funded by the Ministry of Electronics and Information Technology (MeitY), Government of India, as part of the "5G Research and Building Next Gen Solutions for Indian Market" project.

REFERENCES
[1] A. Roy, P. Chaporkar, and A. Karandikar, “An on-line radio access technology selection algorithm in an LTE-WiFi network,” in Proc. IEEE Wireless Commun. Netw. Conf., 2017, pp. 1–6.
[2] Y. He, M. Chen, B. Ge, and M. Guizani, “On WiFi offloading in heterogeneous networks: Various incentives and trade-off strategies,” IEEE Commun. Surveys Tuts., vol. 18, no. 4, pp. 2345–2385, Oct.–Dec. 2016.
[3] Cisco, “Cisco visual networking index: Global mobile data traffic forecast update, 2013–2018,” White Paper, 2014.
[4] V. G. Nguyen, T. X. Do, and Y. Kim, “SDN and virtualization-based LTE mobile network architectures: A comprehensive survey,” Wireless Pers. Commun., vol. 86, no. 3, pp. 1401–1438, 2016.
[5] N. M. Akshatha, P. Jha, and A. Karandikar, “A centralized SDN architecture for the 5G cellular network,” in Proc. IEEE 5G World Forum, 2018, pp. 147–152.
[6] Wireless LAN Medium Access Control (MAC) and Physical Layer (PHY) Specifications, IEEE Standard 802.11–2012, Part 11, 2012.
[7] 3GPP TR 37.834 v0.3.0, “Study on WLAN/3GPP Radio Interworking,” (2013, Jun.). [Online]. Available: http://www.3gpp.org/DynaReport/37834.htm
[8] A. Whittier, P. Kulkarni, F. Cao, and S. Armour, “Mobile data offloading addressing the service quality versus resource utilisation dilemma,” in Proc. IEEE 27th Annu. Int. Symp. Pers. Indoor Mobile Radio Commun., 2016, pp. 1–6.
[9] F. Moety, M. Ibrahim, S. Lahoud, and K. Khawam, “Distributed heuristic algorithms for RAT selection in wireless heterogeneous networks,” in Proc. IEEE Wireless Commun. Netw. Conf., 2012, pp. 2220–2224.
[10] E. Aryafar, A. Keshavarz-Haddad, M. Wang, and M. Chiang, “RAT selection games in HetNets,” in Proc. IEEE INFOCOM, 2013, pp. 998–1006.
[11] K. Lee, J. Lee, Y. Yi, I. Rhee, and S. Chong, “Mobile data offloading: How much can WiFi deliver?” IEEE/ACM Trans. Netw., vol. 21, no. 2, pp. 536–550, Apr. 2013.
[12] D. Suh, H. Ko, and S. Pack, “Efficiency analysis of WiFi offloading techniques,” IEEE Trans. Veh. Technol., vol. 65, no. 5, pp. 3813–3817, May 2016.
[13] N. Cheng, N. Lu, N. Zhang, X. Zhang, X. S. Shen, and J. W. Mark, “Opportunistic WiFi offloading in vehicular environment: A game-theory approach,” IEEE Trans. Intell. Transp. Syst., vol. 17, no. 7, pp. 1944–1955, Jul. 2016.


[14] A. Roy and A. Karandikar, “Optimal radio access technology selection policy for LTE-WiFi network,” in Proc. IEEE Int. Symp. Model. Optimization Mobile Ad Hoc Wireless Netw., 2015, pp. 291–298.
[15] G. S. Kasbekar, P. Nuggehalli, and J. Kuri, “Online client-AP association in WLANs,” in Proc. IEEE Int. Symp. Model. Optimization Mobile Ad Hoc Wireless Netw., 2006, pp. 1–8.
[16] K. Khawam, S. Lahoud, M. Ibrahim, M. Yassin, S. Martin, M. El Helou, and F. Moety, “Radio access technology selection in heterogeneous networks,” Phys. Commun., vol. 18, pp. 125–139, 2016.
[17] S. Barmpounakis, A. Kaloxylos, P. Spapis, and N. Alonistioti, “Context-aware, user-driven, network-controlled RAT selection for 5G networks,” Comput. Netw., vol. 113, pp. 124–147, 2017.
[18] B. H. Jung, N. O. Song, and D. K. Sung, “A network-assisted user-centric WiFi-offloading model for maximizing per-user throughput in a heterogeneous network,” IEEE Trans. Veh. Technol., vol. 63, no. 4, pp. 1940–1945, May 2014.
[19] A. Roy, P. Chaporkar, and A. Karandikar, “Optimal radio access technology selection algorithm for LTE-WiFi network,” IEEE Trans. Veh. Technol., vol. 67, no. 7, pp. 6446–6460, Jul. 2018.
[20] G. Bianchi, “Performance analysis of the IEEE 802.11 distributed coordination function,” IEEE J. Sel. Areas Commun., vol. 18, no. 3, pp. 535–547, Mar. 2000.
[21] M. El Helou, M. Ibrahim, S. Lahoud, K. Khawam, D. Mezher, and B. Cousin, “A network-assisted approach for RAT selection in heterogeneous cellular networks,” IEEE J. Sel. Areas Commun., vol. 33, no. 6, pp. 1055–1067, Jun. 2015.
[22] E. Khloussy, X. Gelabert, and Y. Jiang, “Investigation on MDP-based radio access technology selection in heterogeneous wireless networks,” Comput. Netw., vol. 91, pp. 57–67, 2015.
[23] R. Li, Z. Zhao, J. Zheng, C. Mei, Y. Cai, and H. Zhang, “The learning and prediction of application-level traffic data in cellular networks,” IEEE Trans. Wireless Commun., vol. 16, no. 6, pp. 3899–3912, Jun. 2017.
[24] K. Kumar, A. Gupta, R. Shah, A. Karandikar, and P. Chaporkar, “On analyzing Indian cellular traffic characteristics for energy efficient network operation,” in Proc. IEEE 21st Nat. Conf. Commun., 2015, pp. 1–6.
[25] U. Paul, A. P. Subramanian, M. M. Buddhikot, and S. R. Das, “Understanding traffic dynamics in cellular data networks,” in Proc. IEEE INFOCOM, 2011, pp. 882–890.
[26] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. Cambridge, MA, USA: MIT Press, 1998.
[27] K. Adachi, M. Li, P. H. Tan, Y. Zhou, and S. Sun, “Q-Learning based intelligent traffic steering in heterogeneous network,” in Proc. IEEE Veh. Technol. Conf. Spring, 2016, pp. 1–5.
[28] S. Anbalagan, D. Kumar, D. Ghosal, G. Raja, and V. Muthuvalliammai, “SDN-assisted learning approach for data offloading in 5G HetNets,” Mobile Netw. Appl., vol. 22, no. 4, pp. 771–782, 2017.
[29] ns-3 simulator. (2018, Dec.). [Online]. Available: http://code.nsnam.org/ns-3-dev/
[30] T. Bonald and J. W. Roberts, “Internet and the Erlang formula,” ACM SIGCOMM Comput. Commun. Rev., vol. 42, no. 1, pp. 23–30, 2012.
[31] E. Altman, Constrained Markov Decision Processes. Boca Raton, FL, USA: CRC Press, 1999.
[32] M. L. Puterman, Markov Decision Processes: Discrete Stochastic Dynamic Programming. Hoboken, NJ, USA: Wiley, 2014.
[33] F. J. Beutler and K. W. Ross, “Optimal policies for controlled Markov chains with a constraint,” J. Math. Anal. Appl., vol. 112, no. 1, pp. 236–252, 1985.
[34] V. S. Borkar, Stochastic Approximation: A Dynamical Systems Viewpoint. Cambridge, U.K.: Cambridge Univ. Press, 2008.
[35] P. Marbach and J. N. Tsitsiklis, “Simulation-based optimization of Markov reward processes,” IEEE Trans. Autom. Control, vol. 46, no. 2, pp. 191–209, Feb. 2001.
[36] 3GPP TR 36.814 v9.0.0, “Further advancements for E-UTRA physical layer aspects,” (2010, Mar.). [Online]. Available: http://www.3gpp.org/dynareport/36814.htm
[37] 3GPP TR 36.839 v11.1.0, “Mobility enhancements in heterogeneous networks,” (2012, Dec.). [Online]. Available: http://www.3gpp.org/dynareport/36839.htm
[38] F. Mehmeti and T. Spyropoulos, “Performance analysis of “on-the-spot” mobile data offloading,” in Proc. IEEE Global Commun. Conf., 2013, pp. 1577–1583.
[39] 3GPP TS 36.104 V10.2.0, “Base Station (BS) Radio Transmission and Reception,” (2011, Apr.). [Online]. Available: http://www.3gpp.org/dynareport/36104.htm
[40] D. B. Johnson and D. A. Maltz, “Dynamic source routing in ad hoc wireless networks,” in Mobile Computing. Berlin, Germany: Springer, 1996, pp. 153–181.
[41] N. Salodkar, A. Bhorkar, A. Karandikar, and V. S. Borkar, “An on-line learning algorithm for energy efficient delay constrained scheduling over a fading channel,” IEEE J. Sel. Areas Commun., vol. 26, no. 4, pp. 732–742, May 2008.
[42] V. R. Konda and V. S. Borkar, “Actor-critic-type learning algorithms for Markov decision processes,” SIAM J. Control Optimization, vol. 38, no. 1, pp. 94–123, 1999.
[43] J. Abounadi, D. Bertsekas, and V. S. Borkar, “Learning algorithms for Markov decision processes with average cost,” SIAM J. Control Optimization, vol. 40, no. 3, pp. 681–698, 2001.
[44] V. S. Borkar and S. P. Meyn, “The ODE method for convergence of stochastic approximation and reinforcement learning,” SIAM J. Control Optimization, vol. 38, no. 2, pp. 447–469, 2000.
[45] V. S. Borkar, “An actor-critic algorithm for constrained Markov decision processes,” Syst. Control Lett., vol. 54, no. 3, pp. 207–213, 2005.

Arghyadip Roy received the BE degree from Jadavpur University, Kolkata, India, in 2010, and the MTech degree from IIT Kharagpur, India, in 2012. He is currently a research scholar with the Department of Electrical Engineering, IIT Bombay, India. He previously worked with the Samsung R&D Institute-Bangalore, India. His research interests include resource allocation, optimization, and control of stochastic systems. He is a student member of the IEEE.

Vivek Borkar received the BTech degree in electrical engineering from IIT Bombay, in 1976, the MS degree in systems and control from Case Western Reserve University, in 1977, and the PhD degree in electrical engineering and computer science from the University of California at Berkeley, in 1980. He has held positions at the TIFR Center for Applicable Mathematics and the Indian Institute of Science in Bengaluru, and the Tata Institute of Fundamental Research, Mumbai, before joining IIT Bombay, Mumbai, as institute chair professor of electrical engineering in Aug. 2011. He has held visiting positions with the University of Twente, MIT, the University of Maryland at College Park, and the University of California at Berkeley. He is a fellow of the IEEE, the American Math. Society, TWAS, and the science and engineering academies in India. His research interests include stochastic optimization and control, covering theory, algorithms, and applications.

Prasanna Chaporkar received the MS degree from the Faculty of Engineering, Indian Institute of Science, Bangalore, India, in 2000, and the PhD degree from the University of Pennsylvania, Philadelphia, Pennsylvania, in 2006. He was an ERCIM post-doctoral fellow with ENS, Paris, France, and NTNU, Trondheim, Norway. Currently, he is an associate professor with the Indian Institute of Technology, Mumbai. His research interests include resource allocation, stochastic control, queueing theory, and distributed systems and algorithms. He is a member of the IEEE.

Abhay Karandikar is currently director of IIT Kanpur (on leave from IIT Bombay). He is also a member (part-time) of the Telecom Regulatory Authority of India (TRAI). In IIT Bombay, he served as institute chair professor with the Electrical Engineering Department, dean (Faculty Affairs) from 2017 to 2018, and head of the Electrical Engineering Department from 2012 to 2015. He is the founding member of the Telecom Standards Development Society, India (TSDSI), India's standards body for telecom. He was the chairman of TSDSI from 2016 to 2018. His research interests include resource allocation in wireless networks, software defined networking, frugal 5G, and rural broadband. He is a member of the IEEE.
