
IEEE INTERNET OF THINGS JOURNAL, VOL. 5, NO. 2, APRIL 2018

Semisupervised Deep Reinforcement Learning in Support of IoT and Smart City Services

Mehdi Mohammadi, Graduate Student Member, IEEE, Ala Al-Fuqaha, Senior Member, IEEE, Mohsen Guizani, Fellow, IEEE, and Jun-Seok Oh

Abstract—Smart services are an important element of the smart cities and the Internet of Things (IoT) ecosystems where the intelligence behind the services is obtained and improved through the sensory data. Providing a large amount of training data is not always feasible; therefore, we need to consider alternative ways that incorporate unlabeled data as well. In recent years, deep reinforcement learning (DRL) has gained great success in several application domains. It is an applicable method for IoT and smart city scenarios where auto-generated data can be partially labeled by users' feedback for training purposes. In this paper, we propose a semisupervised DRL model that fits smart city applications as it consumes both labeled and unlabeled data to improve the performance and accuracy of the learning agent. The model utilizes variational autoencoders as the inference engine for generalizing optimal policies. To the best of our knowledge, the proposed model is the first investigation that extends DRL to the semisupervised paradigm. As a case study of smart city applications, we focus on smart buildings and apply the proposed model to the problem of indoor localization based on Bluetooth low energy signal strength. Indoor localization is the main component of smart city services since people spend significant time in indoor environments. Our model learns the best action policies that lead to a close estimation of the target locations with an improvement of 23% in terms of distance to the target and at least 67% more received rewards compared to the supervised DRL model.

Index Terms—Bluetooth low energy indoor localization, deep learning, deep reinforcement learning (DRL), indoor positioning, Internet of Things (IoT), IoT smart services, reinforcement learning, semisupervised deep reinforcement learning, smart city.

Manuscript received January 30, 2017; revised May 4, 2017; accepted May 25, 2017. Date of publication June 9, 2017; date of current version April 10, 2018. This publication was made possible by NPRP Grant # [7-1113-1-199] from the Qatar National Research Fund (a member of Qatar Foundation). The statements made herein are solely the responsibility of the authors. (Corresponding author: Mohsen Guizani.) M. Mohammadi and A. Al-Fuqaha are with the Department of Computer Science, Western Michigan University, Kalamazoo, MI 49008 USA (e-mail: mehdi.mohammadi@wmich.edu; ala.al-fuqaha@wmich.edu). M. Guizani is with the Department of Electrical and Computer Engineering, University of Idaho, Moscow, ID 83844 USA (e-mail: mguizani@ieee.org). J.-S. Oh is with the Department of Civil and Construction Engineering, Western Michigan University, Kalamazoo, MI 49008 USA (e-mail: jun.oh@wmich.edu). Digital Object Identifier 10.1109/JIOT.2017.2712560

I. INTRODUCTION

THE RAPID development of Internet of Things (IoT) technologies motivated researchers and developers to think about new kinds of smart services that extract knowledge from IoT generated data. The scarcity of labeled data is a main issue for developing such solutions, especially for IoT applications where a large number of sensors participate in generating data without being able to obtain class labels corresponding to the collected data. Smart cities, as a prominent application area of the IoT, should provide a range of high-quality smart services to meet the citizens' needs [1]. Smart buildings are one of the main building blocks of smart cities as citizens spend a significant part of their time indoors. People nowadays spend over 87% of their daily lives indoors [2] for work, shopping, education, etc. Therefore, having a smart environment that provides services to meet the needs of its inhabitants is a valuable asset for organizations. Such services facilitate the development of smart cities. Location-aware services in indoor environments play a significant role in this era. Examples of applications of such services are smart home management [3], delivering cultural contents in museums [4], location-based authentication and access control [5], location-aware marketing and advertisement [6], [7], and wayfinding and navigation in smart campuses [8]. Moreover, locating users in indoor environments is very important for smart buildings because it serves as the link that enables the users to interact with other IoT services [9].

Deep learning is a powerful machine learning approach that provides function approximation, classification, and prediction capabilities. Reinforcement learning is another class of machine learning approaches for optimal control and decision making processes, where a software agent learns an optimal policy of actions over the set of states in an environment. In applications where the number of states is very large, a deep learning model can be used to approximate the action values (i.e., how good an action is in a given state). Systems that combine deep and reinforcement learning are in their initial phases but have already produced competitive results in some application areas (e.g., video games). Moreover, learning approaches with no or little supervision are expected to gain more momentum in the future [10], mimicking the natural learning processes of humans and animals.

IoT applications can benefit from the decision process for learning purposes. For example, in the case of location-aware services, location estimation can be seen as a decision process in which a software agent determines the exact or closest point to a specific target. In this regard, reinforcement learning [11] can be exploited to formulate and solve the problem. In a reinforcement learning solution, a software agent interacts with the environment and changes the state of the environment by performing some actions. Depending on the performed action, the environment sends a reward to the agent.
The agent tries to maximize its rewards over time by choosing those actions that result in higher rewards. A new variation of reinforcement learning, deep reinforcement learning (DRL), was recently demonstrated by Google to achieve high accuracy in the Atari games [12] and is a suitable candidate for the learning process in IoT applications.

In this paper, we propose a semisupervised DRL model to benefit from the large number of unlabeled data that are generated in IoT applications.

The points that motivate us for this paper are as follows.
1) In the world of IoT, where sensors generate a lot of data that cannot be labeled manually for training purposes, semisupervised approaches are valuable. Moreover, building a DRL framework that works in a semisupervised manner can serve many IoT applications.
2) Games demonstrated significant improvements using DRL [12]. IoT applications also can be seen as a game where the goal is to estimate the correct classification of a given input and hence can benefit from a DRL approach.
3) The learning process for the scale of smart cities requires many efforts including data gathering, analysis, and classification. The strength of deep learning models stems from the latest advancements in computational and data storage capabilities. Such models can be utilized to develop scalable and efficient learning solutions for smart cities from crowd-sensed big data.
4) Smart city applications can be trained in a lab and deployed in a real environment without losing performance. For example, a self-driving car needs to learn how to perform in a variety of conditions (e.g., approaching pedestrians, handling traffic signs, etc.) which can be learned in a few test drives. But it is impossible to account for all the scenarios that might happen in a given city.

Also, this paper is motivated by specific observations regarding the localization problem.
1) While WiFi fingerprinting has been studied widely in the past decade for indoor positioning and its accuracy is in the range of 10 m, Bluetooth low energy (BLE) is in its infancy for indoor localization and has yielded more fine-grained results [13].
2) There are many practical applications that need an efficient mechanism for positioning in small-scale environments, such as robotic soccer games to locate the position of the ball, or navigating robots in a building. Our proposed approach can be extended and used in such scenarios aiming to enhance micro-localization accuracy in support of smart environments [9].

The contributions of this paper are as follows.
1) We propose a semisupervised DRL framework based on deep generative models and reinforcement learning that combines the strengths of deep neural networks and statistical modeling of data density in a reinforcement learning paradigm. To the best of our knowledge, this paper is the first attempt to address semisupervised learning through DRL.
2) We leverage both labeled and unlabeled data in our model. Since unlabeled data are more prevalent, this is a key feature for IoT applications where IoT sensors generate large volumes of data while they cannot be labeled easily. Therefore, our approach helps to alleviate the need for a lot of labeled data. In addition, the performance of DRL is enhanced by using the proposed semisupervised approach.

The idea of extending reinforcement learning algorithms to semisupervised reinforcement learning has not been studied well so far. There are some suggestions that explain the possibility of semisupervised reinforcement learning by having unlabeled episodes in which the agent does not receive its rewards from the environment [14], [15], but there is no implementation of such extensions so far. Our proposed semisupervised DRL, however, follows a different approach where we incorporate a variational autoencoder (VAE) [16] in our framework as the semisupervised module to infer the classification of unlabeled data and incorporate this information along with the labeled data to optimize its discriminating boundaries.

To apply the proposed model on a smart city scenario, we chose to perform the experiments on smart buildings, which play a significant role in smart cities. Our experimental results assert the efficiency of the proposed semisupervised DRL model compared to the supervised DRL model. Specifically, the results have been improved by 23% for a small number of training epochs. Also, considering the average performance of both models in terms of received rewards, the semisupervised model outperforms the supervised model by obtaining twice as many rewards.

The rest of this paper is organized as follows. Section II organizes the recent related works into two parts: one that reviews attempts that utilize DRL models and the other that deals with indoor localization as a case study. Section III presents related background and then introduces the details of the proposed approach. Section IV presents a use-case study in which the proposed model is used for indoor localization systems using iBeacon signals. Experimental results are presented in Section V followed by concluding remarks in Section VI.

II. RELATED WORK

In the following sections, we first review some recent research efforts that utilize DRL. Then, we address the latest research efforts that address the indoor localization problem from a machine learning perspective.

A. Deep Reinforcement Learning

DRL has been proposed in recent years [12] and is gaining attention to be applied in various application domains. In the following paragraphs, we review some of the latest research efforts that utilize DRL in different application areas.

Nemati et al. [17] utilized a DRL algorithm to learn actionable policies for administering an optimal dose of medicines like heparin for individuals. They used a sample dataset of dosage trials and their outcomes from a set of electronic medical records.

In their model, they used a discriminative hidden Markov model (HMM) for state estimation and a Q-network with two layers of neurons. The medication dosage agent tries to learn the optimal policy by maximizing its total reward, which is the overall fraction of time when patients are in their therapeutic activated partial thromboplastin time range.

DRL has also been applied for vehicle image classification as reported in [18]. In that work, the authors propose a convolutional neural network (CNN) model combined with a reinforcement learning module to guide where to look in the image for the key parts of a car. The information entropy of the classification probability of a focused image that is produced by their CNN model is considered as the reward for the reinforcement learning agent to learn to identify the next visual attention area in the image. The work in [19] also reports on object localization in images by focusing attention on candidate regions using a DRL approach. In [20], a visual navigation application is presented that uses a variation of DRL by which robots can navigate in a space toward a visual target.

Li et al. [21] also developed a DRL approach for traffic signal timing aiming to have a better signal timing plan. In their model, which consists of a four-layer stacked autoencoder neural network to estimate the Q-function, they use the queuing lengths of eight incoming lanes to an intersection as the state of the system at each time. They also define two actions: 1) stay in the current traffic lane or 2) change the lane to allow other traffic to go through the intersection. The absolute value of the difference between the lengths of opposite lanes serves as the reward function. Their results show that their proposed model outperformed other conventional reinforcement learning approaches.

Resource management is another task that can use DRL as its underlying mechanism. In their report, Mao et al. [22] formulated the problem of job scheduling with multiple resource demands as a DRL process. In their approach, the objective is to minimize the average job slowdown. The reward function was defined based on the reciprocal duration of the job in order to guide the agent toward the objective.

Another application in which DRL played a key role is natural language understanding for text-based games [23], [24]. For example, Narasimhan et al. [23] used long short-term memory networks to train the agent with useful representations of text descriptions and a deep Q-network to approximate Q-functions. Other disciplines like energy management also have incorporated DRL to improve energy utilization [25].

B. Review of Indoor Localization

To the best of our knowledge, there are no prior research efforts that utilized DRL for localization. In the following paragraphs, we review the different machine learning approaches that were utilized in the recent research literature to provide indoor localization services.

Among the approaches for deploying location-based services, received signal strength (RSS) fingerprinting is one of the most promising. However, there are some challenges that need to be considered in the deployment of such an approach, including fingerprint annotation and device diversity [26]. The use of fingerprint-based approaches to identify an indoor position has been studied well in the past decade. Researchers have studied different machine learning approaches in this context including SVM, KNN, Bayesian-based filtering, transfer learning, and neural networks [27]. It has been shown that for coarse-grained positioning applications based on BLE RSS fingerprinting, the estimation to decide if a device is inside a room or not yields fairly reliable results [28].

Wang et al. [29] reported their experimental indoor positioning results based on BLE RSS readings. They study the accuracy of three methods: 1) least square estimation (LSE); 2) three-border positioning; and 3) centroid positioning. In their testing area, which is a 6 × 8 square meter classroom with four BLE stations at the corners, the LSE algorithm shows more accurate positions compared to the two other methods. However, the overall accuracy of these three algorithms is satisfactory.

Museums are good environments for using BLE to provide location-awareness since usually the building and its contents do not allow changes due to preservation policies. Alletto et al. [4] developed a system to make interactive cultural displays in a museum with BLE beacons combined with an image recognition wearable device. The wearable device performs localization by receiving BLE signals from the beacons to identify the room in which it is located. It also identifies artworks by an image processing service. The combination of the closest beacon identifier and the artwork identifier is fed to a processing center to retrieve the appropriate cultural content.

Wang et al. [30] presented a system called DeepFi that utilizes a deep learning method over fingerprinting data to locate indoor positions based on channel state information (CSI). As in many other fingerprinting approaches, their system consists of offline training and online localization phases. In the offline training phase, they exploit deep learning to train all the weights as fingerprints based on the previously stored CSI. Their evaluations in living room and laboratory settings show that the use of deep learning results in an improved localization accuracy of 20%. While their use of the CSI approach is limited to WiFi networks, not all available network interface cards in the market support obtaining measurements from the different network channels.

In [31], deep learning joined with semisupervised learning as well as extreme learning machines are applied to unlabeled data to study the performance of the feature extraction and classification phases of indoor localization. In their study, the deep learning network and semisupervised learning generate high-level abstract features and more accurate classification, while the extreme learning machine can speed up the learning process. Their test setting is a 10 × 15 square meter area. Their results show that deep learning can improve the accuracy of fingerprinting by at least 1.3% for the same training dataset compared to a shallow learning method. Also, increasing unlabeled data has a positive effect on the accuracy compared to shallow feature methods. Compared to other deep learning methods including stacked autoencoders, deep belief networks, and multilayer extreme learning machines, their approach improves the accuracy by at least 10%.

In another study [27], Zhang et al. proposed a WiFi localization approach using deep neural networks (DNNs).

In their system, a four-layer deep learning model is used to extract features from WiFi RSS data. In their approach, the authors use stacked denoising autoencoders and backpropagation for the training steps. In the online positioning phase, the estimated position based on the DNN is refined by an HMM component. Their experiments assert that the number of hidden layers and neurons has a direct effect on the localization accuracy. Increasing the layers leads to better results, but at some point when the network is made deeper, the results start degrading. Their result shows that when using three hidden layers with 200 neurons for each layer, the model achieves the best accuracy.

Ding et al. [32] also used an artificial neural network (ANN) for WiFi fingerprinting localization. They proposed a localization approach that uses ANNs in conjunction with a clustering method based on affinity propagation. By affinity propagation clustering, the training of the ANN model has been faster and the memory overhead has been lowered. They also reported improved positioning accuracy compared to other baseline methods.

In [33], a deep belief network (DBN) is used for a localization approach that is based on fingerprinting of ultrawideband signals in an indoor environment. Parameters of the channel impulse response are used to get a dataset of fingerprints. Compared to other methods, the author demonstrated that the DBN can improve the localization accuracy.

The work in [34] also reports using a deep learning model in conjunction with a regression model to automatically learn discriminative features from the received wireless signal. The authors also use a softmax regression algorithm to perform device-free localization and activity recognition. They report that their proposed method can improve the localization accuracy by 10% compared to other methods.

Semisupervised algorithms have also been widely applied to the localization problem to utilize the unlabeled data for the prediction of an unknown location. For example, Pulkkinen et al. [35] presented a semisupervised algorithm based on the manifold assumption to obtain tagged fingerprints out of unlabeled data using a small amount of labeled data. They map the high-dimensional space of fingerprints into a 2-D space and achieved an average error of 2 m.

This paper presents several significant differences compared to the aforementioned approaches. First, related research studies in DRL do not exploit the statistical information of unlabeled data, while our proposed DRL approach is extended to be semisupervised and utilizes both labeled and unlabeled data. Second, these approaches provide an application-dependent solution, while this paper is a general framework that can work for a variety of IoT applications. Third, for localization systems, all aforementioned deep learning solutions rely on WiFi fingerprinting, while the context of BLE fingerprinting has not been studied in conjunction with deep learning or reinforcement learning approaches.

III. BACKGROUND AND PROPOSED APPROACH

In the following sections, we first describe the fundamentals of variational autoencoders (VAEs). Then we describe our proposed semisupervised DRL model by adopting a variational autoencoder in a DRL model. We develop the theoretical foundation of our method based on [12] and [16].

Fig. 1. High-level concept of a variational autoencoder adopted for DRL.

A. Semisupervised Learning Using VAE

Semisupervised learning methods aim to improve the generalization of supervised learning tasks using unlabeled data [36]. They usually use a small set of annotated data along with a larger number of unlabeled data to train the model. In a semisupervised setting, we have two datasets: one is labeled and the other is unlabeled. The labeled dataset is denoted by $X_l = (x_1, \ldots, x_l)$, for which labels $Y_l = (y_1, \ldots, y_l)$ are provided. The other set is $X_u = (x_{l+1}, \ldots, x_{l+u})$ with unknown labels. Semisupervised algorithms are built based on at least one of the following three assumptions [37]: 1) the smoothness assumption, which states that if two points $x_1$ and $x_2$ are close to each other, then their corresponding labels are very likely to be close to each other; 2) the cluster assumption, which implies how to identify discrete clusters and states that if two points are in the same cluster, it is more probable that they have the same class label; and 3) the manifold assumption, which points out that high dimensional data can be mapped to a lower dimensional one (i.e., the principle of parsimony) such that the supervised algorithm still approximates the true class of a data point.

For the semisupervised part of our proposed model, we adopt the deep generative model based on the VAE [16]. This model has been used for semisupervised tasks such as the recognition of handwritten digits, house number classification, and motion prediction [38] with impressive results. Fig. 1 shows the structure of a typical VAE model. For each data point $x_i$ there is a vector of corresponding latent variables denoted by $z_i$. The distribution of labeled data is represented by $\tilde{p}_l(x, y)$, while unlabeled data are represented by $\tilde{p}_u(x)$. The latent feature discriminative model (M1) is created based on

$$p(z) = \mathcal{N}(z \mid 0, I), \qquad p_\theta(x \mid z) = f(x; z, \theta) \tag{1}$$

in which $p(z)$ is Gaussian distributed with mean vector $0$ and variances presented in an identity matrix $I$. The function $f(x; z, \theta)$ is a nonlinear likelihood function with parameter $\theta$ for latent variable $z$ based on a deep neural network.
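To make the latent-variable machinery concrete, the following is a minimal NumPy sketch (not the authors' implementation) of the two VAE building blocks used throughout this section: drawing a reparameterized sample from the Gaussian inference network $q_\phi(z \mid x)$ introduced in (3) below, and evaluating the closed-form KL penalty against the prior $p(z) = \mathcal{N}(0, I)$ of (1) that appears in the lower bound (5). The two-layer encoder, its sizes, and the variable names are illustrative assumptions only.

```python
# Minimal NumPy sketch of the VAE inference-network pieces: the reparameterized
# sample z ~ q_phi(z|x) and the KL(q_phi(z|x) || N(0, I)) term.  The encoder
# weights here are random and purely illustrative.
import numpy as np

rng = np.random.default_rng(0)
X_DIM, H_DIM, Z_DIM = 169, 64, 20          # assumed sizes (169-D RSSI features)

W1, b1 = rng.normal(0, 0.1, (X_DIM, H_DIM)), np.zeros(H_DIM)
W_mu, b_mu = rng.normal(0, 0.1, (H_DIM, Z_DIM)), np.zeros(Z_DIM)
W_ls, b_ls = rng.normal(0, 0.1, (H_DIM, Z_DIM)), np.zeros(Z_DIM)

def encode(x):
    """q_phi(z|x) = N(mu_phi(x), diag(sigma_phi^2(x))), cf. (3)."""
    h = np.tanh(x @ W1 + b1)
    return h @ W_mu + b_mu, h @ W_ls + b_ls     # mean and log-variance vectors

def sample_z(mu, log_sigma2):
    """Reparameterization trick: z = mu + sigma * eps with eps ~ N(0, I)."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_sigma2) * eps

def kl_to_standard_normal(mu, log_sigma2):
    """KL(N(mu, diag(sigma^2)) || N(0, I)), the penalty term in (5)."""
    return 0.5 * np.sum(np.exp(log_sigma2) + mu**2 - 1.0 - log_sigma2)

x = rng.normal(size=X_DIM)                      # one illustrative feature vector
mu, log_sigma2 = encode(x)
z = sample_z(mu, log_sigma2)
print("KL term:", kl_to_standard_normal(mu, log_sigma2))
```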


The generative semisupervised model for generating data using a latent class variable $y$, in addition to a latent variable $z$, is (M2)

$$p(y) = \mathrm{Cat}(y \mid \pi), \qquad p(z) = \mathcal{N}(z \mid 0, I), \qquad p_\theta(x \mid y, z) = f(x; y, z, \theta) \tag{2}$$

where $\mathrm{Cat}(y \mid \pi)$ represents a categorical distribution or, in general, a multinomial distribution with a vector of probabilities $\pi$ whose elements sum up to 1. In the dataset, if no label is available, the unknown labels $y$ are considered as latent variables in addition to $z$.

The models have two lower bound objectives. To describe the model objectives, a fixed-form distribution $q_\phi(z \mid x)$ is introduced with parameter $\phi$ that helps us to estimate the posterior distribution $p(z \mid x)$. For all latent variables in the models, an inference deep neural network is introduced to generate a distribution of the form $q_\phi(\cdot)$. For M1, a Gaussian inference network $q_\phi(z \mid x)$ is used for latent variable $z$

$$q_\phi(z \mid x) = \mathcal{N}\big(z \mid \mu_\phi(x), \mathrm{diag}\big(\sigma_\phi^2(x)\big)\big) \tag{3}$$

in which $\mu_\phi(x)$ is the vector of means, $\sigma_\phi(x)$ is the vector of standard deviations, and $\mathrm{diag}$ creates a diagonal matrix. For M2, an inference network is used for latent variables $z$ and $y$ using Gaussian and multinomial distributions, respectively

$$q_\phi(z \mid y, x) = \mathcal{N}\big(z \mid \mu_\phi(y, x), \mathrm{diag}\big(\sigma_\phi^2(x)\big)\big), \qquad q_\phi(y \mid x) = \mathrm{Cat}\big(y \mid \pi_\phi(x)\big) \tag{4}$$

where $\pi_\phi(x)$ is a vector of probabilities.

The lower bound for M1 is

$$\log p_\theta(x) \ge \mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big] - \mathrm{KL}\big(q_\phi(z \mid x) \,\|\, p_\theta(z)\big) = -\mathcal{J}(x) \tag{5}$$

in which $\mathrm{KL}$ is the Kullback-Leibler divergence between the encoding and prior distributions and can be obtained as $\mathrm{KL}(q_\phi \| p_\theta) = \sum_i q_{\phi_i} \log(q_{\phi_i} / p_{\theta_i})$.

For the model M2, two cases should be considered. The first one deals with labeled data

$$\log p_\theta(x, y) \ge \mathbb{E}_{q_\phi(z \mid x, y)}\big[\log p_\theta(x \mid y, z) + \log p_\theta(y) + \log p_\theta(z) - \log q_\phi(z \mid x, y)\big] = -\mathcal{L}(x, y). \tag{6}$$

When dealing with unlabeled data, $y$ is treated as a latent variable and the resulting lower bound is

$$\log p_\theta(x) \ge \mathbb{E}_{q_\phi(y, z \mid x)}\big[\log p_\theta(x \mid y, z) + \log p_\theta(y) + \log p_\theta(z) - \log q_\phi(y, z \mid x)\big] = -\mathcal{U}(x). \tag{7}$$

Then the whole dataset has its bound of marginal likelihood as

$$\mathcal{J} = \sum_{(x, y) \sim \tilde{p}_l} \mathcal{L}(x, y) + \sum_{x \sim \tilde{p}_u} \mathcal{U}(x). \tag{8}$$

By adding a classification loss to the above function, the optimized objective function becomes

$$\mathcal{J}^{\alpha} = \mathcal{J} + \alpha \cdot \mathbb{E}_{\tilde{p}_l(x, y)}\big[-\log q_\phi(y \mid x)\big] \tag{9}$$

where $\alpha$ adjusts the contributions of the generative and discriminative models in the learning process. During the training process for both models M1 and M2, the stochastic gradient of $\mathcal{J}$ is computed at each minibatch to be used for updating the generative parameters $\theta$ and the variational parameters $\phi$.
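To illustrate how the $\alpha$-weighted classification term in (9) is combined with the labeled and unlabeled bounds of (8), the following is a small NumPy sketch (not the authors' code); the per-example bound values, the classifier probabilities, and $\alpha$ are placeholder numbers.

```python
# Illustrative assembly of the objective J^alpha in (9):
# J = sum of labeled bounds L(x, y) + sum of unlabeled bounds U(x),
# plus alpha * E[-log q_phi(y|x)] over the labeled minibatch.
import numpy as np

def objective_alpha(L_labeled, U_unlabeled, q_y_given_x, y_true, alpha=0.1):
    """L_labeled:   per-example bounds L(x, y) for a labeled minibatch
       U_unlabeled: per-example bounds U(x) for an unlabeled minibatch
       q_y_given_x: classifier probabilities q_phi(y|x), shape (n_labeled, n_classes)
       y_true:      integer class labels for the labeled minibatch"""
    J = np.sum(L_labeled) + np.sum(U_unlabeled)                  # Eq. (8)
    nll = -np.log(q_y_given_x[np.arange(len(y_true)), y_true])   # -log q_phi(y|x)
    return J + alpha * np.mean(nll)                              # Eq. (9)

# Toy example with made-up numbers:
L = np.array([2.3, 1.8])                     # labeled bounds
U = np.array([2.9, 3.1, 2.7])                # unlabeled bounds
q = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.8, 0.1]])              # q_phi(y|x) for two labeled points
print(objective_alpha(L, U, q, y_true=np.array([0, 1])))
```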


B. Semisupervised Deep Reinforcement Learning

To adopt a DRL approach, we need to define the following elements for a Markov decision process (MDP). The goal of the MDP in a reinforcement learning problem is to maximize the earned rewards.

Environment: The environment is the territory that the learning agent interacts with.

Agent: The agent observes the environment, receives sensory data, and performs a valid action. It then receives a reward for its action. Through training, the agent learns to maximize its rewards.

States: The finite set of states that the environment can assume. Each action of the agent puts the environment in a new state.

Actions: The finite set of available actions that the agent can perform, causing a transition from state $s_t$ at time $t$ to state $s_{t+1}$ at time $t+1$.

Reward Function: This function is the immediate feedback for performing an action. The reward function can be defined such that it reflects the closeness of the current state to the true class label, i.e., $r(s_t, a_t, s_{t+1}, y) = \mathrm{closeness}(s_{t+1}, y)$. Depending on the problem, different distance measurements can be applied. The point is that we need to devise larger positive rewards for more compelling results and negative rewards for distracting ones.

State Transition Distribution: It is the probability that action $a$ in state $s$ at time $t$ will lead to state $s'$ at time $t+1$: $P_a(s, s') = \Pr(s' \mid s, a)$.

Having these components, the main problem is to find a policy $\pi$ (where $\pi = a_t$) that maximizes the rewards $R_t = \sum_{t=0}^{\infty} \gamma^t r_t$, in which $\gamma$ is a discount factor, $0 \le \gamma < 1$.

In the deep Q-network approach, we need a deep neural network that approximates the optimal action-value function ($Q$) [39]

$$Q^{*}(s, a) = \max_{\pi} \mathbb{E}\big[r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots \mid s_t = s, a_t = a, \pi\big]. \tag{10}$$

This function finds the maximum sum of rewards $r_t$ discounted by $\gamma$ at each time-step $t$, achievable by a behavior policy $\pi = P(a \mid s)$, after making an observation ($s$) and taking an action ($a$). We can convert this equation to a simpler approximation function using the Bellman equation. For a sequence of states $s'$ and for all possible actions $a'$, if the optimal value $Q^{*}(s', a')$ is known, then we can obtain the optimal strategy by selecting the action $a'$ that maximizes the expected value of $r + \gamma Q^{*}(s', a')$

$$Q^{*}(s, a) = \mathbb{E}_{s'}\Big[r + \gamma \max_{a'} Q^{*}(s', a') \,\Big|\, s, a\Big]. \tag{11}$$

To estimate the optimal action-value function, we use a nonlinear function approximator (i.e., a neural network with weights $\theta$) such that $Q(s, a; \theta) \approx Q^{*}(s, a)$. The network can be trained by minimizing the loss function $L_i(\theta_i)$ that is updated at each time-step.

We perform experience replay, so we keep track of the agent's experiences $e_t = (s_t, a_t, r_t, s_{t+1})$ at each time-step $t$ in a replay dataset $D_t = \{e_1, \ldots, e_t\}$. This dataset of recently experienced transitions along with the experience replay mechanism are critical for the integration of reinforcement learning and deep neural networks [39].

Q-learning updates are applied on samples from the training data $(s, a, r, s')$ that are uniformly drawn from the experience replay storage $D$. The Q-learning update in iteration $i$ uses the following loss function:

$$L_i(\theta_i) = \mathbb{E}_{(s, a, r, s') \sim U(D)}\bigg[\Big(\underbrace{r + \gamma \max_{a'} Q(s', a'; \theta_{i-1})}_{\xi_i} - Q(s, a; \theta_i)\Big)^2\bigg] \tag{12}$$

in which $\theta_i$ represents the network parameters in iteration $i$, and the previous network parameters $\theta_{i-1}$ are used to compute the target ($\xi_i$). The gradient of the loss function is computed with respect to the weights of the network

$$\nabla_{\theta_i} L_i(\theta_i) = \mathbb{E}\Big[\Big(r + \gamma \max_{a'} Q(s', a'; \theta_{i-1}) - Q(s, a; \theta_i)\Big) \nabla_{\theta_i} Q(s, a; \theta_i)\Big]. \tag{13}$$

The semisupervised DRL algorithm is then described in Algorithm 1 to learn from both labeled and unlabeled data.

Algorithm 1 Semisupervised DRL Algorithm
1: Input: A dataset of labeled and unlabeled data {(Xl, Yl), Xu}
2: Initialize the model parameters θ, φ, environment, state space, and replay memory D
3: for episode ← 1 to M do
4:   for each sample (x, y) or x in dataset do
5:     s0 ← make observation of sample x
6:     for t ← 0 to T do
7:       Take an action at using the ε-greedy strategy
8:       Perform action at to change the current state st to the next state st+1
9:       if sample is unlabeled then
10:        Infer the label based on (4): qφ(y|x) and get approximate reward rt = closeness(st+1, y)
11:      else
12:        Observe reward rt that corresponds to label y
13:      Store transition (st, at, rt, st+1) in D
14:      Take a random minibatch of transitions (sk, ak, rk, sk+1) from D; 0 < k ≤ length(minibatch)
15:      if sk+1 is a terminal state then
16:        ξk = rk
17:      else
18:        ξk = rk + γ maxa′ Q(sk+1, a′; θ)
19:      Apply gradient descent on (ξk − Q(sk, ak; θ))² based on (13)
20:    end for
21:  end for each
22: end for

Fig. 2. Proposed model. (a) The DRL agent considers x values as the next state of the environment and y values as a mechanism to compute the reward. For unlabeled data, only the x values are incorporated into the model. (b) General deep neural network to be used for supervised DRL.

Fig. 2 shows the high-level model that uses the DRL technique in conjunction with a generative semisupervised model instead of a DNN [see Fig. 2(b)] to handle unlabeled observations. The VAE is extended to have an additional hidden layer and an output to generate the actions.
Algorithm 1 to learn from both labeled and unlabeled data. algorithm is performed offline while policy prediction is


As other learning processes, the training process for this algorithm is performed offline, while policy prediction is performed online. Hence, the algorithm can handle problems with high-dimensional and high-volume data using high performance computing facilities (e.g., cloud servers) to generate the model for online policy prediction. This ability stems from the integration of deep neural networks with reinforcement learning to generate approximation functions for high-dimensional datasets. The performance of this integrated model outperforms the traditional methods of reinforcement learning.

IV. USE CASE: INDOOR LOCALIZATION

Several use cases of the proposed approach can be envisaged in a smart city context. For example, this approach can be used for home energy management in conjunction with the nonintrusive load monitoring (NILM) method and smart meters. In such systems, a small set of labeled data provides individual appliances' usages and their on and off times. A semisupervised DRL model can be trained over this small-scale training dataset as well as the stream of unlabeled data with the objective of optimizing energy usage by controlling when to switch appliances on and off.

It can also be used in the context of intelligent transportation systems by smart vehicles for navigation in a city context. In such applications, a combination of several factors can be used for the reward function such as closeness to the destination, shortest path, speed, speed variability, etc. The vehicle needs to be trained on several test drives; then it uses the large set of unlabeled data to accurately navigate through the city.

Due to the importance of indoor localization and ease of implementation, we showcase the proposed method on the localization problem in the context of a smart campus, which is part of a larger smart city context. Despite the fact that indoor localization has been studied extensively in recent years, it is still an open problem bringing several challenges that need to be tackled.

Indoor positioning systems have been proposed with different technologies such as vision, visible light communications, infrared, ultrasound, WiFi, RFID, and BLE [40]. One determining factor for organizations to choose a technology is the cost of the underlying technologies and devices. Among the aforementioned technologies, BLE is a low-cost solution that has attracted attention for academic and commercial applications [9]. A combination of BLE and iBeacon technologies to design an indoor location-aware system brings many advantages to buildings that are not equipped with wireless networks. Since iBeacon devices are of a small form factor, they can be deployed quickly and easily without changing or even tapping into the building's electrical and communications infrastructure [40].

In recent years, deep learning has been shown to perform favorably compared to other machine learning approaches. One main challenge for deep learning is the need to collect a large volume of labeled data (also known as the calibration procedure). Typically, scanning a large-scale area like a city or a campus to collect unlabeled data is fairly straightforward. Therefore, to benefit from the enormous volume of unlabeled data, we apply the semisupervised DRL approach to investigate the benefits of unlabeled data in practical scenarios.

Compared to many related works that have performed their studies in a simulated environment, a small area, or in an isolated testbed, we conducted our experiments in an academic library that is a large and busy operational environment where thousands of visitors commute every day. So it is a valuable experiment that can be beneficial for the IoT and AI communities. In addition, there are no similar attempts that address the positioning problem through the reinforcement learning approach.

In this case study, we utilize a grid of iBeacons to implement a location-aware service offering in a campus setting. In this paper, we use the iBeacons' received signal strength indicator (RSSI) as the raw source of input data for a DRL model to identify indoor locations.

RSSI is usually represented by a negative number between 0 and −100, and in localization systems it can be used as an indication of the distance separating the transmitter from the receiver (i.e., ranging). In addition to the separating distance, RSSI is affected by some other factors such as movement of people and objects amidst the signals, temperature, and humidity of the environment. The distance estimation from a given point to an iBeacon can be derived as follows:

$$\mathrm{RSSI} = -(10n)\log_{10}(d) + A \tag{14}$$

where $n$ is the signal propagation constant, $d$ is the distance in meters, and $A$ is the offset RSSI reading at 1 m from the transmitter.
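Inverting (14) yields a simple range estimate from an observed RSSI value. The sketch below is an illustration only; the propagation constant and the 1-m offset are assumed values, not calibrated parameters from the paper.

```python
# Estimate distance d (in meters) from an RSSI reading by inverting (14):
# RSSI = -(10 * n) * log10(d) + A  =>  d = 10 ** ((A - RSSI) / (10 * n))
def estimate_distance(rssi, n=2.0, a=-60.0):
    """n: assumed path-loss constant; a: assumed RSSI at 1 m from the beacon."""
    return 10 ** ((a - rssi) / (10.0 * n))

print(estimate_distance(-75.0))   # roughly 5.6 m under these assumed parameters
```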
Due to fluctuations of the received signal strength, many research studies that utilize RSSI fingerprinting perform a preprocessing step to extract more representative features. Some of these preprocessing approaches include averaging multiple RSSI values for the same location, using a Gaussian distribution model to filter outliers, and using PCA to reduce the effect of noise in addition to offering new features. In this paper, we performed a categorization preprocessing in which an RSSI category represents a range of RSSI values. We explain the exact procedure in Section V-B.

A. Description of the Environment

The environment is represented as a set of positions that are labeled by row and column numbers. Each position is also associated with the set of RSSI values from the set of deployed iBeacons. The agent observes the environment by receiving RSSI values at each time. Our design requires the agent to take action based on the three most recent RSSI observations. The agent can choose one of the allowed eight actions to move in different directions. In turn, the agent obtains a positive or negative reward according to its proximity to the right point. The goal of the agent is to approximate the position of the device that has received the RSSI values from the environment by moving in different directions.

To adopt a DRL approach, we need to define the following elements for the MDP.


Environment: The active environment is a floor on which a particular position should be identified based on a vector of iBeacon RSSI values. The environment is divided into a grid of same-size cells as shown in Fig. 3.

Fig. 3. Illustration of a typical indoor environment for DRL.

Agent: The positioning algorithm itself is represented as an agent. The agent interacts with the environment over time.

States: The state of the agent is represented as a tuple of these observations:
1) a vector of RSSI values;
2) the current location (identified by row and column numbers);
3) the distance to the target (for labeled data).

Actions: The action is to move to one of the neighboring cells in a direction of North, East, West, South, or in-between directions like North West. The first action chooses a random state in the grid. Table I shows the list of allowed actions.

TABLE I. List of Actions to Perform Positioning.

Reward Function: The reward function is the reciprocal of the distance error. The reward function has a positive value if the distance to the target point is less than a threshold ($\delta$); otherwise, the agent receives a negative reward. Whenever the agent is close to the target, it gains more rewards. On the other hand, if the agent wanders away from the target and its distance is larger than the threshold ($\delta$), it gains a negative reward. The reward function is represented as follows:

$$r_t = \begin{cases} \dfrac{1}{\|O_t - S_t\|}, & \text{if } 0 < \|O_t - S_t\| \le \delta \\[1ex] -\|O_t - S_t\|, & \text{otherwise} \end{cases}$$

in which $O_t$ is the observed location and $S_t$ is the target location.
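A minimal sketch of this reward on the grid is given below (illustrative only, not the authors' code), taking grid positions as (row, col) pairs, Euclidean distance as the distance error, and an assumed threshold value.

```python
# Reward for the localization agent: reciprocal of the distance error when the
# agent is within the threshold delta, negative distance error otherwise.
import math

def reward(observed, target, delta=3.0):
    """observed, target: (row, col) grid cells; delta: assumed threshold in cells."""
    dist = math.dist(observed, target)      # Euclidean distance error ||O_t - S_t||
    if 0 < dist <= delta:
        return 1.0 / dist                   # close to the target: larger reward when closer
    return -dist                            # far from the target: negative reward

print(reward((4, 5), (4, 7)))    # near the target -> positive reward (0.5)
print(reward((1, 1), (9, 12)))   # far from the target -> negative reward
```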
V. EXPERIMENTAL RESULTS

Here, we describe our evaluation on a real-world dataset. Our experiments were carried out on the first floor of the Western Michigan University Waldo Library. Fig. 4 shows the overall layout of the deployment site. In this paper, we use the iBeacon RSSI values to serve as the raw source of input data to identify indoor locations. Smartphones are also utilized to sense the iBeacons' signals and to compute the current position of the user with respect to the set of known iBeacons. Our model utilizes the semisupervised DRL algorithm to learn from the historical patterns of RSSI values and their corresponding estimated positions to improve its policy when identifying a position based on previously unseen RSSI values.

Fig. 4. Experimental setup with iBeacons.

A. Dataset

Our dataset is gathered from a real-world deployment of a grid of iBeacons in a campus library area of 200 ft × 180 ft. We mounted 13 iBeacons on the ceiling of the first floor of Waldo Library at Western Michigan University, which contains many pillars that might deteriorate the iBeacon signals. So we arranged the iBeacons such that we could get signal coverage by several iBeacons. Each iBeacon is separated by a distance of 30–40 ft from adjacent iBeacons. To capture the signal strength indicator of these iBeacons, we divided the area into small zones by mapping a grid that has cells of size 10 × 10 square ft. We also developed a specific mobile app to capture training data. For that purpose, we stood on each cell and captured all the iBeacons' received signals. We also manually assigned the location (i.e., the label of the cell) to the captured signals. We stored at least three instances of RSSIs for each cell to have a more reliable measurement and consequently to reduce the effect of noisy data. Overall, we collected 820 labeled data points for training, 600 data points for testing, and 5200 unlabeled data points for semisupervised learning.

B. Preprocessing

Our initial experiments with the raw RSSI values for supervised deep learning showed that the relationships between the features are not truly revealed by deep learning models. So we have enriched the features by adding two sets of features to the original features. We therefore have three feature sets as follows.
1) Raw: The original features that come from the direct RSSI readings.
2) S1: The set of features that represent the mutual differences of iBeacon RSSI values, i.e., $r_i - r_j$, $\forall i, j \in$ iBeacons and $i \ne j$, representing the difference between the RSSI value of beacon $i$ and beacon $j$.
3) S2: The other set of features designed to represent the categorical values of RSSIs in a Boolean membership mode such that for each beacon we define several categories by a specific interval (e.g., 10) and then represent each RSSI value with the category to which it belongs. A sketch of this encoding is given after this list.
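The following is a small illustrative sketch (not the authors' preprocessing code) of the S1 difference features and the S2 interval-membership encoding for one RSSI vector. The bin edges, the minimum RSSI value, and the sentinel for an unheard beacon are assumptions; only the counts (13 beacons, 12 categories per beacon, 169 total features) come from the paper.

```python
# Build the S1 (pairwise differences) and S2 (interval membership) feature sets
# for a single 13-beacon RSSI reading; 12 assumed bins of width 10 per beacon
# give the 13 + 156 = 169 features reported in the paper.
import numpy as np

N_BEACONS, N_BINS, BIN_WIDTH, RSSI_MIN = 13, 12, 10, -120   # assumed ranges

def s1_features(rssi):
    """S1: mutual differences r_i - r_j for all i != j (not used in the final model)."""
    return np.array([rssi[i] - rssi[j]
                     for i in range(N_BEACONS)
                     for j in range(N_BEACONS) if i != j])

def s2_features(rssi):
    """S2: Boolean membership of each RSSI value in one of N_BINS categories."""
    onehot = np.zeros((N_BEACONS, N_BINS))
    for i, r in enumerate(rssi):
        bin_idx = min(int((r - RSSI_MIN) // BIN_WIDTH), N_BINS - 1)
        onehot[i, max(bin_idx, 0)] = 1.0     # clamp out-of-range readings to bin 0
    return onehot.ravel()                    # 13 * 12 = 156 features

rssi = np.full(N_BEACONS, -200.0)            # assumed sentinel for "beacon not heard"
rssi[:3] = [-68.0, -75.0, -81.0]             # a few beacons in range (made-up values)
x = np.concatenate([rssi, s2_features(rssi)])  # raw + S2 = 169-D input, as in the paper
print(x.shape)
```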


TABLE II. Accuracy of Different Feature Sets in a Deep Neural Network.

Table II shows the average accuracy of the different feature sets during ten replications. These features are added to the raw features. As can be seen from the table, adding feature set S1 to the raw features has a minor effect on the average accuracy. On the other hand, adding feature set S2 increases the average accuracy, especially for finer grained positioning. Also, the combination of S1 and S2 is not as good as using only S2, since S1 lowers the accuracy. This observation points out that enriching a feature set by pairwise differences of RSSI values (S1) has a minor negative effect on the accuracy of the model since those features are not solid discriminative factors.

The table also demonstrates that using S2 features when the RSSI categorical interval is set to five leads to even better results. Therefore, based on these results we use the combination of raw features and S2. Using this preprocessing, each data point $x_i$ is represented as a vector of 13 RSSI values plus 156 range membership features (i.e., 12 ranges for the 13 beacons), resulting in a total of 169 features: $x_i = (r_1, \ldots, r_{169})$. Each $y_i$ is a label of (row, col) pointing to a specific location.

C. Evaluation

To implement our proposed semisupervised DRL model, we adopted the DRL algorithm in which we incorporated a variational autoencoder to generate more rewarding policies and consequently increase the accuracy of the localization process. The deep neural networks are implemented on Google TensorFlow [41] using the Keras package [42].

To evaluate the performance of the proposed semisupervised DRL model, we performed two sets of experiments: one in which the DRL framework uses a fully connected deep neural network for supervised learning, and the other in which the DRL framework uses a stacked variational autoencoder for semisupervised learning.

Fig. 5. Obtaining rewards and distances in six episodes with supervised and semisupervised DRL models.

Fig. 5 shows the performance of the DRL in terms of the received rewards as well as the distance to the true target for both supervised and semisupervised models in six episodes (see labels 1–6 on Fig. 5). In the plots, it can be seen that the agent in the semisupervised model learns to achieve higher rewards or smaller distances to the target compared to the supervised model.

Table III shows that the behavior of the semisupervised model leads to getting closer to the target points compared to just relying on a supervised model. It also indicates faster steps to reach or get close to the target in the same number of epochs. The differences of distances in this table emphasize that the semisupervised model generates policies that improve the average convergence speed of the localization system by a factor of at least 4.

TABLE III. Average Speed of Convergence to Destination Points.

In Figs. 6 and 7, the comparison of utilizing the semisupervised model versus the supervised model along a different number of epochs shows the efficacy of the semisupervised approach in handling the localization problem.


The results in Fig. 6 show that the semisupervised model reaches a higher reward faster compared to the supervised model while keeping its rewards trend stable. From this figure, it can be seen that the semisupervised model gains at least 67% more rewards compared to the supervised model. In addition, the semisupervised model achieves about twice the rewards of the supervised model. This result can be translated to the original measurement, where we want to know the effect of the models on the accuracy of the localization as depicted in Fig. 7. Fig. 7 shows the average distance to the target points for different numbers of epochs. Here, the semisupervised model achieves a 6%–23% improvement for localization. This result indicates that the unlabeled data helps the VAE to better identify the discriminative boundaries and consequently improves the accuracy of the semisupervised model.

Fig. 6. Average rewards obtained by DRL over different epoch counts using the supervised model versus the semisupervised model.

Fig. 7. Average distance to the target over different epoch counts using the supervised model versus the semisupervised model.

VI. CONCLUSION

We proposed a semisupervised DRL framework as a learning mechanism in support of smart IoT services. The proposed model uses a small set of labeled data along with a larger set of unlabeled ones. This paper is the first attempt that extends the semisupervised reinforcement learning approach using DRL. The proposed model consists of a deep variational autoencoder network that learns the best policies for taking optimal actions by the agent.

As a use case, we experimented with the proposed model in an indoor localization system. Our experimental results illustrate that the proposed semisupervised DRL model is able to generalize the positioning policy for configurations where the environment data is a mix of labeled and unlabeled data and achieve better results compared to using a set of only labeled data in a supervised model. The results show an improvement of 23% on the localization accuracy in the proposed semisupervised DRL model. Also, in terms of gaining rewards, the semisupervised model outperforms the supervised model by receiving at least 67% more rewards.

This paper shows that IoT applications in general, and smart city applications in particular, where context-awareness is a valuable asset, can benefit immensely from unlabeled data to improve the performance and accuracy of their learning agents. Furthermore, the semisupervised DRL is a good solution for many IoT applications since it requires little supervision by giving a rewarding feedback as it learns the best policy to choose among alternative actions.

ACKNOWLEDGMENT

The authors would like to thank Western Michigan University Libraries for providing the experimental testbed and space needed to conduct this research.

REFERENCES

[1] A. Al-Fuqaha, M. Guizani, M. Mohammadi, M. Aledhari, and M. Ayyash, "Internet of Things: A survey on enabling technologies, protocols, and applications," IEEE Commun. Surveys Tuts., vol. 17, no. 4, pp. 2347–2376, 4th Quart., 2015.
[2] N. E. Klepeis et al., "The national human activity pattern survey (NHAPS): A resource for assessing exposure to environmental pollutants," J. Expo. Anal. Environ. Epidemiol., vol. 11, no. 3, pp. 231–252, 2001.
[3] L. Mainetti, V. Mighali, and L. Patrono, "A location-aware architecture for heterogeneous building automation systems," in Proc. IFIP/IEEE Int. Symp. Integr. Netw. Manag. (IM), Ottawa, ON, Canada, 2015, pp. 1065–1070.
[4] S. Alletto et al., "An indoor location-aware system for an IoT-based smart museum," IEEE Internet Things J., vol. 3, no. 2, pp. 244–253, Apr. 2016.
[5] M. V. Moreno, J. L. Hernández, and A. F. Skarmeta, "A new location-aware authorization mechanism for indoor environments," in Proc. 28th Int. Conf. Adv. Inf. Netw. Appl. Workshops (WAINA), Victoria, BC, Canada, 2014, pp. 791–796.
[6] G. Sunkada, "System for and method of location aware marketing," U.S. Patent 12 851 968, Feb. 9, 2012.
[7] P. Dickinson, G. Cielniak, O. Szymanezyk, and M. Mannion, "Indoor positioning of shoppers using a network of Bluetooth low energy beacons," in Proc. Int. Conf. Indoor Position. Indoor Navig. (IPIN), Alcalá de Henares, Spain, 2016, pp. 1–8.
[8] J. Torres-Sospedra et al., "Enhancing integrated indoor/outdoor mobility in a smart campus," Int. J. Geograph. Inf. Sci., vol. 29, no. 11, pp. 1955–1968, 2015.


[9] F. Zafari, I. Papapanagiotou, and K. Christidis, "Microlocation for Internet-of-Things-equipped smart buildings," IEEE Internet Things J., vol. 3, no. 1, pp. 96–112, Feb. 2016.
[10] Y. LeCun, Y. Bengio, and G. Hinton, "Deep learning," Nature, vol. 521, no. 7553, pp. 436–444, 2015.
[11] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction, vol. 1. Cambridge, MA, USA: MIT Press, 1998.
[12] V. Mnih et al., "Playing Atari with deep reinforcement learning," arXiv preprint arXiv:1312.5602v1 [cs.LG], 2013.
[13] R. Faragher and R. Harle, "Location fingerprinting with Bluetooth low energy beacons," IEEE J. Sel. Areas Commun., vol. 33, no. 11, pp. 2418–2428, Nov. 2015.
[14] P. Christiano. (2016). Semi-Supervised Reinforcement Learning. [Online]. Available: https://medium.com/ai-control/semi-supervised-reinforcement-learning-cf7d5375197f
[15] D. Amodei et al., "Concrete problems in AI safety," arXiv preprint arXiv:1606.06565v2 [cs.AI], 2016.
[16] D. P. Kingma, S. Mohamed, D. J. Rezende, and M. Welling, "Semi-supervised learning with deep generative models," in Proc. Adv. Neural Inf. Process. Syst., Montreal, QC, Canada, 2014, pp. 3581–3589.
[17] S. Nemati, M. M. Ghassemi, and G. D. Clifford, "Optimal medication dosing from suboptimal clinical examples: A deep reinforcement learning approach," in Proc. IEEE 38th Annu. Int. Conf. Eng. Med. Biol. Soc. (EMBC), Orlando, FL, USA, 2016, pp. 2978–2981.
[18] D. Zhao, Y. Chen, and L. Lv, "Deep reinforcement learning with visual attention for vehicle classification," IEEE Trans. Cogn. Develop. Syst., to be published.
[19] J. C. Caicedo and S. Lazebnik, "Active object localization with deep reinforcement learning," in Proc. IEEE Int. Conf. Comput. Vis., Santiago, Chile, 2015, pp. 2488–2496.
[20] Y. Zhu et al., "Target-driven visual navigation in indoor scenes using deep reinforcement learning," arXiv preprint arXiv:1609.05143v1 [cs.CV], 2016.
[21] L. Li, Y. Lv, and F.-Y. Wang, "Traffic signal timing via deep reinforcement learning," IEEE/CAA J. Autom. Sinica, vol. 3, no. 3, pp. 247–254, Jul. 2016.
[22] H. Mao, M. Alizadeh, I. Menache, and S. Kandula, "Resource management with deep reinforcement learning," in Proc. 15th ACM Workshop Hot Topics Netw., Atlanta, GA, USA, 2016, pp. 50–56.
[23] K. Narasimhan, T. Kulkarni, and R. Barzilay, "Language understanding for text-based games using deep reinforcement learning," arXiv preprint arXiv:1506.08941v2 [cs.CL], 2015.
[24] J. He et al., "Deep reinforcement learning with a natural language action space," arXiv preprint arXiv:1511.04636v5 [cs.AI], 2015.
[25] V. François-Lavet, D. Taralla, D. Ernst, and R. Fonteneau, "Deep reinforcement learning solutions for energy microgrids management," in Proc. Eur. Workshop Reinforcement Learn., 2016, pp. 1–7.
[26] B. Wang, Q. Chen, L. T. Yang, and H.-C. Chao, "Indoor smartphone localization via fingerprint crowdsourcing: Challenges and approaches," IEEE Wireless Commun., vol. 23, no. 3, pp. 82–89, Jun. 2016.
[27] W. Zhang, K. Liu, W. Zhang, Y. Zhang, and J. Gu, "Deep neural networks for wireless localization in indoor and outdoor environments," Neurocomputing, vol. 194, pp. 279–287, Jun. 2016.
[28] S. Kajioka, T. Mori, T. Uchiya, I. Takumi, and H. Matsuo, "Experiment of indoor position presumption based on RSSI of Bluetooth LE beacon," in Proc. IEEE 3rd Glob. Conf. Consum. Electron. (GCCE), Tokyo, Japan, 2014, pp. 337–339.
[29] Y. Wang, X. Yang, Y. Zhao, Y. Liu, and L. Cuthbert, "Bluetooth positioning using RSSI and triangulation methods," in Proc. IEEE 10th Consum. Commun. Netw. Conf. (CCNC), Las Vegas, NV, USA, 2013, pp. 837–842.
[30] X. Wang, L. Gao, S. Mao, and S. Pandey, "DeepFi: Deep learning for indoor fingerprinting using channel state information," in Proc. IEEE Wireless Commun. Netw. Conf. (WCNC), New Orleans, LA, USA, 2015, pp. 1666–1671.
[31] Y. Gu, Y. Chen, J. Liu, and X. Jiang, "Semi-supervised deep extreme learning machine for Wi-Fi based localization," Neurocomputing, vol. 166, pp. 282–293, Oct. 2015.
[32] G. Ding, Z. Tan, J. Zhang, and L. Zhang, "Fingerprinting localization based on affinity propagation clustering and artificial neural networks," in Proc. IEEE Wireless Commun. Netw. Conf. (WCNC), Shanghai, China, 2013, pp. 2317–2322.
[33] J. Luo and H. Gao, "Deep belief networks for fingerprinting indoor localization using ultrawideband technology," Int. J. Distrib. Sensor Netw., vol. 2016, pp. 1–8, Jan. 2016.
[34] X. Zhang, J. Wang, Q. Gao, X. Ma, and H. Wang, "Device-free wireless localization and activity recognition with deep learning," in Proc. IEEE Int. Conf. Pervasive Comput. Commun. Workshops (PerCom Workshops), Sydney, NSW, Australia, 2016, pp. 1–5.
[35] T. Pulkkinen, T. Roos, and P. Myllymäki, "Semi-supervised learning for WLAN positioning," in Proc. Int. Conf. Artif. Neural Netw., Espoo, Finland, 2011, pp. 355–362.
[36] J. Weston, F. Ratle, H. Mobahi, and R. Collobert, "Deep learning via semi-supervised embedding," in Neural Networks: Tricks of the Trade. Heidelberg, Germany: Springer, 2012, pp. 639–655.
[37] O. Chapelle, B. Schölkopf, and A. Zien, Semi-Supervised Learning. Cambridge, MA, USA: MIT Press, 2006.
[38] J. Walker, C. Doersch, A. Gupta, and M. Hebert, "An uncertain future: Forecasting from static images using variational autoencoders," in Proc. Eur. Conf. Comput. Vis., Amsterdam, The Netherlands, 2016, pp. 835–851.
[39] V. Mnih et al., "Human-level control through deep reinforcement learning," Nature, vol. 518, no. 7540, pp. 529–533, 2015.
[40] L. Mainetti, L. Patrono, and I. Sergi, "A survey on indoor positioning systems," in Proc. 22nd Int. Conf. Softw. Telecommun. Comput. Netw. (SoftCOM), Split, Croatia, 2014, pp. 111–120.
[41] M. Abadi et al., "TensorFlow: Large-scale machine learning on heterogeneous distributed systems," arXiv:1603.04467v2 [cs.DC], 2016.
[42] F. Chollet. (2015). Keras: Deep Learning Library for Theano and TensorFlow. [Online]. Available: https://keras.io/

Mehdi Mohammadi (GS'14) received the B.S. degree in computer engineering from Kharazmi University, Tehran, Iran, in 2003 and the M.S. degree in computer engineering (software) from Sheikhbahaee University, Isfahan, Iran, in 2010. He is currently pursuing the Ph.D. degree at the Department of Computer Science, Western Michigan University (WMU), Kalamazoo, MI, USA. His current research interests include Internet of Things, cloud computing, machine learning, and natural language processing. Mr. Mohammadi has been a recipient of the Graduate Doctoral Assistantship from the WMU Libraries Information Technology since 2013 and six travel grants from the National Science Foundation. He served as a Reviewer for several journals including IEEE Communications Magazine, IEEE Communications Letters, Wiley's Security and Wireless Communication Networks journal, and Wiley's Wireless Communications and Mobile Computing journal.

Ala Al-Fuqaha (S'00–M'04–SM'09) received the M.S. degree from the University of Missouri–Columbia, Columbia, MO, USA, in 1999, and the Ph.D. degree in electrical and computer engineering from the University of Missouri–Kansas City, Kansas City, MO, USA, in 2004. He is currently a Professor and the Director with the NEST Research Laboratory, Computer Science Department, Western Michigan University, Kalamazoo, MI, USA. Dr. Al-Fuqaha served as the Principal Investigator or the Co-Principal Investigator on multiple research projects funded by the NSF, Qatar Foundation, Cisco, Boeing, AVL, Stryker, Wolverine, Traumasoft, and Western Michigan University. His research grant activities and collaborative efforts have culminated in $2.8 million funding. Chief among these research activities is an international collaborative effort with Qatar University and Purdue University to study the interplay between safety, security, and performance in vehicular networks. Another chief research activity is an NSF funded collaborative effort with the City University of New York and the University of Nebraska Lincoln on bio-socially inspired techniques for enhanced spectrum access in cognitive radio networks. He is currently serving on the Editorial Board for Wiley's Security and Communication Networks journal, Wiley's Wireless Communications and Mobile Computing Journal, Industrial Networks and Intelligent Systems journal, the International Journal of Computing and Digital Systems, and the International Journal of Multimedia. He has served as a Technical Program Committee member and a Reviewer of many international conferences and journals.


Mohsen Guizani (S'85–M'89–SM'99–F'09) received the B.S. (with distinction) and M.S. degrees in electrical engineering and the M.S. and Ph.D. degrees in computer engineering from Syracuse University, Syracuse, NY, USA, in 1984, 1986, 1987, and 1990, respectively. He is currently a Professor and the ECE Department Chair with the University of Idaho, Moscow, ID, USA. Previously, he served as the Associate Vice President of Graduate Studies, Qatar University, Chair of the Computer Science Department, Western Michigan University, and Chair of the Computer Science Department, University of West Florida. He also served in academic positions with the University of Missouri–Kansas City, University of Colorado–Boulder, Syracuse University, and Kuwait University. He is the author of 9 books and more than 450 publications in refereed journals and conferences. His research interests include wireless communications and mobile computing, computer networks, mobile cloud computing, security, and smart grids. Dr. Guizani currently serves on the Editorial Boards of several international technical journals and is the cofounder and the Editor-in-Chief of the Wireless Communications and Mobile Computing journal (Wiley). He has guest edited a number of special issues in IEEE journals and magazines. He also served as a member, Chair, and General Chair of a number of international conferences. He was the Chair of the IEEE Communications Society Wireless Technical Committee and the Chair of the TAOS Technical Committee. He served as the IEEE Computer Society Distinguished Speaker from 2003 to 2005. He is a Fellow of IEEE and a Senior Member of ACM. He was the recipient of the Teaching Award multiple times from different institutions, as well as the Best Research Award from three institutions.

Jun-Seok Oh received the Ph.D. degree in civil engineering from the University of California at Irvine, Irvine, CA, USA, in 2001. He is currently a Professor with the Department of Civil and Construction Engineering and the Director of the Transportation Research Center for Livable Communities, Western Michigan University, Kalamazoo, MI, USA. His current research interests include optimization theories, techniques to advanced transportation systems, dynamic traffic-simulation models, dynamic route-guidance systems, adaptive traffic control, nonmotorized transportation systems, and advanced traffic-monitoring systems.
