Real-time Probabilistic approach for traffic prediction on IoT data streams

Sanket Mishra, Raghunthan Balan, Ankit Shibu, and Chittaranjan Hota

Department of Computer Science & Information Systems,


BITS Pilani Hyderabad Campus, Hyderabad, Telangana, India
{p20150408,f20170703,f20170297,hota}@hyderabad.bits-pilani.ac.in

Abstract. IoT data analytics refers to the analysis of voluminous data captured by
connected devices. These devices interact with the environment and capture details
that are streamed to a central repository, where the data is processed. The collected
data may be heterogeneous in nature: research has identified weather data, social data
and pollution data as key players in traffic prediction in smart cities, which makes
the analytics challenging. IoT data is also prone to unreliability because the
connectivity medium is, in most cases, wireless. This may lead to packet loss and
network disconnections and thus to missing values and noise in the data. In this work,
we propose UNIR, an event-driven framework for analyzing heterogeneous IoT data
streams. In the first step, we ingest data from Twitter, weather and traffic APIs and
persist it in a data store. This data is then preprocessed and analyzed by deep models
to forecast future data points. In the second step, a supervised Hidden Markov Model
(HMM) consumes the sequence of predicted data points from the first layer. The HMM is
trained using ground truth labels obtained from the TomTom API to produce a likelihood
value for a congestion event. The likelihoods of congestion and non-congestion
sequences are learned by a logistic regression, which assigns a confidence to an event
occurrence. Experiments show that the proposed approach achieves an accuracy of over
77%, higher than that of the baseline approach.

Keywords: Deep learning, Long Short Term Memory networks, Complex Event
Processing, Hidden Markov Models, Probabilistic approach

1 Introduction

The Internet of Things (IoT) depicts a world of interconnected “things” and has led to
a significant rise in the number of devices connected to the Internet. Presently,
sensors range from generic sensing devices attached to embedded IoT boards to the
built-in sensors of smartphones. The availability of low-cost hardware and sensing
devices, together with a surge in Internet connectivity, has propelled the development
of complex IoT applications. In Intelligent Transportation System (ITS) scenarios, data
is large scale and poses a big data problem. ITS applications gather large volumes of
data in order to infer patterns from these data streams. Depending upon the underlying
application, complex patterns can be transformed into complex events using a Complex
Event Processing (CEP) engine. This research area involves data processing, fusion and
generation of complex events from various univariate data streams for real-time
congestion prediction in ITS scenarios. In ITS research, however, a very important
aspect is the prediction of congestion from varied heterogeneous sources, as “context”
plays a crucial role. For example, congestion on weekday mornings differs from
congestion on weekend mornings. Rain or fog is also a component in prediction, as rain
reduces visibility and thus leads to a lower density of vehicles on the road, which
impacts congestion.
Wang et al. [9] propose a probabilistic CEP on streaming data to handle the
uncertainty caused by noise, data loss or sensor error. The work illustrated a
hierarchical and parallel complex event processing approach on data streams and
extended the Non-deterministic Finite Automata (NFA) approach to handle probabilistic
event processing in standalone and distributed modes. Akbar et al. [1] develop a
predictive CEP that uses Adaptive Moving Window Regression (AMWR) to predict simple
events. The regression approaches ingest data using an adaptive window whose size
increases or decreases, based on the Mean Absolute Percentage Error (MAPE), to contain
the error within 5%-20%. The predicted data points were sent to the CEP engine, which
fuses them to form complex “congestion” events well before their occurrence. However,
in the presence of noise and non-stationarity in the data, the performance of the
model is poor. Mishra et al. [7] identified this shortcoming and proposed an
end-to-end solution that uses an LSTM approach to forecast future points and merges
them in a CEP to generate complex “congestion” events. The authors [1] improve their
work [6] and formulate advanced scenarios for congestion prediction with the inclusion
of social data from Twitter. The Twitter data was used to formulate Large Crowd
Concentration (LCC) events, which signify a large density of people in certain
locations. The authors analyzed the impact of such social data on congestion
prediction and enhanced it [2] using Complex Event Processing to create probabilistic
events concerning congestion prediction. This paper is an improvement over [2], which
employs a Bayesian Belief Network (BBN) to identify causal relationships between
traffic data, weather data, temporal data and social data, and estimates their
conditional probabilities by counting the number of instances of a particular
attribute under congestion and dividing that by the total count.
Our major contributions in this work are as follows:

– Our significant contribution in this work is the formulation of Unir¹, a modular
probabilistic framework that addresses uncertainty in the detection of complex
events in CEP with the help of sensor data fusion. We also combine an HMM with
Logistic Regression (HMM-LR) to identify the probability of occurrence of a complex
event by analyzing sequences.
– We validate the effectiveness of the proposed approach on a real-world ITS use case
against a Bayesian Belief Network approach proposed in an earlier work.

¹ ‘Unir’ is a Spanish word which means merge or join.

1.1 Dataset Description


The data is taken from the city of Madrid from September 2019 to November 2019².
The traffic data is acquired through a REST (REpresentational State Transfer) API
(Application Programming Interface)³, the weather data of all regions from an API web
service⁴, and Twitter data⁵ is used to aggregate the count of tweets coming from the
considered locations. The data is collected at an interval of five minutes. The
traffic attributes and their respective descriptions are outlined in Table 1.

Table 1. Traffic Dataset

Attribute     Description
ID            Represents the identification number for various streets in Madrid.
timestamp     Represents the temporal parameters (time and date) for the data.
intensity     Signifies the total number of vehicles on the road at a particular location.
velocity      Represents the mean velocity of vehicles on the road at a particular time instant.
weather       Presents the weather conditions of the particular location in terms of clear/sunny, rain, fog, humidity and visibility.
tweet count   Mentions the count of tweets coming from a particular region, signifying the density of people in that region.

The dataset comprises 24000 data points, of which 17000 were used for training and
7000 for testing the model. The traffic, weather and social data are collected and
stored in a time series database. The traffic data is captured from the REST API every
5 minutes. Tweets, which arrive every second, are collected through the Twitter API
and aggregated every 5 minutes to create the “tweet count” variable. The weather data
was resampled from 15-minute to 5-minute intervals to be in tandem with the traffic
data, and was grouped into “cloudy”, “rainy” and “clear” classes. The data is initially
segregated on the basis of timestamps to represent different “contexts” of the day: it
is first grouped into weekday and weekend and then, on the basis of time, divided into
morning, afternoon, evening and late evening.
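The alignment and context-labelling steps described above can be sketched as follows. This is only an illustration using the Python standard library: the function names are hypothetical, the hour boundaries for morning, afternoon, evening and late evening are assumptions (the paper does not state them), and repeating each 15-minute weather reading over three 5-minute slots is one plausible way to bring the weather data onto the traffic grid.

```python
from collections import Counter
from datetime import datetime, timedelta

STEP = timedelta(minutes=5)

def floor_5min(ts):
    """Round a timestamp down to the start of its 5-minute bucket."""
    return ts.replace(minute=ts.minute - ts.minute % 5, second=0, microsecond=0)

def tweet_counts(tweet_times):
    """Aggregate individual tweet timestamps into 5-minute counts."""
    return Counter(floor_5min(t) for t in tweet_times)

def weather_to_5min(readings):
    """Repeat each 15-minute weather reading over its three 5-minute slots."""
    out = {}
    for ts, label in readings:
        for k in range(3):
            out[floor_5min(ts) + k * STEP] = label
    return out

def context(ts):
    """Label a timestamp as e.g. 'mwd' (morning, weekday).

    The hour cut-offs below are assumed for illustration only.
    """
    week = "we" if ts.weekday() >= 5 else "wd"
    if 6 <= ts.hour < 12:
        part = "m"
    elif 12 <= ts.hour < 17:
        part = "a"
    elif 17 <= ts.hour < 21:
        part = "e"
    else:
        part = "le"
    return part + week
```

The labels produced by `context` follow the legend used in Fig. 1 (m/a/e/le combined with wd/we).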

1.2 Data preprocessing


Fig. 1 presents the initial analysis⁶ of the data, carried out to understand its
characteristics. The traffic intensity and tweet counts are highest during the
afternoon and lowest in the late evening on weekdays, and are comparatively lower on
weekends. However, the velocity is lowest during the evening on weekdays and highest
on weekends, when intensity is low.
² The work has also been executed on four other traffic locations not depicted in this paper.
Experimental results signify the better performance of the proposed approach against baseline
approaches over other locations too.
³ http://informo.munimadrid.es/informo/tmadrid/pm.xml
⁴ https://api.darksky.net/forecast/
⁵ https://api.twitter.com/1.1/search/tweets.json
⁶ Legend: wd/we = weekday/weekend, m/a/e/le = morning/afternoon/evening/late evening.

[Figure 1: three panels plotting (a) Mean Intensity, (b) Mean Velocity and (c) Mean Tweets against Times of Week (mwd, mwe, awd, awe, ewd, ewe, lewd, lewe).]

Fig. 1. a) Intensity in the morning on weekdays and weekends b) Velocity in the morning on
weekdays and weekends c) Tweet count in the morning on weekdays and weekends

[Figure 2: (a) correlation matrix over intensity, velocity, tweet_count, weekend, weekday, morning, afternoon, evening, late evening, clear, rainy and fog; (b) Intensity against Time, annotated “LOESS fit neighbours=10”.]

Fig. 2. a) Pearson correlation coefficient b) loess smoothed intensity data

Fig. 2(a) shows the Pearson correlation coefficients used to identify linear
relationships amongst the features used in this work. The correlation coefficient
varies between -1 and +1, where +1 signifies a perfect positive linear correlation, 0
depicts no linear correlation and -1 represents a perfect negative linear correlation.
From Fig. 2(a), we conclude that intensity and tweet count are positively correlated,
while intensity and velocity are negatively correlated. Weekday and weekend, as well as
fog and clear, are also negatively correlated. In real-world data, noise occurs
randomly due to sensor faults or data loss, which can be detrimental to a forecasting
model. The sensors that continuously capture the traffic parameters may give erroneous
values owing to a dead battery or physical damage to the IoT sensor, which can cause
missing values in the data. The mice library is used to impute the missing values in
the given dataset. To remove noise, we apply locally estimated scatterplot smoothing
(loess) to the raw IoT data. The smoothing mechanism eliminates noise and
irregularities in the raw data and thus yields better-quality continuous data. Fig.
2(b) illustrates the smoothed data⁷ for 10 neighbors on the intensity attribute. This
improves the performance of the regression approaches, as it enables them to
generalize to unseen data. The smoothing mechanism is applied to historical data only,
and the raw data is used for testing. We use Min-Max normalization for feature scaling
in this work.
In the next section, we describe the various modules of UNIR and outline their
functionalities.

⁷ To find the optimal number of neighbors, the loess algorithm was run with 5, 10, 15 and 20
neighbors over the raw data for intensity, velocity and tweet count; these results are not
depicted in this paper.
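The smoothing and scaling steps can be illustrated with a short sketch. It is a simplified stand-in rather than the implementation used in the paper: `loess_smooth` fits a local straight line with tricube weights over the `k` nearest points (without the robustness iterations of full loess), and `minmax_fit` / `minmax_apply` are hypothetical helper names. Fitting the scaler on the training portion only is an assumption made here.

```python
import numpy as np

def loess_smooth(y, k=10):
    """Local linear smoothing with tricube weights over the k nearest points."""
    y = np.asarray(y, dtype=float)
    x = np.arange(len(y), dtype=float)
    smoothed = np.empty_like(y)
    for i in range(len(y)):
        dist = np.abs(x - x[i])
        idx = np.argsort(dist)[:k]            # indices of the k nearest neighbours
        h = dist[idx].max()                   # local bandwidth
        w = (1 - (dist[idx] / h) ** 3) ** 3   # tricube weights in [0, 1]
        sw = np.sqrt(w)
        # Weighted least squares for a line a + b * x around point i.
        A = np.column_stack([np.ones(len(idx)), x[idx]])
        coef, *_ = np.linalg.lstsq(A * sw[:, None], y[idx] * sw, rcond=None)
        smoothed[i] = coef[0] + coef[1] * x[i]
    return smoothed

def minmax_fit(train):
    """Return per-feature minimum and range, computed on the training data."""
    lo = train.min(axis=0)
    span = train.max(axis=0) - lo
    return lo, np.where(span > 0, span, 1.0)  # avoid division by zero

def minmax_apply(data, lo, span):
    """Scale data with previously fitted minimum and range."""
    return (data - lo) / span
```

Because the fit is locally linear, a series that is already a straight line is returned unchanged, while isolated spikes are pulled towards their neighbours.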

2 UNIR : Deep Fusion Statistical Framework

[Figure 3: block diagram of the framework. Labelled components: Weather data, Twitter Stream and Traffic data (publish); Messaging Broker (Apache Kafka) (subscribe); Deep Regression Approaches (LSTM); predicted data; a second Messaging Broker (Apache Kafka); CEP; predicted sequence; HMM1, congestion (LL1); HMM0, no-congestion (LL0); Logistic Regression, prediction confidence (p); probabilistic congestion event as event-probability pairs (et, pt); Output (End user).]

Fig. 3. Proposed Framework

Fig. 3 depicts our proposed framework, UNIR, which is implemented for probabilistic
event generation using CEP. First, data acquisition, preprocessing and ingestion are
handled by Kafka, and the data is subsequently sent to the CEP engine. Second, the
framework applies the probabilistic inference mechanism to simple events and
predictions to formulate probabilistic events. In the following sections, we elaborate
on the working mechanism of UNIR.

2.1 Data Acquisition

A Kafka producer collects the data from the time series database and publishes it onto
a topic. Data collection and pushing to the CEP are done simultaneously: the Kafka
broker is deployed on one thread while the CEP executes concurrently on another. The
topic is subscribed to by the Kafka consumer attached to the regression approach, so
that it can fetch the data streams and send them to the CEP for event processing. This
concurrent execution on different threads avoids starvation of the CEP engine, which
could otherwise hamper event generation and ML predictions.
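The publish/subscribe pattern with two concurrent threads can be sketched without a running broker. In the snippet below, a standard-library `queue.Queue` stands in for the Kafka topic and a plain list stands in for the CEP input; both substitutions, and all function names, are assumptions made for illustration and not the deployment described in the paper.

```python
import queue
import threading

def producer(records, topic):
    """Publish each record to the topic, then signal the end of the stream."""
    for rec in records:
        topic.put(rec)
    topic.put(None)  # sentinel: no more data

def consumer(topic, sink):
    """Fetch records from the topic and forward them to the sink (CEP stand-in)."""
    while True:
        rec = topic.get()
        if rec is None:
            break
        sink.append(rec)

def run_pipeline(records):
    """Run producer and consumer on separate threads and return what was consumed."""
    topic = queue.Queue()
    sink = []
    t_cons = threading.Thread(target=consumer, args=(topic, sink))
    t_prod = threading.Thread(target=producer, args=(records, topic))
    t_cons.start()
    t_prod.start()
    t_prod.join()
    t_cons.join()
    return sink
```

Since the consumer blocks on `topic.get()` instead of polling, it processes each record as soon as it is published, which is the behaviour the threaded design aims for.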

2.2 Predictive Analytics

The proposed solution, UNIR, is a practical and feasible framework for handling
prediction in real-world IoT applications. It consists of modules that can be easily
integrated to form an event processing framework. In this work, we depart from the de
facto standard of CEP engines and propose a probabilistic CEP that can not only handle
heterogeneity in IoT streams but also predict events with greater accuracy.

[Figure 4: flow diagram with the labels Xi, LSTM, FCN, Xi+1, HMM0, HMM0, LL0, LL1, LR and CCi+1. Legend: FCN: Fully Connected Network; LR: Logistic Regression; HMM: Hidden Markov Model; LL: log likelihood; CC: congestion confidence.]

Fig. 4. Proposed approach

In the proposed methodology, our predictive analytics module has three parts. Fig. 4
depicts the proposed model. In the first part, we employ a Kafka consumer that fetches
the aggregated data from the databases and feeds it to the regression models. We
implemented a dense LSTM regression for predicting future data points of the traffic
attributes. The LSTM model is trained on weather data, time of day and time of week to
forecast the values of intensity, velocity and tweet count. Neural networks have the
intrinsic ability to identify patterns or relationships between various attributes in
non-linear data. Most models perform poorly on real-world datasets and fail to identify
patterns in the data. The context in which a prediction takes place may change over
time, which leads to a phenomenon called “concept drift”. As the models predict on IoT
streams, we handle concept drift using a heuristic proposed in our earlier work [7]. As
the window moves, the model is trained on the new data instances. The performance of
the LSTM is validated using a standard evaluation metric, the Mean Absolute Percentage
Error (MAPE), and the model is retrained based on the magnitude of the error. Once the
model generates the predictions, they are ingested via a Kafka consumer into the CEP,
which fuses these data instances using EPL queries to predict a complex event well
before its occurrence. However, because of the uncertainty and heterogeneity in
real-world datasets [2], it is necessary to infer which sequences depict “congestion”
patterns.
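An error-triggered retraining check of the kind described above might look like the following sketch. The details of the heuristic in [7] are not reproduced here: the class name, the window of 12 recent predictions and the 20% threshold are all hypothetical choices, and observed values are assumed to be non-zero.

```python
from collections import deque

class DriftMonitor:
    """Track MAPE over a sliding window and flag when retraining is due."""

    def __init__(self, window=12, threshold=0.20):
        self.errors = deque(maxlen=window)  # most recent percentage errors
        self.threshold = threshold          # assumed retraining trigger

    def mape(self):
        """Mean absolute percentage error over the current window."""
        return sum(self.errors) / len(self.errors)

    def update(self, observed, predicted):
        """Record one prediction; return True if the model should be retrained."""
        self.errors.append(abs(observed - predicted) / abs(observed))
        return self.mape() > self.threshold
```

In a streaming loop, `update` would be called once the true value for a forecast arrives, and a `True` result would trigger retraining on the most recent window of data.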
The second part of our framework comprises a supervised HMM. The probabilistic HMM
approach consists of two different HMMs, S1 and S2, that are fed “no congestion” and
“congestion” sequences respectively. These sequences are the predicted data points of
intensity, velocity and tweet count obtained as output from the regression models. The
HMMs use a Gaussian emission function⁸ to determine the probability of an observation
in the output sequence with respect to a particular hidden state⁹. The Gaussian
emission function processes the overall output sequence as observed from the hidden
state sequence of the HMM. We use the TomTom API¹⁰ to fetch the true “congestion”
events in the city of Madrid. This is a necessary step for identifying the true
congestion events, which were not considered in the baseline approach [2]. Our
objective in this work is, given an input sequence, to ascertain whether it belongs to
a congestion event or a no-congestion event. To this end, we train S1 and S2 with
input sequences corresponding to no-congestion and congestion events respectively
(obtained from the TomTom API). As the HMMs are trained on these input sequences, they
are able to infer the probabilistic distribution of the sequences relevant to the
congestion and no-congestion scenarios. When a new input sequence X arrives, S1 scores
it under the no-congestion model, $P(X \mid \text{no congestion})$, and S2 scores it
under the congestion model, $P(X \mid \text{congestion})$. Thus, S1 and S2 give the
“output confidence” values C1 and C2 for no-congestion and congestion events,
respectively. The values obtained from S1 and S2 are not probabilities but
log-likelihood values from models fitted using Maximum Likelihood Estimation. In this
work, our proposition is the creation of a probabilistic CEP, that is, each complex
event generated will occur with a certain degree of probability. To obtain this
probability from the log-likelihood values, we implement the final phase of the
predictive analytics component.
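To make the scoring step concrete, the sketch below computes the log likelihood of a sequence under a Gaussian-emission HMM with the forward algorithm in log space. It covers scoring only, not training, and every parameter value (two hidden states, diagonal covariances, the means of S1 and S2, features scaled to [0, 1]) is hypothetical and chosen purely for illustration.

```python
import numpy as np

def log_gauss(x, means, variances):
    """Log density of x under each state's diagonal Gaussian; shape (k,)."""
    return -0.5 * np.sum(np.log(2 * np.pi * variances) + (x - means) ** 2 / variances, axis=1)

def logsumexp(a, axis):
    """Numerically stable log(sum(exp(a))) along one axis."""
    m = np.max(a, axis=axis, keepdims=True)
    return np.squeeze(m, axis=axis) + np.log(np.sum(np.exp(a - m), axis=axis))

def hmm_loglik(X, start, trans, means, variances):
    """Forward algorithm in log space: returns log P(X | model)."""
    log_alpha = np.log(start) + log_gauss(X[0], means, variances)
    log_trans = np.log(trans)
    for x in X[1:]:
        # Sum over previous states i for every next state j, then emit x.
        log_alpha = logsumexp(log_alpha[:, None] + log_trans, axis=0) + log_gauss(x, means, variances)
    return float(logsumexp(log_alpha, axis=0))

# Hypothetical models over (intensity, velocity, tweet count).
shared = dict(start=np.array([0.5, 0.5]),
              trans=np.array([[0.8, 0.2], [0.3, 0.7]]),
              variances=np.full((2, 3), 0.01))
S1 = dict(shared, means=np.array([[0.2, 0.8, 0.2], [0.3, 0.7, 0.3]]))  # no congestion
S2 = dict(shared, means=np.array([[0.8, 0.2, 0.8], [0.7, 0.3, 0.7]]))  # congestion

X = np.array([[0.75, 0.25, 0.80]] * 5)  # one sequence with l = 5, d = 3
C1 = hmm_loglik(X, **S1)
C2 = hmm_loglik(X, **S2)
```

For this high-intensity, low-velocity sequence, `C2` exceeds `C1`, so the pair (C1, C2) carries the evidence that the next stage converts into a probability.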
To obtain probabilities from the log-likelihood values, we employ a Logistic
Regression approach. Consider a dataset D with input sequences
$X_1, X_2, X_3, \ldots, X_n$, where each $X_i$ is an $l \times d$ matrix, $l$ is the
sequence length of the HMM and $d$ is the number of features in the input. We obtain
the corresponding confidence pairs
$\langle C_{11}, C_{12} \rangle, \langle C_{21}, C_{22} \rangle, \ldots, \langle C_{n1}, C_{n2} \rangle$
for the input sequences $X_1, X_2, X_3, \ldots, X_n$. We use the ground truth labels
$g_1, g_2, g_3, \ldots, g_n$, where $g_i = 0$ and $g_i = 1$ represent no congestion and
congestion, respectively. The logistic regression is trained on the confidence pairs
with the ground truth label as the target. Using this methodology, we can obtain p, the
probability output for a new input sequence, from its corresponding confidence pair.
The output of the logistic regression represents the confidence of a congestion event.
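The mapping from confidence pairs to a probability can be illustrated with a small logistic regression trained by gradient descent. This is a sketch under stated assumptions, not the paper's code: the confidence pairs and labels below are invented, the features are standardized before training, and the learning rate and number of epochs are arbitrary.

```python
import numpy as np

def fit_logreg(C, g, lr=0.1, epochs=500):
    """Fit logistic regression on confidence pairs C (n x 2) with labels g."""
    mu, sd = C.mean(axis=0), C.std(axis=0) + 1e-12
    Z = (C - mu) / sd                       # standardize the log likelihoods
    w, b = np.zeros(Z.shape[1]), 0.0
    for _ in range(epochs):
        p = 1 / (1 + np.exp(-(Z @ w + b)))  # current predicted probabilities
        w -= lr * Z.T @ (p - g) / len(g)    # gradient of the mean log loss
        b -= lr * np.mean(p - g)
    return w, b, mu, sd

def predict_proba(c_pair, model):
    """Probability of congestion for one confidence pair <C_i1, C_i2>."""
    w, b, mu, sd = model
    z = (np.asarray(c_pair, dtype=float) - mu) / sd
    return float(1 / (1 + np.exp(-(z @ w + b))))

# Hypothetical pairs: column 0 from S1 (no congestion), column 1 from S2 (congestion).
C = np.array([[-20.0, -40.0], [-22.0, -41.0], [-19.0, -38.0],
              [-40.0, -20.0], [-41.0, -23.0], [-39.0, -19.0]])
g = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])
model = fit_logreg(C, g)
p_congested = predict_proba([-38.0, -22.0], model)
p_free = predict_proba([-21.0, -39.0], model)
```

A pair whose congestion log likelihood is clearly higher yields `p_congested` above 0.5, while the opposite pair yields `p_free` below 0.5.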

⁸ The approach was also tested with a Gaussian Mixture Model (GMM) emission function; the
Gaussian emission function performed better than the GMM emission function.
⁹ The number of states in the HMM is determined using the Akaike Information Criterion and
the Bayesian Information Criterion; these results are not depicted in this work.
¹⁰ https://developer.tomtom.com/traffic-api

The next section outlines the various evaluation criteria that have been considered
in this work for verifying the efficiency of the proposed approach.

2.3 Evaluation Metrics

The LSTM regression approach consumes the data streams and is trained on traffic
features, climate data, social data and temporal data. The LSTM forecasts the traffic
attributes (intensity and velocity) and the social data, that is, tweet counts for a
particular region. To verify the performance of the forecasting task, we use the Mean
Absolute Percentage Error (MAPE) metric¹¹. It is defined as follows:
$$\mathrm{MAPE} = \frac{1}{n} \sum_{i=1}^{n} \frac{|y_i - y_i'|}{y_i} \qquad (1)$$

where $y_i$ is the observed value and $y_i'$ is the predicted value.


For validating the performance of the proposed HMM-LR approach, we considered
the “accuracy” metric¹².

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + FP + TN + FN} \qquad (2)$$

where TP, FP, TN and FN represent True Positive, False Positive, True Negative
and False Negative counts, respectively.
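Both metrics translate directly into code. The sketch below follows Eq. (1) and Eq. (2) as written; the function names are chosen here for illustration, and observed values are assumed to be positive so that the division in MAPE is well defined.

```python
def mape(observed, predicted):
    """Mean absolute percentage error as in Eq. (1), returned as a fraction."""
    return sum(abs(y - p) / y for y, p in zip(observed, predicted)) / len(observed)

def accuracy(tp, fp, tn, fn):
    """Classification accuracy as in Eq. (2)."""
    return (tp + tn) / (tp + fp + tn + fn)
```

For example, observed values of 100 and 200 with predictions of 110 and 180 are each off by 10%, so the MAPE is 0.1; with 3 true positives, 1 false positive, 4 true negatives and 2 false negatives, the accuracy is 7/10.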
The next section exhibits various experiments conducted to identify the optimal
parameters for LSTM and HMM. It also presents the sensitivity analysis on the HMM
approach to find the optimal set of features and sequence length for best performance.

3 Experimental Results and Discussion

Various experiments were executed to discover the optimal parameters of the LSTM and
HMM. The HyperOpt library is used to identify the optimal hyperparameters of the LSTM.
The optimal hyperparameters are those at which the metric exhibits the best
performance.
We consider four different regression approaches as baselines for comparison against
the performance of our proposed approach. The four baselines considered in this work
are:
– Bidirectional LSTM [3] [4]: It trains two LSTM networks, one on the forward sequence
and the other on the reversed sequence.
– Adaptive Moving Window Regression [1]: It involves a variable-sized sliding window
that dictates the input sequence to a regressor (SVR in this case), which is trained
on this window to generate output.
– Stacked LSTM [10]: It consists of multiple LSTM layers, where each LSTM layer returns
a sequence output that is fed to the next LSTM layer.
– CNN-LSTM [5]: It uses a Convolutional Neural Network to extract features from the
input and then applies an LSTM for sequence prediction.

¹¹ Extensive experimentation has also been conducted using 5-fold cross-validation on the basis
of MAE, RMSE and R² metrics, depicting similar outcomes.
¹² We have also evaluated the HMM-LR approach on other evaluation metrics such as precision,
recall, F-measure and AUC (Area Under Curve); these are not exhibited in this work.

[Figure 5: (a) panel titled MAPE_intensity, with MAPE on the vertical axis and model on the horizontal axis (Bidirectional LSTM, AMWR, Stacked LSTM, CNN LSTM, Proposed Model); (b) mape_intensity by model (AMWR, Bidirectional LSTM, CNN-LSTM, Stacked LSTM, Proposed Approach) with the significance annotations *, ns, ns, ns, **.]

Fig. 5. a) MAPE performance of various regression approaches on intensity b) Statistical
significance testing comparing the performance of the regression approaches

We follow a similar training and evaluation protocol for all models, using HyperOpt
to determine hyperparameters such as the learning rate and the number of hidden layers.
We use a learning rate of 1e-3 for all neural networks. For BiLSTM and stacked LSTM,
we use 3 hidden layers with 16 hidden nodes, along with a dropout of 0.25. For
CNN-LSTM, we use a 1D convolutional layer with 64 filters of kernel size 5 and a 1D max
pooling layer, followed by an LSTM layer. In all networks, we use ReLU (Rectified
Linear Unit) as the activation function in the hidden layers and a linear activation
in the output layer. For AMWR [1], we use an initial training window of 15, a radial
basis function kernel for the SVR regressor and a gamma of 0.1. For the proposed LSTM
regression, we use 4 hidden layers with 8 hidden nodes each and a lookback of 5.
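A lookback of 5 means that each training sample consists of the five most recent time steps, with the following step as the target. The sketch below shows one common way to build such samples from a multivariate series; the function name and the array layout (time steps in rows, features in columns) are assumptions made for illustration.

```python
import numpy as np

def make_windows(series, lookback=5):
    """Turn a (n, d) series into inputs of shape (n - lookback, lookback, d)
    and next-step targets of shape (n - lookback, d)."""
    series = np.asarray(series, dtype=float)
    X = np.stack([series[i:i + lookback] for i in range(len(series) - lookback)])
    y = series[lookback:]
    return X, y
```

With 8 time steps of 3 features each, this yields 3 samples: the first uses rows 0 to 4 as input and row 5 as its target.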
We validate the performance of the regression approaches considered in this work
on the basis of MAPE¹³. The MAPE results are depicted in Fig. 5(a). The proposed
approach gives the best MAPE and surpasses the baselines, including the approach of
[1]. We apply a Kruskal-Wallis significance test [8] to check whether the differences
between the regression approaches are statistically significant. In this test, the
null hypothesis states that there is no statistically significant difference in
performance between the regression approaches, and the alternate hypothesis states
that the performance of at least one approach differs from the others. Fig. 5(b)
reports the outcome of this test for the considered regression techniques.
The LSTM approach forecasts the data points and sends them to the CEP engine. In the
CEP engine, the data points are “merged” or “fused” to generate complex events. But
the congestion in traffic is also influenced by temporal, weather and social
parameters. We propose a supervised HMM that takes the predicted data points of the
regression approach and estimates the log likelihood of a “congestion” or “no
congestion” event for the input sequence, using Maximum Likelihood Estimation. The
ground truth labels identifying congestion occurrence in the particular region are
obtained using the TomTom API.

¹³ Similar performance is also exhibited in terms of R², MAE and RMSE metrics.
In the regression part, we use an LSTM network and an FCN (Fully Connected Network),
with dropout layers for regularization. LSTMs are very good at handling sequential
data, which makes them an appropriate choice for predictions in a traffic scenario
where the input is sequential time series data. We feed the output of the LSTM into
the FCN.
As described previously, given an input sequence
$X = [x_1, x_2, \ldots, x_t, \ldots, x_T]$, we compute $h_T$ using the equations
governing the LSTM architecture.
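FCN:

$$\mathrm{int}_{T+1} = W_{\mathrm{int}} \cdot h_T + b_{\mathrm{int}} \qquad (3)$$

$$\mathrm{vel}_{T+1} = W_{\mathrm{vel}} \cdot h_T + b_{\mathrm{vel}} \qquad (4)$$

$$\mathrm{tc}_{T+1} = W_{\mathrm{tc}} \cdot h_T + b_{\mathrm{tc}} \qquad (5)$$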


where int, vel, tc represent intensity, velocity and tweet count respectively. This
output is then ingested by the HMM-LR for further computation of probabilistic events.
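Equations (3) to (5) amount to three linear output heads that share the final LSTM state. The sketch below evaluates them for a given $h_T$; the hidden size of 2 and all weight values are hypothetical and serve only to show the computation.

```python
import numpy as np

def fcn_heads(h_T, params):
    """Apply one linear head per target: value = W . h_T + b."""
    return {name: float(W @ h_T + b) for name, (W, b) in params.items()}

# Hypothetical final hidden state and head parameters.
h_T = np.array([1.0, 2.0])
params = {
    "int": (np.array([1.0, 1.0]), 0.5),   # W_int, b_int
    "vel": (np.array([0.5, -0.5]), 1.0),  # W_vel, b_vel
    "tc":  (np.array([2.0, 0.0]), 0.0),   # W_tc, b_tc
}
next_step = fcn_heads(h_T, params)
```

Here `next_step["int"]` is 1·1 + 1·2 + 0.5 = 3.5, and the three predicted values together form the next element of the sequence passed on to the HMM-LR.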
The HMM is a probabilistic approach that takes as input the predicted data obtained
from regression approaches and identifies the maximum likelihood that is further pro-
cessed by a logistic regression with ground truth to assign a certain probability to the
occurrence of a complex (congestion) event. The computation of probability associ-
ated with congestion event is dependent on context, that is, weather conditions, social
network activity and time of day/week.
Fig. 6(a) presents the performance of the HMM-LR on different feature combinations¹⁴.
It evaluates the efficiency of the approach on the basis of accuracy¹⁵. With the
combination of traffic data, i.e., intensity, velocity and tweet count, the model
exhibits the highest accuracy of 77%. The length of the input sequence of predicted
data points that the HMM consumes to assign probabilities to complex events can impact
the overall performance. Fig. 6(b) shows that the model performs best with a sequence
length of 5. With a sequence length of 5, the number of true positives is 534 and the
number of false positives is 309. The number of false positives increases with the
sequence length, leading to a significant loss of accuracy. When the sequence length
is 12, we observe 494 true positives and 413 false positives, and the accuracy drops
to 71%. The sequence length affects the classification accuracy, and smaller sequences
additionally result in faster predictions. The smaller sequences also hold the most
recent data points on which the models are trained for classification.
¹⁴ i, v and tc represent intensity, velocity and tweet counts respectively.
¹⁵ Similar results were obtained for precision, recall, F-measure and AUC.

[Figure 6: (a) Accuracy against Features (i, tc, itc, iv, vtc, v, ivtc); (b) Accuracy against Length of Sequence.]

Fig. 6. a) Performance evaluation of HMM on various feature combinations b) Performance
evaluation of HMM over varying sequence lengths

[Figure 7: (a) Accuracy by Model (bbn, proposed_model); (b) Accuracy by Model (Proposed, BBN), annotated “Wilcoxon, p = 0.0079”.]

Fig. 7. a) Performance evaluation of the proposed approach against the Bayesian Belief
Network [2] on the basis of accuracy b) Statistical significance test of the proposed
approach and the baseline approach

Fig. 7 presents the comparison¹⁶ in performance between the proposed approach and a
baseline approach, the Bayesian Belief Network (BBN) [2]. The BBN computes the
conditional probability of each of the traffic features, that is, intensity, velocity
and tweet count, and constructs a Bayesian network. It also considers external factors
like weather and time for predicting congestion. We have benchmarked our approach
against the Bayesian network [2]. Our model performs comparatively better than the
Bayesian network in similar settings. The underlying reason for this rise in
performance is that the HMM-LR in our approach assigns probabilities to predicted data
points, whereas the BBN computes conditional probabilities of the features and then
assigns probabilities.

¹⁶ We extensively experimented with other metrics, such as Precision, Recall and AUC, which
have not been depicted in this work; they exhibit outcomes similar to those shown in Fig. 7(a).

Fig. 7(b) depicts the statistical significance results for the two approaches,
executed using 10-fold cross-validation. Our null hypothesis is that there is no
statistically significant difference between the two approaches, while the alternate
hypothesis states that such a difference exists. With p = 0.0079 in Fig. 7(b), the
null hypothesis is rejected at the conventional 0.05 level, which indicates a
statistically significant difference between the two approaches and thus justifies
their uniqueness.
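A two-sample comparison of this kind can be sketched with an exact Wilcoxon rank-sum test. The paper only labels the test “Wilcoxon”, so the choice of the rank-sum variant (rather than the paired signed-rank test) is an assumption, the function names are hypothetical, and the accuracy values below are invented for illustration.

```python
from itertools import combinations

def average_ranks(values):
    """1-based ranks of values, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def rank_sum_exact_p(a, b):
    """Exact two-sided p-value: share of all splits at least as extreme as observed."""
    ranks = average_ranks(list(a) + list(b))
    n, total = len(a), len(a) + len(b)
    expected = n * (total + 1) / 2          # mean rank sum of group a under H0
    observed = abs(sum(ranks[:n]) - expected)
    hits = count = 0
    for idx in combinations(range(total), n):
        count += 1
        if abs(sum(ranks[i] for i in idx) - expected) >= observed - 1e-9:
            hits += 1
    return hits / count

# Hypothetical per-run accuracies for the two models.
proposed = [0.78, 0.79, 0.80, 0.81, 0.795]
bbn = [0.74, 0.75, 0.76, 0.77, 0.765]
p_value = rank_sum_exact_p(proposed, bbn)
```

In this invented example every value of one model exceeds every value of the other, so only 2 of the 252 possible splits are as extreme as the observed one and the p-value is 2/252 ≈ 0.0079.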

4 Conclusion
This paper outlines the implementation of a fused deep learning and probabilistic
approach for event processing on IoT data streams. We investigated the data and
identified the variations over different times of day and week. We identified that
environmental, social and temporal features are important factors in predicting
congestion in an ITS scenario. Hypothesis tests depict the statistical significance of
the differences between the considered approaches. Our approach performed better than
the baseline approach. It is also a suitable model for CEP integration in real-time
congestion prediction on IoT streams.

References
1. Akbar, A., Khan, A., Carrez, F., Moessner, K.: Predictive analytics for complex IoT data
streams. IEEE Internet of Things Journal 4(5), 1571–1582 (2017)
2. Akbar, A., Kousiouris, G., Pervaiz, H., Sancho, J., Ta-Shma, P., Carrez, F., Moessner, K.:
Real-time probabilistic data fusion for large-scale IoT applications. IEEE Access 6,
10015–10027 (2018)
3. Althelaya, K.A., El-Alfy, E.S.M., Mohammed, S.: Stock market forecast using multivariate
analysis with bidirectional and stacked (LSTM, GRU). In: 2018 21st Saudi Computer Society
National Computer Conference (NCC). pp. 1–7. IEEE (2018)
4. Cui, Z., Ke, R., Pu, Z., Wang, Y.: Deep bidirectional and unidirectional LSTM recurrent
neural network for network-wide traffic speed prediction. arXiv preprint arXiv:1801.02143
(2018)
5. Kim, T., Kim, H.Y.: Forecasting stock prices with a feature fusion LSTM-CNN model using
different representations of the same data. PLoS ONE 14(2) (2019)
6. Kousiouris, G., Akbar, A., Sancho, J., Ta-Shma, P., Psychas, A., Kyriazis, D., Varvarigou, T.:
An integrated information lifecycle management framework for exploiting social network
data to identify dynamic large crowd concentration events in smart cities applications.
Future Generation Computer Systems 78, 516–530 (2018)
7. Mishra, S., Jain, M., Sasank, B.S.N., Hota, C.: An ingestion based analytics framework for
complex event processing engine in internet of things. In: International Conference on Big
Data Analytics. pp. 266–281. Springer (2018)
8. Pijoan, A., Oribe-Garcia, I., Kamara-Esteban, O., Genikomsakis, K.N., Borges, C.E.,
Alonso-Vicario, A.: Regression based emission models for vehicle contribution to climate
change. In: Intelligent Transport Systems and Travel Behaviour, pp. 47–63. Springer (2017)
9. Wang, Y., Cao, K., Zhang, X.: Complex event processing over distributed probabilistic event
streams. Computers & Mathematics with Applications 66(10), 1808–1821 (2013)
10. Yu, R., Li, Y., Shahabi, C., Demiryurek, U., Liu, Y.: Deep learning: A generic approach
for extreme condition traffic forecasting. In: Proceedings of the 2017 SIAM International
Conference on Data Mining. pp. 777–785. SIAM (2017)
