Crowd monitoring and density forecasting using Wi-Fi


Sophia Azzagnuni
Timoer Buckinx
Yu Wang

Master thesis submitted under the supervision of Prof. Dr. François Horlin
and the co-supervision of Prof. Dr. Philippe De Doncker

In order to be awarded the Master's Degree in Electronics and Information Technology Engineering

Academic year 2017-2018
Acknowledgments

First and foremost, we would like to express our gratitude to our promotor, Prof. Dr. Ir.
François Horlin, and co-promotor, Prof. Dr. Ir. Philippe De Doncker, for having given us the
opportunity to work on this one-year project, composed of a master's thesis preceded by an
internship at Brussels Major Events.

We would like to address our warmest thanks to our tutor, Dr. Ir. Jean-François Determe,
for his constructive comments, remarks and engagement throughout the learning process of
this master's thesis.

Special thanks to the BME team, and especially Mr. Anthony Fina, for his help with the
installation of the sensors at Plaisirs d'Hiver and for allowing us to perform experiments.

We would like to express our sincere gratitude to our families for having supported and
accompanied us during all these years of study; to them we owe our success.

Last but not least, we would like to thank all the ULB and VUB professors and authorities.

Abstract

Understanding and measuring the dynamics of crowds have become an important research
topic in recent years. A large number of applications can benefit from this information,
such as the real-time management of people flows during large events, or the management
of disaster scenes. Measuring crowd dynamics requires time-stamped localization data of
the people involved. There are multiple ways to gather such data; detecting Wi-Fi-enabled
devices (such as smartphones) using scanners is the method that stands out, because it is
relatively cheap to deploy, non-intrusive, and requires little to no cooperation from the
people being monitored. We propose algorithms that can predict information useful to
organizers, such as the variation of crowd density over the next tens of minutes. The two
main techniques commonly used for this purpose are time series forecasting (using, e.g.,
Auto-Regressive Integrated Moving Average models) and neural networks. The analysis of
the results leads us to conclude that Auto-Regressive Integrated Moving Average models,
and in particular their seasonal variants, are the preferred forecasting method. Neural
network models have also proven to deliver promising results, albeit with a shorter prediction
horizon than Auto-Regressive Integrated Moving Average models.

Keywords: Wi-Fi, crowd density forecasting, seasonal Auto-Regressive Integrated Moving
Average, neural networks.

List of abbreviations

ACF Auto-Correlation Function


ADF Augmented Dickey-Fuller
AIC Akaike Information Criterion
AR Auto-Regressive
ARMA Auto-Regressive Moving Average
ARIMA Auto-Regressive Integrated Moving Average
BIC Bayes Information Criterion
CCTV Closed Circuit TV
CDF Cumulative Distribution Function
CID Company Identifier
DSSM Discrete State Space Model
GRU Gated Recurrent Unit
LSTM Long Short-Term Memory
MA Moving Average
MAE Mean Absolute Error
MAPE Mean Absolute Percentage Error
MGU Minimal Gated Unit
MSE Mean Squared Error
NIC Network Interface Card
OUI Organizationally Unique Identifier
PACF Partial Auto-Correlation Function
PDF Probability Density Function
ReLU Rectified Linear Unit
RMSE Root-Mean-Square Error
RNNs Recurrent Neural Networks
RSSI Received Signal Strength Indication
SGD Stochastic Gradient Descent
SSE Sum of Squared Errors
SSL Secure Sockets Layer
SSNN State Space Neural Network

Contents

Acknowledgments i

Abstract ii

List of abbreviations iii

1 Introduction 1

1.1 Events organization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1

1.2 Approaches for carrying out crowd monitoring . . . . . . . . . . . . . . . . . 3

1.3 Crowd movements forecasting methods . . . . . . . . . . . . . . . . . . . . . 5

1.4 Structure of the master thesis . . . . . . . . . . . . . . . . . . . . . . . . . . 6

2 System development 8

2.1 Global system architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8

2.2 Hardware . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9

2.3 Software . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10

2.3.1 Sensor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10

2.3.2 Server/Website . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14

2.4 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20

2.4.1 Data preprocessing . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20

2.4.2 Crowd movements between sensors . . . . . . . . . . . . . . . . . . . 23

2.4.3 Comparison with Proximus measurements . . . . . . . . . . . . . . . 25

3 Theoretical background 28

3.1 Deterministic and stochastic processes . . . . . . . . . . . . . . . . . . . . . 28

3.1.1 First and second-order moments of stochastic processes . . . . . . . . 29

3.2 Time series . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30

3.2.1 Modeling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30

3.2.2 Stationarity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30

3.2.3 Decomposition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32

3.3 Correlation functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33

3.3.1 Autocorrelation function . . . . . . . . . . . . . . . . . . . . . . . . . 33

3.3.2 Partial autocorrelation function . . . . . . . . . . . . . . . . . . . . . 33

3.4 Some operators . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34

3.5 Assessing stationarity and transformations . . . . . . . . . . . . . . . . . . . 35

3.6 Concepts of underfitting and overfitting . . . . . . . . . . . . . . . . . . . . . 37

4 Univariate non-seasonal forecasting model 38

4.1 AR model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38

4.2 MA model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41

4.3 Duality between AR and MA . . . . . . . . . . . . . . . . . . . . . . . . . . 42

4.4 ARIMA models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43

4.4.1 Identifying an ARIMA model . . . . . . . . . . . . . . . . . . . . . . 45

4.5 Experimental results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49

4.5.1 Model identification . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49

4.5.2 Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52

4.5.3 Diagnostic checking . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53

4.6 Prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56

5 Seasonal ARIMA model 60

5.1 Seasonal time series . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60

5.2 Mathematical model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61

5.3 Additive and multiplicative seasonality . . . . . . . . . . . . . . . . . . . . . 62

5.4 Identifying a seasonal model . . . . . . . . . . . . . . . . . . . . . . . . . . . 63

5.5 Experimental results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64

5.5.1 Model identification . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65

5.5.2 Model estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71

5.5.3 Diagnostic checking . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72

5.5.4 Prediction for sensor 7 . . . . . . . . . . . . . . . . . . . . . . . . . . 75

5.6 Comparison with another sensor . . . . . . . . . . . . . . . . . . . . . . . . . 80

5.7 Comments on seasonal ARIMA models . . . . . . . . . . . . . . . . . . . . . 85

6 Feedforward Neural Networks 87

6.1 Introduction to neural networks . . . . . . . . . . . . . . . . . . . . . . . . . 87

6.2 Activation functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88

6.3 Optimization methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92

6.3.1 Configuring the hyperparameters . . . . . . . . . . . . . . . . . . . . 94

6.4 Experiments on time series data . . . . . . . . . . . . . . . . . . . . . . . . . 97

6.4.1 Keras . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97

6.4.2 Accuracy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97

6.4.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98

6.4.4 Batch size . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99

6.4.5 Dropout . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99

6.5 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108

7 Recurrent Neural Networks 109

7.1 Recurrent Neural Network Models . . . . . . . . . . . . . . . . . . . . . . . . 109

7.2 RNNs training process . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112

7.2.1 Initialization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112

7.2.2 Forward propagation . . . . . . . . . . . . . . . . . . . . . . . . . . . 114

7.2.3 Cost function calculation . . . . . . . . . . . . . . . . . . . . . . . . . 116

7.2.4 Backward propagation . . . . . . . . . . . . . . . . . . . . . . . . . . 116

7.3 Optimization Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117

7.3.1 Gradient Descent . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118

7.3.2 RMSprop . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119

7.3.3 Adam . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121

7.4 Long Short-Term Memory . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122

7.4.1 Difficulties faced by deep RNNs . . . . . . . . . . . . . . . . . . . . . 123

7.4.2 LSTM cell structure . . . . . . . . . . . . . . . . . . . . . . . . . . . 123

7.5 Experimental results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126

7.5.1 Training and Testing Data . . . . . . . . . . . . . . . . . . . . . . . . 126

7.5.2 Parameter setting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126

7.5.3 Experiments of different LSTM network architectures . . . . . . . . . 128

7.5.4 Weights and biases initialization . . . . . . . . . . . . . . . . . . . . . 133

7.5.5 Optimizers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133

7.5.6 Prediction results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134

7.5.7 Impact of input size . . . . . . . . . . . . . . . . . . . . . . . . . . . 138

7.5.8 Comments on LSTM model . . . . . . . . . . . . . . . . . . . . . . . 139

7.6 Gated Recurrent Units . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141

7.7 Variants of GRU . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143

7.8 Results with GRU Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . 144

7.8.1 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152

8 Analysis of the presented forecasting techniques 154

9 Conclusion 156

References 157

List of Figures

1.1 Crowd density/flow relationship with standing/walking visitors . . . . . . . . 2

2.1 Developed system . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8

2.2 Developed sensor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9

2.3 MAC address composition . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11

2.4 Structure of phdata table in MySQL database. . . . . . . . . . . . . . . . . . 15

2.5 Part of the received measurements in MySQL database. The measurements


are from December 26 at Place de la Monnaie. . . . . . . . . . . . . . . . . . 16

2.6 Main Page . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17

2.7 Number of MAC addresses detected by each sensor in real-time . . . . . . . 18

2.8 An example of heat map for Plaisirs d’Hiver . . . . . . . . . . . . . . . . . . 18

2.9 Web page for detailed information query . . . . . . . . . . . . . . . . . . . . 19

2.10 Web page for daily line chart . . . . . . . . . . . . . . . . . . . . . . . . . . . 20

2.11 Example of consecutive MAC addresses detected on the same electronic de-
vice. The MAC address is split into two parts in database: the OUI and NIC
parts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21

2.12 Beacon time interval calculation . . . . . . . . . . . . . . . . . . . . . . . . . 22

2.13 Cumulative Distribution Function of beacon time interval . . . . . . . . . . 22

2.14 Measurements made by sensor 1 (Place de la Bourse) on December 23 . . . . 23

2.15 Antenna coverage by Proximus . . . . . . . . . . . . . . . . . . . . . . . . . 25

2.16 Comparison between Data on December 27 . . . . . . . . . . . . . . . . . . . 26

2.17 Comparison between Data on December 27 with an extrapolation factor 1.55 27

3.1 Underfitting and overfitting models . . . . . . . . . . . . . . . . . . . . . . . 37

4.1 Two examples of data from AR models with different parameters . . . . . . . 39

4.2 Two examples of data from MA models with different parameters . . . . . . 41

4.3 Box-Jenkins technique . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45

4.4 Number of unique detected MAC addresses at Marché aux Poissons (sensor
5) on December 27 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50

4.5 Result of differentiated logarithm transformed data . . . . . . . . . . . . . . 51

4.6 The ACF and PACF of the transformed data . . . . . . . . . . . . . . . . . . 51

4.7 Result by applying ARIMA (0,1,1) on measurements at Marché aux Poissons


on December 27. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53

4.8 ACF and PACF of the residuals . . . . . . . . . . . . . . . . . . . . . . . . . 54

4.9 Summarized proceeding to achieve forecasting based on ARIMA . . . . . . . 55

4.10 Forecast results at Marché aux Poissons (sensor 5) on December 28 . . . . . 57

4.11 Prediction results on Place de la Bourse and Place de la Monnaie using


ARIMA model developed from Marché aux Poissons on December 27. The
prediction horizon is 9 minutes in both cases. . . . . . . . . . . . . . . . . . 59

5.1 Respectively additive seasonality and multiplicative seasonality . . . . . . . . 63

5.2 Number of unique detected MAC addresses at Grand-Place for measurements


made from 9:30 to 23:30 on December 22, 27, 29 and 30. . . . . . . . . . . . 66

5.3 Decomposition of data shown on Figure 5.2 into the seasonal, trend and
remainder patterns . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67

5.4 ACF and PACF of logarithm transformed training measurements . . . . . . 68

5.5 ACF and PACF of differentiated logarithm transformed data . . . . . . . . . 70

5.6 ACF and PACF of residuals . . . . . . . . . . . . . . . . . . . . . . . . . . . 73

5.7 Residual Q-Q plot . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74

5.8 Forecast result at Grand-Place with prediction horizon of 30 minutes . . . . 75

5.9 Forecast result at Grand-Place with prediction horizon of 20 minutes . . . . 77

5.10 Forecast result at Grand-Place with prediction horizon of 15 minutes . . . . 78

5.11 Forecast result at Grand-Place with prediction horizon of 10 minutes . . . . 78

5.12 Forecast results for Scenario 1 (week days) . . . . . . . . . . . . . . . . . . . 81

5.13 Scenario 2 (weekend days) - Forecast results for sensor 1 . . . . . . . . . . . 83

6.1 Perceptron structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87

6.2 Network of perceptrons . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88

6.3 Sigmoid function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89

6.4 Hard sigmoid function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90

6.5 Hyperbolic tangent function . . . . . . . . . . . . . . . . . . . . . . . . . . . 90

6.6 Rectified Linear Unit function . . . . . . . . . . . . . . . . . . . . . . . . . . 91

6.7 Learning rate in function of iterations . . . . . . . . . . . . . . . . . . . . . . 93

6.8 Loss in function of learning rate . . . . . . . . . . . . . . . . . . . . . . . . . 94

6.9 Comparison of different optimizer algorithms based on the MNIST digit clas-
sification for 50 epochs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96

6.10 Accuracy of different activation functions in function of neurons . . . . . . . 98

6.11 Accuracy of different dropout rates corresponding to a weight constraint . . 100

6.12 Feedforward network with a prediction horizon of 10 minutes, with 1 hidden


layer and 50 neurons . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101

6.13 Feedforward network with a prediction horizon of 10 minutes, with 1 hidden


layer and 50 neurons, zoomed in . . . . . . . . . . . . . . . . . . . . . . . . . 102

6.14 Losses of feedforward NN with a prediction horizon of 10 minutes and with


1 hidden layer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102

6.15 Comparison of feedforward neural network with a prediction horizon of 10


minutes and with multiple hidden layers . . . . . . . . . . . . . . . . . . . . 104

6.16 Losses of FNN with increasing neurons . . . . . . . . . . . . . . . . . . . . . 105

6.17 FNN forecast with 100 neurons in the hidden layer and a prediction horizon
of 10 minutes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106

6.18 FNN forecast with 100 neurons in the hidden layer and a prediction horizon
of 10 minutes zoomed in . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107

7.1 RNNs basic structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110

7.2 RNNs unrolled structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111

7.3 Three types of RNNs structure . . . . . . . . . . . . . . . . . . . . . . . . . 112

7.4 An example of cost function in 3D space . . . . . . . . . . . . . . . . . . . . 113

7.5 A RNN cell structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115

7.6 The backward propagation at time step t for a RNN cell . . . . . . . . . . . 117

7.7 Gradient descent convergence diagram . . . . . . . . . . . . . . . . . . . . . 118

7.8 RMSprop convergence diagram . . . . . . . . . . . . . . . . . . . . . . . . . 121

7.9 LSTM Cell structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124

7.10 Validation loss of different models with one hidden layer . . . . . . . . . . . 128

7.11 Validation loss of different models with 2 hidden layers . . . . . . . . . . . . 129

7.12 Validation loss of different models with total number of neurons 120 . . . . . 131

7.13 Validation loss of different models with a fixed ratio . . . . . . . . . . . . . . 132

7.14 The training result of our model under three optimization algorithms . . . . 134

7.15 Full prediction results of LSTM model on December 31 by all sensors . . . . 136

7.16 Prediction results of LSTM model on December 31 at Place de la Bourse . . 136

7.17 Prediction results of LSTM model on December 31 at Place de la Monnaie . 137

7.18 Online machine learning with a sliding window for time series forecasts . . . 138

7.19 LSTM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141

7.20 GRU . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141

7.21 GRU variant models made based on MNIST row-wise generated sequences
having a batch size of 32 and 100 hidden units . . . . . . . . . . . . . . . . . 144

7.22 Grid search to compare increasing neurons with different activations . . . . . 145

7.23 GRU Network with 2 hidden layers, each containing 50 neurons and a pre-
diction horizon of 10 minutes . . . . . . . . . . . . . . . . . . . . . . . . . . 146

7.24 GRU Network with 2 hidden layers, each containing 50 neurons and a pre-
diction horizon of 10 minutes, zoomed in . . . . . . . . . . . . . . . . . . . . 146

7.25 Losses of GRU Network with 2 hidden layers, each containing 50 neurons . . 147

7.26 Losses of GRU Network with 2 hidden layers, each containing 10, 50 or 100
neurons . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147

7.27 Losses of GRU Network, data shuffled every epoch . . . . . . . . . . . . . . . 149

7.28 Losses of GRU Network, data not shuffled every epoch . . . . . . . . . . . . 149

7.29 Zoomed in GRU Network, data shuffled every epoch . . . . . . . . . . . . . . 150

7.30 Zoomed in GRU Network, data not shuffled every epoch . . . . . . . . . . . 150

7.31 Comparison of losses of networks with multiple GRU layers and a prediction
horizon of 10 minutes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152

List of Tables

2.1 Sensors’ locations during Plaisirs D’Hiver . . . . . . . . . . . . . . . . . . . . 17

2.2 Crowd movements between sensors on December 31 from 16:00 to 17:00 . . . 24

2.3 Ratio of number of MAC addresses detected once to total detected MAC
addresses on December 31 from 16:00 to 17:00 . . . . . . . . . . . . . . . . . 25

4.1 Duality in correlation patterns for AR and MA processes . . . . . . . . . . . 42

4.2 Examples of ARIMA models . . . . . . . . . . . . . . . . . . . . . . . . . . . 44

4.3 Summary of correlation patterns . . . . . . . . . . . . . . . . . . . . . . . . . 47

4.4 Descriptive statistics of the raw measurements at Marché aux Poissons on


December 27 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50

4.5 ADF test results for the transformed data at Marché aux Poissons (sensor 5)
on December 27 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52

4.6 ARIMA model estimation results . . . . . . . . . . . . . . . . . . . . . . . . 52

4.7 The ADF and Box-Ljung tests results of the residuals of prediction results at
Marché aux Poissons on December 27 . . . . . . . . . . . . . . . . . . . . . . 54

4.8 The ARIMA models for different event areas . . . . . . . . . . . . . . . . . . 56

5.1 Summary of seasonal correlation patterns . . . . . . . . . . . . . . . . . . . . 64

5.2 Time intervals and their corresponding periodicity for sensor 7 (Grand-Place).
Measurements made from 9:30 to 23:30 on December 22, 27, 29 and 30. . . . 65

5.3 Descriptive statistics of the raw measurements made at Grand-Place from


9:30 to 23:30 on December 22, 27, 29 and 30. . . . . . . . . . . . . . . . . . . 66

5.4 Augmented Dickey-Fuller test results for transformed data (d = 1 and D = 1) 69

5.5 Results of seasonal ARIMA model estimation . . . . . . . . . . . . . . . . . 71

5.6 Coefficients of seasonal ARIMA model(1,1,1)×(0,1,1)28 . . . . . . . . . . . . 71

5.7 Box-Ljung and ADF tests results for residuals of seasonal ARIMA model(1,1,1)×(0,1,1)28 74

5.8 Seasonal ARIMA models for different time intervals . . . . . . . . . . . . . . 77

5.9 Errors between the predicted and observed measurements made on Saturday
30/12 from 9:30 to 23:30 at Grand-Place . . . . . . . . . . . . . . . . . . . . 79

5.10 A scale of judgment for forecast accuracy . . . . . . . . . . . . . . . . . . . . 79

5.11 Scenario 1 (week days) - Seasonal ARIMA models for sensor 1 . . . . . . . . 80

5.12 Errors between the predicted and observed measurements made on Friday
29/12 from 3:00 to 14:00 at Place de la Bourse . . . . . . . . . . . . . . . . . 82

5.13 Seasonal ARIMA models at Place de la Bourse for scenario 2 (weekend days) 82

5.14 Scenario 2 (weekend days) - Errors between the predicted and observed mea-
surements made on Sunday 31/12 from 3:00 to 14:00 at Place de la Bourse . 84

6.1 Network Structure of feedforward network with 1 hidden layer . . . . . . . . 103

6.2 Network Structure of feedforward network with 2 hidden layers . . . . . . . . 103

6.3 Network Structure of feedforward network with 3 hidden layers . . . . . . . . 103

6.4 MAE losses of a feedforward neural network for different amount of hidden
layers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104

6.5 MAE losses of a feedforward neural network with 2 hidden layers, a prediction
horizon of 10 minutes for different amounts of neurons (in the hidden layer) . 107

7.1 Training, validation and test sets . . . . . . . . . . . . . . . . . . . . . . . . 126

7.2 Result of LSTM model structure with one hidden layer . . . . . . . . . . . . 129

7.3 Result of model structure with two hidden layers . . . . . . . . . . . . . . . 131

7.4 The architecture of the proposed LSTM Network . . . . . . . . . . . . . . . 133

7.5 Results of proposed LSTM model using different initialization methods . . . 133

7.6 Results of different prediction horizons of proposed LSTM model on Decem-
ber 31 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135

7.7 MAE of different Plaisirs d’Hiver event areas . . . . . . . . . . . . . . . . . . 137

7.8 Results of LSTM models with different input sizes . . . . . . . . . . . . . . 139

7.9 MAE losses of a GRU network for different amount of neurons in its 2 hidden
layers and a prediction horizon of 10 minutes . . . . . . . . . . . . . . . . . . 148

7.10 Network Structure of GRU Network with 10 neurons in the hidden layers . . 148

7.11 Network Structure of GRU Network with 50 neurons in the hidden layers . . 148

7.12 Network Structure of GRU Network with 100 neurons in the hidden layers . 148

7.13 Network Structure with 3 hidden GRU layers . . . . . . . . . . . . . . . . . 151

7.14 Network Structure with 4 hidden GRU layers . . . . . . . . . . . . . . . . . 151

1. Introduction
As larger and larger events are organized, the need to understand crowd dynamics becomes
more important. A crowd is defined as a collection of individuals. [S13] If crowd movements
are not taken into consideration, serious consequences are bound to occur, the best known
being overcrowding, which can lead to a host of harmful effects. These consequences can,
however, be prevented by monitoring and forecasting crowd movements. Such methods
provide invaluable information that allows event organizers to take immediate action; the
crowd dynamics thereby change, which mitigates, for example, the effects of overcrowding.
[S13]

1.1 Events organization

Security has become a major concern for event organization, as significant crowd disasters
have occurred during events in the last few decades [GK18]:

• 1985. Brussels, Belgium: 39 fans died when escaping fans were crushed against a
collapsing wall at Heysel Stadium during rioting.

• 1989. Hillsborough, Sheffield: 96 died and 200 were injured after a crowd surge crushed
fans against barriers.

• 1996. Guatemala City, Guatemala: 84 people died and about 150 others were injured
during a stampede.

• 2002. Yokohama, Japan: crowd craze at a mall event, with 10 injured.

• 2004. Jamarat, Saudi Arabia: 249 pilgrims crushed and 252 injured.

• 2004. Beijing, China: 37 dead and 15 injured in a crowd rush.

• 2009. Birmingham (JLS), UK: 60 injured and 4 hospitalized.

• 2010. Duisburg, Germany: 21 people died in a stampede at Germany's Love Parade
music festival. The crush happened when hundreds of thousands of people tried to
squeeze through a narrow tunnel that served as the only access to the grounds.

• 2011. Kerala, India: 102 pilgrims killed in a stampede at an Indian festival.

• 2015. Mina Valley, Saudi Arabia: more than 1500 dead and thousands injured.

• 2016. Falls Festival, Australia: 60 injured on egress.

• 2017. Turin, Italy: more than 1500 fans were injured during a stampede triggered by
a loud noise mistaken for a terrorist attack.

As can be seen from these examples, most disasters are due to badly designed events, which
result in overcrowding (too many people and not enough space). Therefore, and even more
so in the current climate of fear caused by terrorist attacks, organizers are not only worried
about an explosion itself, but are mainly concerned about the consequences of rushing
crowds. Understanding the impact of crowd density, i.e. the number of people per square
metre, on both standing and moving crowds is therefore essential for crowd safety.

Figure 1.1: Crowd density/flow relationship with standing/walking visitors. Source [GK18]

Figure 1.1 presents a 100 m² space whose occupancy increases from 1 person/m² to 10
people/m², and illustrates how the crowd may look in low- and high-risk situations. The
main safety rules for events are not to exceed 2.7 people/m² and to provide at least 1 m of
exit width per 100 people. The figure also shows the relationship between crowd density
(people/m²) and crowd flow rate (people/m/min): as density increases, the crowd flow rate
drops. Besides, it can be shown that the shockwave in a static crowd becomes dangerous at
densities greater than 5 people/m², and that a moving crowd becomes unstable above 3
people/m². [S13] When planning an event, the organizers need to understand the risks of
overcrowding: how and where overcrowding may occur, as well as how to prevent dangerous
overcrowding from developing. Maintaining the crowd density below the levels mentioned
above reduces the risk of accidents and incidents.
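As an illustration of these guidelines, the short Python sketch below computes the maximum safe occupancy and the minimum total exit width for a given area; the 100 m² example is hypothetical and only meant to make the figures above concrete.

# Illustrative capacity check based on the safety rules quoted above
# (at most 2.7 people/m2 and at least 1 m of exit width per 100 people).
MAX_DENSITY = 2.7          # people per square metre
EXIT_WIDTH_PER_100 = 1.0   # metres of exit width per 100 people

def capacity_and_exits(area_m2):
    """Return the maximum safe occupancy and the minimum total exit width."""
    max_people = int(area_m2 * MAX_DENSITY)
    exit_width_m = max_people / 100 * EXIT_WIDTH_PER_100
    return max_people, exit_width_m

# Example: a 100 m2 area can safely hold 270 people and needs about 2.7 m of exits.
print(capacity_and_exits(100))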

In order to deal with overcrowding, some calculations can be performed in advance, such as
how much area will be available to the crowd, how the area will be used, where the entry
points will be located and how quickly the area will fill with people. Regarding the last
point, it is also possible to count the number of people attending the event in real time and
to predict the arrival of crowds in a region, so that the organizers know when the crowd
flows will exceed the region's safe capacity. In that case, they can launch emergency mech-
anisms, for instance sending warnings to people and conducting traffic control. [HZS16]
The risk of serious injury can then be anticipated. Failing to provide a safe environment for
the crowd - failing to anticipate the crowd numbers, crowd arrival rates and crowd through-
put rates - is a common element linking many disasters. [S13]

1.2 Approaches for carrying out crowd monitoring

Closed Circuit TV (CCTV) systems, also known as video surveillance, are a first way to
monitor events: professionals observe control screens and inform organizers when they detect
abnormal situations. This approach is appealing since cameras are most often already
installed throughout cities. Unfortunately, it has several drawbacks. Surveillance cameras
can, for instance, be subject to occlusions, and it remains a difficult challenge to obtain
full coverage of the area, that is to say coverage without "holes". As an alternative to
visual monitoring by professionals, automatic video analysis has also been developed, but
it has its own disadvantages. When the crowd density is high, video analysis has more
difficulty discriminating individuals and reconstructing movements across multiple cameras,
under all weather conditions. [VNDW12] Furthermore, when facial recognition is used as
the detection method, serious privacy problems have to be dealt with. [MR17]

To circumvent these drawbacks, another option for crowd monitoring can be explored: the
use of signals emitted by attendees' devices. The procedure is based either on signals sent
by smartphones towards cellular base stations, or on signals sent by complementary em-
bedded technologies like Bluetooth or Wi-Fi. The first approach is routinely carried out by
cellular operators, which already localize their users. This localization used to be done using
call data records, but many operators now have access to near real-time network signaling
data, which captures all connections made from the device to the network over all tech-
nologies (2G, 3G and 4G). The great advantage of this technology comes from the installed
network of base stations, which covers all geographical areas without holes. Its limitation
is due to the sparse emissions by smartphones (only at each transaction with the network,
or every three hours) and to the large localization uncertainty, typically around 300 m in
cities. [WFR13] On the other hand, the use of Bluetooth or Wi-Fi signals is also widespread.
The idea is to detect the signals that these technologies send when trying to discover or
connect to access points intentionally placed in the analysis area. The main advantage is a
localization uncertainty that depends on the number of access points, whereas the principal
drawback is the infrastructure of sensors that has to be installed, which limits the analysis
to a restricted area. Furthermore, the system does not take into account people whose
Bluetooth/Wi-Fi devices are turned off; an extrapolation factor must therefore be used to
go from the number of attendees counted by the sensors to the effective number of people
present at the event. Still, the reliability of this approach should only increase as more
Wi-Fi access points are installed throughout cities. Many companies today specialize in
offering localization systems; these firms include Amoobi, a Belgian spin-off of the partner
Wireless Communications Group, which relies on the Bluetooth and Wi-Fi signals trans-
mitted by smartphones. However, instead of focusing on crowd monitoring, the intent of
these companies is closely linked to market intelligence: the analysis of the collected data
is carried out a posteriori in order to deduce useful information about visitors' behavior
for retailers or event organizers, who can then infer the typical profile of their visitors so
as to efficiently organize the future event area. In the context of large public events, Blue-
tooth localization was used in Gent in 2011 to monitor crowds during the Gent festivities
[VNDW12], and Wi-Fi localization during a rock festival in 2015 in Assen, Netherlands.
[CDP16] The use of dedicated apps running on visitors' smartphones was also explored, in
2011 in London [WFR13], but it was too intrusive to be generalized to any event. All these
attempts look successful, but they were not able to monitor and anticipate crowd behavior
in a safety context. In other words, the two smartphone localization approaches, cellular
and Bluetooth/Wi-Fi, have their own limitations, which make them complementary: cellu-
lar crowd monitoring involves large amounts of data that have to be processed with the
shortest possible delay (current reporting solutions are batch oriented) and provides coarse
localization accuracy, while Wi-Fi crowd monitoring cannot be installed at large scale (the
whole Brussels Region, for instance) and needs robust cellular communication links to trans-
mit the measurements in real time. A reliable crowd monitoring system should therefore
combine the two technologies; in the Brussels centre, for example, cellular operators could
be used to determine crowd density at large scale on the main avenues of the pentagon area,
while Wi-Fi technology would locally characterize crowd densities at strategic convergence
points. Such combined systems should then make it possible to anticipate overcrowding
very efficiently, giving the organizers time to react, for instance by redirecting people to
less crowded streets.

1.3 Crowd movements forecasting methods

Crowd dynamics forecasting based on smartphone signals is, however, in its infancy. Several
algorithms exist to process and forecast crowd movement data. Currently, the method with
the best results is the Auto-Regressive model, partly because it can exploit the spatial and
temporal structure of the data. A related problem is traffic forecasting, in which attributes
of traffic, such as travel time, are analyzed and predicted. Methods for predicting such data
include the Auto-Regressive Moving Average technique and neural networks. Furthermore,
an excellent method to forecast travel time is the state space neural network, which is a
form of discrete state space model. [ZQL17]

Another commonly used technique is the Moving Average model. This method has proven
to give good results on data that may contain missing samples (due, for example, to an
unreliable sensor connection). Neural networks can be trained in such a way (by removing
some sensors from the training data) that they exhibit a similar robustness when enough
neurons are present [ZQLYL17]. However, time series techniques usually provide better
results and are preferred over building a neural network to forecast time series data. Often,
an Auto-Regressive Integrated Moving Average model is used, which adds an integration
step to the model. As for neural networks, the forecast can be improved by implementing
a Recurrent Neural Network.

1.4 Structure of the master thesis

In Chapter 2, we detail the developed crowd monitoring system, which is based on Wi-Fi
technology and dedicated to safety. This chapter covers the hardware, with a description of
the main components, as well as the software, detailing the process (and algorithms) imple-
mented on the sensors. Furthermore, the design of the server/website is laid out, and the
way we decided to process the data is explained and justified. The results obtained during
Plaisirs d'Hiver are then compared with Proximus data to assess the accuracy of our
designed sensors. Next, Chapter 3 introduces the theoretical background required to under-
stand the design of the forecasting algorithms. The difference between deterministic and
stochastic processes is explained, together with an overview of the most important statistical
moments. Afterwards, the definition of a time series is presented and its components are
discussed. Equipped with this theoretical background, the following chapters expand on a
number of advanced forecasting techniques applied to the data we gathered with our sensors
at Plaisirs d'Hiver. Chapter 4 focuses on producing a univariate non-seasonal forecasting
model and on identifying the most suitable ARIMA model. Chapter 5 introduces the concept
of seasonal ARIMA models and shows how to include the seasonal component in the ARIMA
model; it concludes with the impact of seasonality when forecasting in the specific context of
crowd monitoring. Starting from Chapter 6, forecasting algorithms based on neural networks
are implemented. First, after the basic background theory has been given, simple feedforward
neural network structures are presented and analyzed. Recurrent neural networks are
introduced in Chapter 7, where we explain how they can be used as an artificial intelligence
strategy to predict crowd densities; a further distinction is made between networks
implemented with LSTM and GRU layers. These methods are then compared in Chapter 8
based on their accuracy and computational complexity. Lastly, a conclusion is drawn in
Chapter 9.

2. System development
Before developing prediction algorithms (based on univariate time series models and neural
networks), we present a detailed explanation of how we gathered data during the event
"Plaisirs d'Hiver". The global structure of the system of sensors and central server is laid out
and the various hardware components are discussed. The software process by which a sensor
gathers the measurements and sends them to the central server is detailed. Afterwards, we
describe how these data are stored on the central server and displayed on the website, for
instance in the form of a heat map. Before the data can be used with forecasting techniques,
the raw measurements have to be processed; in particular, a time interval has to be chosen
over which to count the crowd density at each time instant, and this parameter choice is
thoroughly discussed. Next, crowd movements between sensors are analyzed, showing that
there are flows between the monitored areas. Lastly, we compare our data with the data
gathered by Proximus.

2.1 Global system architecture

Figure 2.1: Developed system

Each attendee's device (smartphone, laptop or tablet) with Wi-Fi turned on sends out probe
requests at regular time intervals in order to scan for potential access points. A probe request
includes a source MAC address, a unique identifier assigned by the manufacturer of the
device, which therefore allows a device to be distinguished from all others. As depicted
in Figure 2.1, the developed system comprises several sensors that intercept probe requests
and extract their source MAC addresses. The measurements are gathered in bundles to
ensure efficient throughput when they are sent to the central server, where the data are first
processed and then added to the database. A website provides organizers with real-time
information, such as the number of attendees.

2.2 Hardware

Figure 2.2: Developed sensor

Figure 2.2 displays the architecture of a sensor, which comprises:

1) a Raspberry Pi 3 Model B, a small single-board computer;

2) a wireless USB adapter with a 4 dBi antenna gain, corresponding to a detection range
of approximately 100 meters without obstacles; this adapter provides Wi-Fi connectivity
at 150 Mbps;

3) a 3G USB dongle, used as a modem to provide a 3G connection between the sensor and
the central server;

4) a power supply.

In this project, the wireless USB adapter is put into monitor mode. There exist four
possible modes for an 802.11 wireless card: master, ad-hoc, managed and monitor. When a
wireless card is in master mode, it acts as an access point and actively transmits a signal;
in this mode, the card manages all communications related to the network (such as
authenticating wireless clients, handling channel contention and repeating packets). Ad-hoc
mode, by contrast, creates a multipoint-to-multipoint network with no need for an access
point. In practice, this mode is used exclusively when one wants to connect several
computers directly over Wi-Fi without going through a router; each wireless card then
communicates directly with its neighbors. Nodes must be in range of each other to
communicate, and must agree on a network name and channel. Managed and monitor
modes, finally, are opposites. Managed mode is sometimes also referred to as client mode:
a wireless card in managed mode joins a network created by a master and automatically
adapts its channel. It then presents any necessary credentials to the master, and if those
credentials are accepted, the wireless card and the master are said to be associated. Cards
in managed mode do not communicate with each other directly, but only via an associated
master. [WIR07] Monitor mode, on the other hand, allows us to listen to the whole wireless
traffic. Unlike promiscuous mode, which is also used to sniff the network, monitor mode
captures packets without joining any network.
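For reference, a minimal Python sketch of how the adapter can be switched to monitor mode from the Raspberry Pi is given below; the interface name wlan1 is an assumption (it depends on the system), and the sketch relies on the standard ip and iw command-line tools rather than on any code from this thesis.

import subprocess

# Bring the interface down, switch it to monitor mode, then bring it back up.
# The interface name "wlan1" is only an example.
for cmd in (["ip", "link", "set", "wlan1", "down"],
            ["iw", "dev", "wlan1", "set", "type", "monitor"],
            ["ip", "link", "set", "wlan1", "up"]):
    subprocess.run(cmd, check=True)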

2.3 Software

2.3.1 Sensor

The sensor's purpose is to monitor in real time the number of attendees within a specific
detection range. The Raspberry Pi runs code that configures the Wi-Fi dongle in monitor
mode, captures packets, extracts MAC addresses and finally uploads the measurements to
the server.

• MAC Addresses

The IEEE 802.11 standard is designed for wireless local area network connections and pro-
vides connectivity for electronic devices such as smartphones, tablets and laptops. Because
each MAC address is unique, it can be used to represent a single electronic device. Three
types of 802.11 management frames are relevant for this work [BBQL13]:

1. Probe Requests are used by Wi-Fi devices to scan for potential access points. When a
Wi-Fi device is not connected to an access point, it sends out requests at regular
time intervals to search for access points. Modern cell phones also send out regular
probe requests even when they are already connected to an access point, in case they
can reach an access point with a stronger signal than the one they are currently
connected to.

2. An Association Request is sent when the device wants to connect to a certain access point.

3. A Reassociation Request is sent when a Wi-Fi device wants to connect to another access
point on the same network.

These packets are not sent continuously; probe requests, for example, are sent every 30-60
seconds, sometimes even less frequently if the device is connected to an access point. [BBQL13]
Moreover, a device may send a burst of probe requests.

As shown on Figure 2.3, a MAC address comprises two distinct parts. The first three bytes
correspond to the Organizationally Unique Identifier (OUI), which provides information
about the manufacturer of the network adapter; for instance, Dell has 00-14-22 and Cisco
00-40-96. The larger manufacturers of networking equipment usually have more than one
OUI. The last three bytes refer to the Network Interface Card (NIC) and uniquely identify
a Wi-Fi device. [MMD17]

Figure 2.3: MAC address composition

Traditionally, an electronic device broadcasts probe requests containing its globally unique
MAC address to all nearby access points, and then sets up a connection with an appropriate
access point. Since broadcasting a unique identity creates a privacy issue, more and more
manufacturers allow devices to use locally administered random MAC addresses when
searching for access points. MAC address randomization is performed only in the
disassociated state; in the associated state, the device still uses its unique identifier. To
randomize its MAC address, a device uses a Company Identifier (CID) as three-byte prefix
instead of the OUI. [MMD17] A CID can be bought from the IEEE; Google, for example,
owns the CID DA:A1:19. In our measurements of December 29, 105428 out of 1185151
records have DA:A1:19 as their first three bytes (approximately 9%). This randomization
has an impact on data processing, as it makes it more difficult to relate the number of
unique detected MAC addresses to the effective number of attendees present at the event.
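The share of randomized addresses can be estimated with a few lines of Python. The sketch below is only illustrative: it assumes the day's records have been exported to a CSV file with a column named macoui holding the first three bytes in colon-separated form; the file name, column name and format are assumptions, not the actual export used in this work.

import pandas as pd

# Count how many detected addresses use Google's CID prefix DA:A1:19.
df = pd.read_csv("measurements_2017-12-29.csv")            # illustrative file name
randomized = (df["macoui"].str.upper() == "DA:A1:19").sum()
print(randomized, "out of", len(df), "records",
      "({:.1f}%)".format(100 * randomized / len(df)))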

• Data transmission

Algorithm 1: Code running on every Raspberry Pi

Result: synchronize time, detect probe requests and add them to the buffer
getCurrentTimeFromServer()
import scapy
start(sendThread)
buffer = [ ]
while not interrupted do
    if packetHasDot11 then
        addToBuffer(ID, OUI + hash(NIC + pepper), RSSI, time)
    end
end

Algorithm 1 shows the procedure for capturing MAC addresses and preparing them to be
sent to the central server. Firstly, as the Raspberry Pi has no real-time clock, the current
time is fetched from the server; this downloaded time is adjusted so that it matches Brussels
local time. Sniffing is performed with the Python scapy library. In this project, we only deal
with probe request packets (Dot11 frames with type 0 and subtype 4), from which we extract
the device's MAC address and the Received Signal Strength Indication (RSSI). The
measurements are gathered in bundles to minimize the impact of protocol overhead when
sending to the server via a TCP socket; each measurement is composed of the sensor identity,
OUI + hash(NIC + pepper), the RSSI and the time at which the packet was captured by
the sensor. It should be noted that the RSSI value could in theory be used to further increase
the localization precision: from this value, the distance between the sensor and the device
sending the packet could be estimated. However, empirical tests have shown that the RSSI
value is of little use at mass events gathering a large number of electronic devices and people,
because of severe fluctuations and noise in the data sets. Because of these environmental
factors, the RSSI value is currently not used in the detection mechanism [BBQL13], but for
consistency we preferred to keep this information. Particular attention was paid to privacy.
As the second part of the MAC address uniquely identifies the device (and hence its owner),
the last three bytes are not stored directly in the database: the NIC part is first concatenated
with a pepper, a secret sequence, to strengthen the hashing. The whole sequence is then
hashed with SHA-256, a secure hash algorithm, so that the input cannot practically be
recovered from the output.
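To make the capture and anonymization steps more concrete, here is a minimal Python sketch in the spirit of Algorithm 1, based on the scapy library. The sensor identifier, pepper value and interface name are placeholders, RSSI extraction is omitted (it is version-dependent in scapy), and the actual code running on the sensors may differ.

import hashlib
import time
from scapy.all import sniff, Dot11ProbeReq

SENSOR_ID = 1                  # placeholder sensor identity
PEPPER = "secret-pepper"       # placeholder; the real pepper is kept private
buffer = []

def anonymize(mac):
    """Keep the OUI; hash the NIC part concatenated with the pepper (SHA-256)."""
    oui, nic = mac[:8], mac[9:]
    return oui, hashlib.sha256((nic + PEPPER).encode()).hexdigest()

def handle(pkt):
    # Probe requests are 802.11 management frames with type 0 and subtype 4.
    if pkt.haslayer(Dot11ProbeReq) and pkt.addr2 is not None:
        oui, nic_hash = anonymize(pkt.addr2)
        buffer.append((SENSOR_ID, oui, nic_hash, time.time()))

sniff(iface="wlan1", prn=handle, store=0)   # interface name is an assumption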

Algorithm 2 details the thread that sends the measurements from the Raspberry Pi to the
server. It periodically (every 30 seconds, for efficiency) checks whether there is data in the
buffer. If there is, a very simple preprocessing step is performed, which does not burden
the sensor with heavy computations.

Algorithm 2: Details of the send thread
Result: periodically preprocess and send data to server
while not interrupted do
    if buffer is not empty then
        createHTTPPOSTRequest()
        newbuffer = [ ]
        for MACaddress in buffer do
            if MACaddress not in newbuffer then
                addToNewbuffer(MACaddress)
            end
        end
        send(newbuffer)
        buffer = [ ]
    end
    sleep(30sec)
end

This preprocessing consists of creating a new buffer and adding to it only the entries whose
MAC addresses are not yet in that new buffer. In other words, only the first probe request
of a given MAC address is added to the new buffer. It can happen that a couple of duplicate
samples of the same MAC address are detected (0-29 seconds apart); this simple algorithm
ensures that those duplicates are removed (keeping them would only inflate the database).
Afterwards, the new buffer, containing samples with unique MAC addresses, is sent to the
server. Lastly, the buffer is cleared.
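A minimal sketch of this deduplication and send step is shown below; the server URL, the use of the requests library and the exact record layout are assumptions made for illustration only (the thesis only states that an HTTP POST request is sent every 30 seconds).

import time
import requests

SERVER_URL = "https://example.org/upload"   # placeholder URL

def send_loop(buffer):
    """Every 30 seconds, keep only the first record per MAC address and send it."""
    while True:
        if buffer:
            seen, newbuffer = set(), []
            for record in buffer:                # record = (sensor_id, oui, nic_hash, timestamp)
                device = (record[1], record[2])  # OUI + hashed NIC identifies the device
                if device not in seen:
                    seen.add(device)
                    newbuffer.append(record)
            requests.post(SERVER_URL, json=newbuffer)
            buffer.clear()
        time.sleep(30)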

2.3.2 Server/Website

Plaisirs d'Hiver takes place from the last week of November until December 31, at Grand-
Place in Brussels and around the Bourse, Place de la Monnaie, Place Sainte-Catherine and
Marché aux Poissons. This annual event attracts tens of thousands of visitors from all
over the world. Security is therefore an essential factor that has to be ensured during the
event. Web hosting services are provided by the OVH cloud computing company¹. The
server has two purposes in this project: firstly, it receives the measurements gathered by
the different sensors and, after authentication and anonymization, stores them in a MySQL
database; secondly, it displays real-time results on the website. The website and database
were fully used during Plaisirs d'Hiver.

¹ The website designed for Plaisirs d'Hiver is available at https://www.bmebigdata.be/

2.3.2.1 Data acquisition

Once a sensor has gathered a certain amount of measurements, it sends a POST request to
the server over a Secure Sockets Layer connection. The server first checks the integrity of
the submitted data, to make sure the full data package has been received. It then anonymizes
the data a second time (the first anonymization being done on the sensor side, before the
measurements are sent). This second step again combines a hash function with a pepper,
where the pepper changes every hour during the event days. The pepper further improves
information security; however, against attacks such as a pre-computed rainbow table attack,
this defense mechanism is not entirely optimal.
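A possible sketch of this second, server-side anonymization with an hourly pepper is given below; the way the pepper is derived here is purely illustrative, since the thesis does not specify how the hourly pepper is generated or stored.

import hashlib
from datetime import datetime

SERVER_SECRET = "server-secret"   # placeholder, kept only on the server

def hourly_pepper(now=None):
    """Derive a pepper that changes every hour during the event days."""
    now = now or datetime.utcnow()
    return hashlib.sha256((SERVER_SECRET + now.strftime("%Y-%m-%d-%H")).encode()).hexdigest()

def anonymize_again(nic_hash):
    """Second hashing pass applied to the already-hashed NIC part."""
    return hashlib.sha256((nic_hash + hourly_pepper()).encode()).hexdigest()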

Figure 2.4 shows the complete structure of the table in the database, where the sensor ID
and RSSI are of data type int, and the MAC address and time are of data type varchar.
Figure 2.5 shows an example of the measurements stored in the database.

Figure 2.4: Structure of phdata table in MySQL database. The table contains an auto-
increment key (id), the identifier of each sensor (sensor ID), MAC address (separated into
two parts: macoui and machash), RSSI and time stamp of detected probe request (time).

Figure 2.5: Part of the received measurements in MySQL database. The measurements are
from December 26 at Place de la Monnaie.

2.3.2.2 Data display

Several interfaces were designed, each with a different purpose. On the main page, a map of
the whole event area is displayed, where red dots with numbers indicate the locations of the
sensors; the sensors are placed at the main event areas listed in Table 2.1: Grand-Place,
Place de la Monnaie, Place de la Bourse and Marché aux Poissons. In addition, we decided
to put two sensors around the Bourse (sensors 1 and 3). All sensors are placed high above
the ground, so that they cannot be touched by passers-by. Moreover, each sensor is housed
in a sturdy, waterproof box to protect it from bad weather.

Figure 2.6: Main Page

Sensor Number Location

1 Place de la Bourse
2 Place de la Monnaie
3 Rue du Midi
4 Rue Sainte-Catherine
5 Marché aux Poissons
6 Rue de l’Evêque
7 Grand-Place

Table 2.1: Sensors’ locations during Plaisirs D’Hiver

Figure 2.7 shows a window that we designed, containing a table of the crowd numbers
detected by each sensor over the last 10 minutes. The numbers are automatically refreshed
every 30 seconds. Besides providing crowd information, these numbers help us see which
sensors are working and which are not; if a sensor stops working for some reason, we can
quickly check and fix it.

Figure 2.7: Number of MAC addresses detected by each sensor in real-time

Figure 2.8: An example of heat map for Plaisirs d’Hiver

Another way to illustrate our real-time measurements is the heat map. Based on the Google
Maps API and the measurements, a heat map is generated to display crowd densities over
the entire event area. Color variations reveal the crowd densities: the extreme colors are red,
indicating a high density, and green, indicating a relatively low density. Since the Google
Maps application requires the longitude and latitude of a point to place it on the map, a
list of (latitude, longitude) points is generated for each event area and street. Each point
placed on the map represents a fixed number of unique detected MAC addresses. For
example, since we decided to use 50 visitors per dot, if a sensor detects 1000 distinct MAC
addresses at Grand-Place, 20 points are selected from the set dedicated to the Grand-Place
area and placed on the map.
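The mapping from a crowd count to heat-map dots thus reduces to an integer division; the small sketch below illustrates the idea, with a purely illustrative point list for Grand-Place.

VISITORS_PER_DOT = 50

def dots_for_area(unique_macs, area_points):
    """Select one predefined (latitude, longitude) point per 50 detected visitors."""
    n_dots = unique_macs // VISITORS_PER_DOT
    return area_points[:n_dots]

# Example: 1000 distinct MAC addresses at Grand-Place -> 20 points placed on the map.
grand_place_points = [(50.8467, 4.3525)] * 40        # placeholder coordinate list
print(len(dots_for_area(1000, grand_place_points)))  # prints 20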

Figure 2.9 shows the data query system, where the user can search for the detailed data
collected by each sensor. By manually selecting the start time, end time and event area,
the corresponding detailed information is listed, as shown in Figure 2.9.

Figure 2.9: Example of the detailed information shown on the web page for December 27;
the information contains the sensor ID, MAC address, RSSI and time.

On the home page, line charts provide a qualitative global view of the measurements for each
event area. The horizontal axis represents the time, from 0 to 24 hours, and the vertical axis
is the crowd number, i.e. the number of distinct MAC addresses detected by the sensor(s)
in that area. The line chart clearly shows the variation of the crowd number and indicates
peak and off-peak hours.

Figure 2.10: An example of the daily line chart shown on the web page, for Grand-Place on
December 27

2.4 Results

2.4.1 Data preprocessing

Before building time series models, we have to process the raw measurements gathered by
the sensors. Firstly, as the measurements are stored in the server database during the event,
we separate them day by day and export the files into a local MySQL database. The next
step focuses on counting the number of unique detected MAC addresses at each time instant
t. As shown in Figure 2.11, when visitors stay within the detection range of a sensor, their
electronic device(s) keep(s) broadcasting their MAC address(es). We therefore have to choose
an appropriate time interval a such that, for each time instant t, we count the number of
attendees detected during the period [t-a/2, t+a/2].

Figure 2.11: Example of consecutive MAC addresses detected on the same electronic device.
The MAC address is split into two parts in database: the OUI and NIC parts

Firstly, as already mentioned in Section 2.3.1, a Wi-Fi device (smartphone, laptop or
notebook) usually sends out a probe request at least once every two minutes. Secondly, we
computed the time period between successive MAC address detections of the same electronic
device; to that end, we extracted the data of sensor 3 on December 23 and plotted the
probability of occurrence as a function of the time interval between successive detections of
the same MAC address, limited to 5 minutes (shown in Figure 2.12). This figure is a
histogram of the most frequent time intervals. Performing the same calculations for other
sensors or days gives very similar results. As can be seen in Figure 2.12, almost no beacon
signals are sent out after 100 seconds. A few very small, insignificant peaks remain; these
are due to people who leave the area and come back within the hour.

Figure 2.12: Beacon time interval calculation

Figure 2.13 provides the cumulative distribution function of the histogram.

Figure 2.13: Cumulative Distribution Function of the beacon time interval

Consequently, we selected a time interval a equal to two minutes, based on the literature
and on the experiment we carried out: 67% of the CDF is then covered, including all the
significant peaks. Figure 2.14 illustrates the number of unique MAC addresses gathered
by sensor 1 on December 23 with a time interval a of two minutes. This means that for
each time instant t, we counted the number of unique detected MAC addresses in the time
range [t - 1, t + 1] minutes.
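
As an illustration, the following minimal R sketch shows one way this windowed counting
could be implemented. It assumes a hypothetical data frame named detections, with columns
mac and timestamp, which does not necessarily correspond to the exact schema of our MySQL
database.

# Minimal sketch of the windowed counting of unique MAC addresses.
# 'detections' is a hypothetical data frame with columns 'mac' and 'timestamp'.
count_unique_macs <- function(detections, t, a = 120) {
  # Keep the detections falling inside [t - a/2, t + a/2] (a is in seconds)
  in_window <- detections$timestamp >= (t - a / 2) &
               detections$timestamp <= (t + a / 2)
  # Count each MAC address only once within the window
  length(unique(detections$mac[in_window]))
}

# Tiny hypothetical example: two devices fall inside the 2-minute window
detections <- data.frame(
  mac       = c("aa:01", "aa:01", "bb:02", "cc:03"),
  timestamp = as.POSIXct("2017-12-23 12:00:00") + c(0, 70, 30, 200)
)
count_unique_macs(detections, as.POSIXct("2017-12-23 12:00:30"))   # returns 2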

Figure 2.14: Measurements made by sensor 1 (Place de la Bourse) on December 23

2.4.2 Crowd movements between sensors

As explained in Section 2.3.2.2, sensors were placed at different event areas and streets.
We then decided to analyze the potential crowd movements between sensors; by tracking
the appearances of MAC addresses at different sensors, we can infer crowd walking
directions. This analysis also helps us to understand the influence of one sensor area on
another.

Table 2.2 displays the results based on measurements made on December 31 from 16:00 to
17:00. The diagonal values are the numbers of unique MAC addresses detected by each sensor
on December 31 from 16:00 to 17:00. The value located at (row i, column j) corresponds to the
number of MAC addresses first detected by the i-th sensor and then by the j-th sensor; this
indicates crowd flows from one area to another. By reading row by row, we can tell
which area is the most likely next destination when people leave, while a column-by-column
analysis allows us to determine the contributions of other sensors to one particular sensor.
For instance, in the first column, among the 5747 visitors detected by sensor 1 (Place de
la Bourse) within one hour, 254 came from sensor 4 (area of Rue Sainte-Catherine) while 154
came from sensor 7 (Grand-Place).

Sensor Number 1 2 4 5 7
1 5747 57 303 60 176
2 71 7120 80 30 117
4 254 73 5839 124 114
5 30 31 83 6179 38
7 154 76 126 45 12742

Table 2.2: Crowd movements between sensors on December 31 from 16:00 to 17:00
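
The tabulation behind Table 2.2 can be sketched in R as follows; this is a simplified
illustration assuming a hypothetical data frame detections (columns mac, sensor, timestamp)
restricted to the 16:00-17:00 time slot, and it only keeps, for each MAC address, the first
sensor and the first later detection by a different sensor.

# Sketch of the origin-destination counting between sensors (simplified)
crowd_movements <- function(detections) {
  # Order the detections in time, MAC address by MAC address
  detections <- detections[order(detections$mac, detections$timestamp), ]
  pairs <- lapply(split(detections, detections$mac), function(d) {
    origin <- d$sensor[1]                       # sensor of the first detection
    later  <- d$sensor[d$sensor != origin]      # later detections elsewhere
    if (length(later) == 0) return(NULL)        # MAC address seen at one sensor only
    data.frame(from = origin, to = later[1])
  })
  pairs <- do.call(rbind, pairs)
  # Cross-tabulate origins (rows) against destinations (columns)
  table(pairs$from, pairs$to)
}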

Furthermore, MAC address randomization makes it difficult to analyze the crowd move-
ments, since it becomes hard to track visitors moving from one zone to another. Therefore,
for each sensor coverage area, the ratio of the number of MAC addresses detected only once
to the total number of detected MAC addresses is computed; it is denoted as r in Table 2.3.
The reason why a MAC address is detected only once could be that the visitor is simply passing
by, or that a randomized MAC address is generated from time to time. Here we assume MAC
address randomization happens at every sensor area with the same occurrence probability.
According to the ratios, most visitors spent a very short time, or most of the detected people
were only passing by, at Place de la Bourse (sensor 1) and Place de la Monnaie (sensor 2). On
the other hand, visitors tend to stay longer at Grand-Place (sensor 7) and especially at Marché
aux Poissons (sensor 5), where dozens of chalets and shops attracted many people. Unlike the
other sensors, sensor 4 is located at a crossing, surrounded by markets as well as office
buildings, and might detect more workers and shoppers than visitors. These facts could explain
its low ratio.

Sensor Number 1 2 4 5 7
r 0.8320 0.8156 0.0923 0.0740 0.4991

Table 2.3: Ratio of number of MAC addresses detected once to total detected MAC addresses
on December 31 from 16:00 to 17:00

2.4.3 Comparison with Proximus measurements

We finally compared the measurements gathered by our sensors with those from Proximus,
in order to estimate the accuracy of our measurements. Proximus calculates the number
of attendees at Plaisirs d’Hiver based on the electronic devices connected to each antenna
zone. Figure 2.15 shows all the zones covered by Proximus for this specific event.

Figure 2.15: Antenna coverage by Proximus

However, since Proximus holds around 40% of the Belgian mobile market (value extracted
from the Proximus "Report and Results 2018 Q1", Brussels, May 4), Proximus registered
only a part of the effective number of attendees and then needed to use an extrapolation
factor to pass from the measurements to the real number of attendees. Figure 2.16 compares
the number of attendees measured by our sensors and by Proximus on December 27; in
general, our measurements and those made by Proximus have the same shape.

However, there exist differences between the two curves, which could be explained by the
three following reasons:

• Only the electronic devices with Wi-Fi turned on are detected by our sensors, which
introduces a negative bias with respect to the real crowd number.

• As explained in Section 2.3.1, MAC address randomization creates a positive bias,
because the same electronic device can be counted multiple times.

• Our sensors cover a much smaller area than the antennas of Proximus (see Figure 2.15).

Therefore, the ratio of the real number of attendees to the total number of visitors measured by
Proximus is lower than ours, since Proximus takes into account residents, workers and
passengers on transport around Plaisirs d’Hiver. The last point can also be illustrated
by Figure 2.17. In this figure, an extrapolation factor of 1.55 is used: we multiply our
measurements by this factor to ease the comparison with the Proximus data. Although
the curves approximately match during the day, they do not match during the
night. Indeed, the Proximus antennas cover many buildings and streets, whereas our sensors
only cover the main event areas.

Figure 2.16: Comparison between the data gathered by our sensors and the Proximus data on
December 27 (horizontal axis: time in hours; vertical axis: crowd number).

Figure 2.17: Comparison between the data on December 27 with an extrapolation factor of 1.55
applied to our measurements (horizontal axis: time in hours; vertical axis: crowd number).

3. Theoretical background
Now that we have built a system to capture beacon signals sent out from different devices and
have this raw data, we can build models to predict this data. This type of data is time series
data and can be used with the forecasting techniques discussed later. In this chapter
we first introduce the basic concepts that lay the necessary mathematical foundation to
better grasp what a time series really is. The notions of stationarity, univariate modeling, and
first- and second-order moments are introduced. Different types of classifications and components
are discussed, as well as the stochastic processes needed to understand time series data. The
definitions of the auto-correlation function and the partial auto-correlation function, which are
of great importance, are given, and the properties of these functions are analyzed. Definitions
of several useful operators are presented. Afterwards, multiple techniques are shown which
can serve as tests to assess the stationarity of a time series. At the end, more consideration
is given to the concepts of underfitting and overfitting, since they play a major role
in constructing the optimal forecasting model.

3.1 Deterministic and stochastic processes

A deterministic process assumes that its outcome is certain once the input to the model is fixed,
meaning that no matter how many times the experiment is performed, we always get the same
result. [ST18] By nature, the parameters of a deterministic model are known or assumed.
Deterministic models describe behavior on the basis of physical laws; we can calculate the
value of some time-dependent quantity nearly exactly at any instant of time. For instance,
we might calculate the trajectory of a missile launched in a known direction with known
velocity. If exact calculation were possible, such a model would be entirely deterministic.
However, a real-life phenomenon is unlikely to be totally deterministic due to unknown factors;
in the missile example, the wind velocity can throw the missile slightly off course.
In many problems, we have to consider a time-dependent phenomenon, such as the monthly
sales of newsprint, in which there are many unknown factors and for which it is not possible
to write a deterministic model that allows exact calculation of the future behavior of the
phenomenon. [BJ70] Nevertheless, it remains possible to derive a model that can be
used to calculate the probability of a future value lying between two specified limits. Such
a model is called a probability model or stochastic model.

A stochastic process is a collection of random variables indexed by a variable t, usually
representing time. [DDV15] The same set of parameter values and initial conditions will
lead to an ensemble of different outputs. For each model run, the same input results in a
different output due to the random components of the modeled process. Therefore, multiple
runs are used to estimate probability distributions. [ST18]

3.1.1 First and second-order moments of stochastic processes

Let Xt , t ∈ T be a stochastic process.

The mean function of Xt is defined by

µX (t) ≡ E(Xt ), t∈T (3.1)

The variance is introduced by

σ²_{X_t} ≡ Var[X_t] ≡ E[(X_t − µ_X(t))²]   (3.2)

The covariance function of X_t is given by

γ_X(r, s) ≡ Cov(X_r, X_s) ≡ E[(X_r − µ_X(r))(X_s − µ_X(s))],   r, s ∈ T   (3.3)

Definitions (3.1, 3.2, 3.3) come from [GJ]. The mean is a central value of a discrete set
of samples. The variance quantifies how far a set of samples is spread out from the mean
value and is only defined when the mean exists. The covariance measures the linear
dependence between two random variables. From the above definitions, we deduce that
Var[X_t] = Cov[X_t, X_t].

3.2 Time series

A time series is a set of observations indexed according to the order in which they are obtained
in time. This can be formulated as x_1, x_2, ..., x_t. They are realizations of the series of random
variables X_1, ..., X_n. [GJ] x_1 is the observation at the first time point, x_2 denotes
the value for the second time period and x_t the value for the t-th time period. The successive
values of the time series must be spaced by equal time intervals, also called lags, so that a
time series is a sequence composed of discrete-time data. [DFT89]

3.2.1 Modeling

Time series analysis can either be univariate or multivariate; both univariate and multivari-
ate time series models are meant for forecasting purposes.

A univariate time series contains single observations that are recorded sequentially over
equal time increments. It is also possible to augment a simple univariate time series with
some exogenous set of parameters. [IO16] Therefore, a univariate time series involves one
dependent variable and possibly one or many independent (exogenous) variables, such as weather
variables.

On the other hand, a multivariate time series model is an extension of the univariate case
and involves two or more dependent variables. It does not limit itself to its own past
information but also incorporates the past of the other variables. Multivariate processes arise
when several related time series are observed simultaneously over time, instead of observing
a single series as in the univariate case, and their study refers to the interrelationships among
the time series variables. These relationships are often studied through consideration of the
correlation structures among the component series. [IO16] As in the univariate case,
exogenous parameters can also be involved in a multivariate time series.

3.2.2 Stationarity

Based on their dynamic behavior, time series can be divided into stationary
and non-stationary series; stationarity means that the stochastic relationship between one obser-
vation and the next does not depend on when the observation is made. [HA14] To
model the time series data with more accuracy, it is of utmost importance to transform the
input data into a stationary time series. Stationarity can be formally defined in
terms of weak and strict stationarity; weak stationarity only concerns the shift-invariance
(in time) of the first and second moments of a process, introduced in Section 3.1.1, whereas
strict stationarity concerns the shift-invariance (in time) of its finite-dimensional distri-
butions. [BD02]

• Weak stationarity

E[Xt ] = µ ∀t

E[(Xt − µ)2 ] = σ 2 < ∞ ∀t

γX (r, s) = γX (r + t, s + t) ∀r, s, t ∈ Z

which implies that γX (r, s) is a function of r - s, which is convenient to define as

γX (h) = γX (h, 0)

where h refers to the lag order.

A process satisfying these assumptions (coming from [BD02]) is said to be weakly sta-
tionary, or covariance stationary. Therefore, a weak stationary time series does not exhibit
any systematic change in mean (no trend), any systematic change in variance nor periodic
fluctuations.

• Strict stationarity

Let x_{t_i} be the observation at time t_i. A time series is strictly stationary if the joint
distribution of x_{t_1}, x_{t_2}, x_{t_3}, ..., x_{t_n} is the same as the joint distribution of
x_{t_1+k}, x_{t_2+k}, x_{t_3+k}, ..., x_{t_n+k} for all n and all k ∈ Z. [BD02]

3.2.3 Decomposition

Time series are usually investigated to determine an historical pattern that can be exploited
in the preparation of a forecast. In order to identify the pattern, any time series can
be decomposed into several components: trend, seasonality, cyclic changes and remainder
component (random noise). Decomposing a series into those components enables us to
analyze the behavior of each component and then to improve the accuracy of the final
forecast. Definitions of those four components are taken from [HA14] and [DDV15].

• Trend

There exists a trend in the time series when the data is not fluctuating around a constant
mean but rather displays an upward (or downward) evolution over a long period of time.
The trend can be described by the simple linear model in Equation 3.4,

x_t = a t + b   (3.4)

where a and b refer to the parameters of the trend.

Furthermore, a least-squares method [DM18] can be used to determine the optimal
coefficients a and b, respectively â and b̂:

[â, b̂] = argmin_{a,b} Σ_t ( x_t − (a t + b) )²   (3.5)

where x_t represents the time series sample at time period t. Minimizing Expression 3.5
produces a fit for the trend line.
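
Such a least-squares trend fit can be obtained directly in R with lm; the short sketch below
uses a synthetic series rather than our measurements.

# Minimal sketch: least-squares estimation of a linear trend a*t + b
set.seed(1)
t <- 1:200
x <- 0.05 * t + 3 + rnorm(200, sd = 1)   # synthetic series with a known trend

fit <- lm(x ~ t)             # minimizes sum_t (x_t - (a*t + b))^2
coef(fit)                    # "(Intercept)" is b_hat, "t" is a_hat
detrended <- residuals(fit)  # series with the fitted trend removed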

• Seasonality

Seasonality can be seen as a pattern in time series data that is repeated after a certain amount
of time; these repetitions are called periodic fluctuations. Seasonality thus provides a great foun-
dation for forecasting algorithms, which can easily exploit this type of behavior. Seasonal
fluctuations are most often attributed to social customs or weather changes. It is also impor-
tant to note that periodic pattern changes can only be attributed to seasonality if the period
between patterns is less than a year.

• Cyclic changes

Cyclic changes are similar to seasonal patterns but differ in the fact they correspond to
non-periodic fluctuations (without any predictable period) and the time between cyclical
patterns is variable (usually takes longer than one year).

• Remainder

Once the trend, seasonality and cyclic changes have been modeled, the only remaining
part is the irregular component, also known as the residual series. Therefore, if the
residuals of a time series model still show some correlation (a pattern, for instance), the time
series has not been decomposed optimally into its components.

3.3 Correlation functions

3.3.1 Autocorrelation function

The ACF expresses the correlation of the time series samples with shifted samples and is
defined as:

ρ_X(h) = γ_X(h) / γ_X(0)   (3.6)

The ACF measures the predictability of x_t by using the value x_s (h = t − s). If x_t can be
predicted from x_s through a perfect linear relationship (x_t = α x_s + β), then ρ_X(h) is either
1 or −1. [SS06] By looking at the decline of the ACF, we can deduce how many lags (past samples)
are correlated with the current sample. This can be used in models to recognize patterns in
the data.

3.3.2 Partial autocorrelation function

The PACF of a stationary process x_t, denoted as φ_hh, is defined as follows:

φ_11 = corr(x_1, x_0) = ρ(1),   h = 1   (3.7)

and

φ_hh = corr(x_h − x_h^{h−1}, x_0 − x_0^{h−1}),   h ≥ 2   (3.8)

where x_h^{h−1} refers to the regression of x_h on x_{h−1}, x_{h−2}, ..., x_1, which is written as

x_h^{h−1} = β_1 x_{h−1} + β_2 x_{h−2} + ··· + β_{h−1} x_1   (3.9)

In addition, we could write x_0^{h−1} as

x_0^{h−1} = β_1 x_1 + β_2 x_2 + ··· + β_{h−1} x_{h−1}   (3.10)

Definitions (3.7, 3.8, 3.9, 3.10) come from [SS06]. Following the same procedure as with
the ACF, we count the number of lags that are significant (above the noise threshold) to
know how many delayed samples influence the current sample.
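
In practice, both functions can be estimated and plotted with the base R functions acf and
pacf; the sketch below uses a simulated AR(2) series instead of our measurements.

# Sketch: sample ACF and PACF of a simulated AR(2) series
set.seed(42)
x <- arima.sim(model = list(ar = c(0.6, 0.3)), n = 500)

acf(x,  lag.max = 40, main = "Sample ACF")    # autocorrelation up to lag 40
pacf(x, lag.max = 40, main = "Sample PACF")   # partial autocorrelation up to lag 40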

3.4 Some operators

The backward shift operator B is defined by :

B xt = xt−1 (3.11)

Hence B m xt = xt−m .

The inverse operation is performed by the forward shift operator F = B −1 and is given by :

F xt = xt+1 (3.12)

Hence F m xt = xt+m .

Another important operator is the backward difference operator ∇ which can be written in
terms of B, since

∇xt = xt − xt−1 = (1 − B)xt (3.13)

In turn, ∇ has for its inverse the summation operator S given by:

∇^{−1} x_t = S x_t = Σ_{j=0}^{∞} x_{t−j}
           = x_t + x_{t−1} + x_{t−2} + ...
           = (1 + B + B² + ...) x_t
           = (1 − B)^{−1} x_t   (3.14)

Definitions (3.11, 3.12, 3.13, 3.14) are taken from [BJ70].
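
These operators have direct counterparts in R: diff applies the backward difference ∇, and
cumsum acts as a finite summation operator, as sketched below on a toy vector.

# Sketch: backward difference and finite summation on a toy series
x  <- c(2, 5, 7, 8, 12)
dx <- diff(x)                        # (1 - B) x_t : 3 2 1 4
x_back <- c(x[1], x[1] + cumsum(dx)) # re-integrating recovers the series from x[1]
all.equal(x, x_back)                 # TRUE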

3.5 Assessing stationarity and transformations

The statistical properties of data are often not constant over time. However, a necessary
condition to perform time series analysis is that the data is weakly stationary. [NS13] In order
to identify whether data is stationary, the following approaches can be carried out:

• ACF plot

The ACF is plotted over a number of different lags. The principle is to check whether there exists
correlation between the current and earlier samples in the data. In case the data exhibits
non-stationary behavior, the ACF plot will not immediately drop; instead, it will either
decrease slowly with a small slope or oscillate.

• Unit root test

Consider the following example:

x_t = ρ x_{t−1} + ε_t

where ρ is a constant and ε_t an error term.

ρ is the parameter of interest. Note that it determines the type of process x:

• If |ρ| > 1, x is an explosive process.

• If ρ = 1, x is a random walk.

• If |ρ| < 1, x is a stationary process.

In case ρ = 1, the process x is said to have a unit root. A unit root test [SS06] is designed
to test whether a time series variable possesses a unit root and is thus non-stationary. The
null hypothesis is that the variable contains a unit root; under the alternative hypothesis |ρ|
< 1 (meaning the time series is stationary or trend-stationary, depending on the test used).
The unit root tests include the DF test, ADF test, Schmidt–Phillips test and Zivot-Andrews test.

Transformations to achieve a weak stationary time series are the following :

• Taking the logarithm or the square root of the time series are two ways of stabilizing
the variance. [NS13] In the case of negative or null data, an offset is added to all the
samples to make the data positive before applying the transformation. We later apply the
inverse logarithm transform, taking care to remove the added constant, to obtain the
predicted values when forecasting future points.

• Differencing, based on Equation 3.13, or smoothing with a moving average,
defined by Equation 3.15, are the most commonly used techniques to stabilize the mean of a
time series, [NS13] which then results in removing the trend contained in the data. It
can be noticed that differencing is often a preferable option because it does not require the
estimation of model parameters and the trend components can vary over time.
This is in contrast with model fitting, where those components are estimated once and
then remain fixed.

y_t ≡ (1 / (2q + 1)) Σ_{i=−q}^{q} x_{t+i}   (3.15)

where q defines the half-width of the sliding window that smooths the time series.
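
These transformations can be sketched in R as follows; the series x is synthetic (an
exponential trend plus noise) and adf.test comes from the tseries package.

# Sketch: variance/mean stabilization and a stationarity check on synthetic data
library(tseries)                      # provides adf.test()

set.seed(7)
x <- exp(0.01 * (1:300) + rnorm(300, sd = 0.2))   # positive series with a trend

y  <- log(x)                          # stabilize the variance
dy <- diff(y)                         # remove the trend (backward difference)
adf.test(dy)                          # a small p-value suggests stationarity

# Alternative: smoothing with a centered moving average of window 2q + 1
q  <- 3
ma <- stats::filter(y, rep(1 / (2 * q + 1), 2 * q + 1), sides = 2)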

3.6 Concepts of underfitting and overfitting

Figure 3.1: Underfitting and overfitting models. Source [DM18]

Figure 3.1 highlights the underfitting and overfitting problems encountered when trying to
determine a model based on data samples. The two upper graphs show that when the model
underfits the data, i.e. it poorly fits the samples, it is not possible to retrieve the behavior
of the underlying curve. Indeed, for M = 1 coefficient taken into account in the model, we try
to fit the data with a linear regression model, which causes a significant fitting error. This
very simple model is not able to capture the complexity of the underlying distribution that
generated the data. On the other hand, if the model overfits the data, i.e. it fits the samples
perfectly, it will perform poorly for out-of-sample predictions. Indeed, we then create a model
that is more complex than the underlying process and we end up fitting the noise. For
instance, the curve for M = 9 passes through all the data samples perfectly, meaning the training
error is zero. Both cases are bad since such models cannot be used in other scenarios.
Finally, a model with a non-zero training error was selected, since it was tried with external
data and provided satisfying results in comparison with the other models. [DM18]

4. Univariate non-seasonal
forecasting model
The previous chapter presented the basic notions needed to understand the modeling and
forecasting of time series data. The following sections present a class of forecasting methods,
namely ARIMA models, allowing us to predict, from past and current samples, the data within
a range of several minutes. At first, the AR model as well as a number of its properties are
introduced. Next, the MA model is explained and the duality between the AR and MA
processes is detailed. With that background, ARIMA models are presented. Based on the Box-
Jenkins procedure, the three stages - identification, estimation, diagnostic checking - are
applied to determine the correct ARIMA orders before performing the forecast. The last part
presents the forecast results obtained with the developed ARIMA models.

4.1 AR model

The AR model expresses the future value as a linear combination of past values. As the
name "auto-regressive" suggests, this process involves a linear regression of the variable
against one or more past values of the same time series, and allows forecasting
since there is some correlation between points in a time series. [D00] The common notation
is AR(p), where p denotes the order of the model and defines how many lagged past values
must be taken into account to capture the time series pattern.

An AR process of order p [DFT89] is a stationary process verifying relation 4.1:

x_t = Σ_{i=1}^{p} φ_i x_{t−i} + ε_t   ∀t ∈ Z   (4.1)

Where

• x_{t−i} represents x at time t − i

• φ_i corresponds to the auto-regressive coefficient at lag i

• ε_t is white noise with zero mean and variance σ² (i.e. randomness)

Therefore, each observation x is composed of a random component ε and a linear combi-
nation of previous observations, whose effects depend on the corresponding auto-regressive
coefficients φ. The error ε is also called the innovation. Finally, the residuals are assumed to
be random in time (not autocorrelated) and normally distributed. [D00]

It should be noticed that an AR process will only be stable if its parameters lie within
a specific interval; for instance, the first-order AR(1) process has the form of a regression
model in which the current value x_t is based on the immediately preceding value x_{t−1}. We
must have −1 < φ_1 < 1, otherwise the past effects would accumulate and the series would
not be stable anymore. [HA14] Figure 4.1 displays the graphs for AR(1) with
φ_1 = 0.9 and AR(1) with φ_1 = 1, respectively.

Figure 4.1: Two examples of data from AR models with different parameters. Left: AR(1)
with x_t = 0.9 x_{t−1} + ε_t. Right: AR(1) with x_t = x_{t−1} + ε_t. In both cases, ε_t corresponds
to a normally distributed white noise with µ = 0 and σ² = 0.1
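
Series such as those of Figure 4.1 can be reproduced in R; arima.sim handles the stationary
case, while the unit-root case (φ_1 = 1) is simulated by hand since arima.sim requires a
stationary AR part. This is only a sketch on synthetic data.

# Sketch: simulating a stationary AR(1) and a random walk (phi_1 = 1)
set.seed(3)
n       <- 300
ar_stat <- arima.sim(model = list(ar = 0.9), n = n, sd = sqrt(0.1))  # x_t = 0.9 x_{t-1} + eps_t

eps <- rnorm(n, mean = 0, sd = sqrt(0.1))
rw  <- cumsum(eps)                    # x_t = x_{t-1} + eps_t (unit root)

par(mfrow = c(1, 2))
plot.ts(ar_stat, main = "AR(1), phi = 0.9")
plot.ts(rw,      main = "Random walk, phi = 1")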

For an AR(p) model, the ACF (presented in Section 3.3.1) will not be zero for lags h > p,
whereas the PACF (presented in Section 3.3.2) will be equal to zero for lags h > p. Equation
4.2 provides the PACF in the case of an AR(p) process:

φ_hh = { other (generally non-zero) values,  if h ≤ p
       { 0,                                  if h > p        (4.2)

To illustrate this point, consider the AR(1) model, i.e. x_t = φ x_{t−1} + ε_t with |φ| < 1. By
definition, φ_11 = ρ(1) = φ. The autocovariance at lag 2 is computed as

γ(2) = cov(x_t, x_{t−2})
     = cov(φ x_{t−1} + ε_t, x_{t−2})
     = cov(φ² x_{t−2} + φ ε_{t−1} + ε_t, x_{t−2})
     = φ² γ(0)

For the PACF value φ_22, we first compute x_2^1, which is the regression of x_2 on x_1. By
Equation (3.9), we get:

x_2^1 = β x_1

where β is chosen to minimize E(x_2 − β x_1)². By taking the derivative and setting the result
to zero, we obtain β = γ(1)/γ(0) = ρ(1) = φ. Thus, x_2^1 = φ x_1. Next, we consider the
regression of x_0 on x_1, and by Equation (3.10), we have:

x_0^1 = β x_1

Again, β is chosen to minimize E(x_0 − β x_1)² and equals ρ(1) = φ. Hence, we have:

φ_22 = corr(x_2 − φ x_1, x_0 − φ x_1)

Since γ(h) = γ(0) φ^h,

cov(x_2 − φ x_1, x_0 − φ x_1) = γ(2) − 2φ γ(1) + φ² γ(0) = 0

Thus, we have φ_22 = 0.

4.2 MA model

Instead of using past values of the forecast variable in a regression, a MA model uses past
forecast errors in a regression-like model. [D00] The common notation is MA(q), where q
is the order of the moving average process.

A MA model of order q [DDV15] is a process defined by Equation (4.3):

x_t = ε_t + Σ_{i=1}^{q} θ_i ε_{t−i}   ∀t ∈ Z   (4.3)

Where

• ε_{t−i} corresponds to the white noise sample at lag i

• θ_i represents the moving average coefficient at lag i

• ε_t refers to white noise with zero mean and variance σ²

Therefore, each observation x is a linear combination of recent innovations, whose impact
depends on the corresponding MA coefficients θ. [DFT89] Figure 4.2 shows the graphs for
MA(1) with θ_1 = 0.8 and MA(2) with θ_1 = −1 and θ_2 = 0.8, respectively.

Figure 4.2: Two examples of data from MA models with different parameters. Left: MA(1)
with x_t = ε_t + 0.8 ε_{t−1}. Right: MA(2) with x_t = ε_t − ε_{t−1} + 0.8 ε_{t−2}. In both cases,
ε_t is normally distributed white noise with µ = 0 and σ² = 1.

A MA(q) process, as defined in Equation (4.3), is stationary in mean, so we have (with θ_0 = 1):

E(x_t) = Σ_{j=0}^{q} θ_j E[ε_{t−j}] = 0

Its autocovariance function is:

γ(h) = cov(x_{t+h}, x_t)
     = E[ ( Σ_{j=0}^{q} θ_j ε_{t+h−j} ) ( Σ_{k=0}^{q} θ_k ε_{t−k} ) ]
     = { σ² Σ_{j=0}^{q−h} θ_j θ_{j+h}   if h = 0, 1, 2, ..., q
       { 0                              if h > q                     (4.4)

The ACF is then obtained by dividing by γ(0):

ρ_h = { ( Σ_{j=0}^{q−h} θ_j θ_{j+h} ) / ( 1 + θ_1² + ... + θ_q² )   if h = 1, 2, ..., q
      { 0                                                           if h > q            (4.5)

We see that the ACF of a MA(q) process is zero beyond the order q of the process. In
other words, the ACF of a MA process has a cut-off at lag q.
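
This cut-off can be checked numerically with the theoretical ACF routine ARMAacf available in
base R; the MA coefficients below are arbitrary examples, not estimates from our data.

# Sketch: theoretical ACF of a MA(2) process, illustrating the cut-off at lag q = 2
round(ARMAacf(ma = c(-1, 0.8), lag.max = 6), 3)
# lags 1 and 2 are non-zero; lags 3 and beyond are exactly zero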

4.3 Duality between AR and MA

Table 4.1 introduces the duality between AR and MA processes.

          AR(p)                                           MA(q)
ACF       Exponentially decreasing or damped sine wave    Cuts off after lag q
PACF      Cuts off after lag p                            Exponentially decreasing or damped sine wave

Table 4.1: Duality in correlation patterns for AR and MA processes. Source [ST18]

Any stationary AR(p) model may be written as a MA(∞) process. For example, by using
repeated substitution, an AR(1) model becomes:

x_t = φ_1 x_{t−1} + ε_t
    = φ_1 (φ_1 x_{t−2} + ε_{t−1}) + ε_t
    = φ_1² x_{t−2} + φ_1 ε_{t−1} + ε_t
    = φ_1³ x_{t−3} + φ_1² ε_{t−2} + φ_1 ε_{t−1} + ε_t   (4.6)

With −1 < φ_1 < 1, φ_1^k gets smaller as k increases. Therefore, Equation 4.6 can
finally be rewritten as a MA(∞):

x_t = Σ_{k=0}^{∞} φ_1^k ε_{t−k}   (4.7)

Any invertible MA(q) model may be described by an AR(∞) process. For example, by
using repeated substitution, a MA(1) model becomes:

x_t = ε_t + θ_1 ε_{t−1}
    = ε_t + θ_1 (x_{t−1} − θ_1 ε_{t−2})
    = ε_t + θ_1 x_{t−1} − θ_1² (x_{t−2} − θ_1 ε_{t−3})
    = ε_t + θ_1 x_{t−1} − θ_1² x_{t−2} + θ_1³ ε_{t−3}

and can finally be written as:

(−θ_1)^n ε_{t−n} = ε_t − Σ_{k=0}^{n−1} (−θ_1)^k x_{t−k}   (4.8)

However, if |θ_1| < 1, then

E[ ( ε_t − Σ_{k=0}^{n−1} (−θ_1)^k x_{t−k} )² ] = E[ θ_1^{2n} ε_{t−n}² ]  →  0   as n → ∞   (4.9)

and we say that the sum is convergent in the mean square sense. [HA14] Hence, Equation
4.8 can be rewritten as an AR(∞):

ε_t = Σ_{k=0}^{∞} (−θ_1)^k x_{t−k}   (4.10)
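
This duality can be verified numerically with the base R helper ARMAtoMA, which returns the
MA(∞) weights of a given ARMA model; the coefficient below is only illustrative.

# Sketch: MA(infinity) weights psi_k of a stationary AR(1) with phi_1 = 0.9
psi <- ARMAtoMA(ar = 0.9, lag.max = 10)
round(psi, 4)                     # 0.9, 0.81, 0.729, ... = phi_1^k, as in Equation 4.7
max(abs(psi - 0.9^(1:10)))        # numerically zero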

4.4 ARIMA models

Based on the two presented models, it is already possible to describe a well-known class of
forecasting models for time series: ARMA. [DDV15] This model supposes weak stationarity of
the input data. As mentioned in Section 3.5, this means the mean as well as the variance of
the time series are constant over time. Besides, the best approach to eliminate the trend is
to use differencing. A time series that needs to be differenced to become stationary is
considered an integrated version of a stationary series. Therefore, we decided to work with
the generalization of ARMA: ARIMA, where the integration term allows us to obtain a constant
mean - one of the conditions required to consider the time series as weakly stationary.

ARIMA models [D00] then combine three distinct processes:

• AR, which supposes each point can be described by a weighted sum of past points,
plus a random term.

• Integrated, stating that each point presents a constant difference with the previous point.

• MA, which implies each point is a function of the innovations affecting the previous points,
plus its own error.

The common notation is ARIMA(p,d,q) [ST18] and the model is given by Equation 4.11:

φ(B) (1 − B)^d x_t = θ(B) ε_t   ∀t ∈ Z   (4.11)

Where

• B corresponds to the backward shift operator

• d represents the differencing order, i.e. the number of times that we differentiate

• φ(B) = 1 − Σ_{i=1}^{p} φ_i B^i

• θ(B) = Σ_{i=0}^{q} θ_i B^i  (with θ_0 = 1)

Model                    ARIMA(p,d,q)
White noise              ARIMA(0,0,0)
Random walk              ARIMA(0,1,0) with no constant
Random walk with drift   ARIMA(0,1,0) with constant
AR(p)                    ARIMA(p,0,0)
MA(q)                    ARIMA(0,0,q)

Table 4.2: Examples of ARIMA models. Source [ST18]

4.4.1 Identifying an ARIMA model

We applied the Box and Jenkins methodology [BJ70] to select the appropriate ARIMA orders.
This method implies the model determination in three steps: identification, estimation and
diagnostic checking, as presented in Figure 4.3. The first stage consists in identifying the
orders of the ARIMA(p,d,q), followed by the estimation of the model; diagnostic checking is
then performed on the residuals to evaluate whether the estimated model is well fitted. In
case the diagnostic checking fails, meaning the model does not fit well, we have to go back
to the identification stage with the target of finding a better model. The procedures followed
in the identification, estimation and evaluation stages are taken from [HA14], [ST18] and
[WS18].

Figure 4.3: Box-Jenkins technique. Source [BJ70]

• Identification starts with finding the appropriate order of integration d,
based on the ADF unit root test. We then visualize the PACF and ACF graphs to make
guesses regarding the AR order p and the MA order q, respectively.

• Estimation is the next step to be carried out: we select the p and q orders that minimize
several criteria - AIC, BIC, SSE and the noise variance (σ²). AIC and BIC
are the more important criteria for model selection, because they are key factors in preventing
the overfitting problem. This methodology allows us to assess the quality of the model.

• Evaluation is finally done to verify the adequacy of the model(s) selected at the
estimation stage. This step implies diagnostic checks on the residuals, which correspond to
the differences between the observed time series values and the fitted ARIMA model.
If the model fits well, the residuals should be distributed as Gaussian white noise, that is,
random, homoscedastic and normal. Indeed, the residuals should be small and no systematic
or predictable patterns should be left in them, since all the relevant information
should have been extracted for forecasting. When focusing on the diagnostic checking, we
first visually analyze the PACF and ACF graphs of the residuals to determine whether the
residuals diverge strongly from white noise. Besides, the Ljung-Box test is used as a
statistical test for checking the serial correlation of the remaining residuals of the fitted
model, and the ADF test checks whether those residuals are stationary. The model should pass
all these diagnostic checks to be considered well fitted and appropriate for forecasting. In
this work, the Ljung-Box test will hold the most weight in deciding whether a model should be
rejected, implying a return to the identification stage of the Box-Jenkins procedure.

4.4.1.1 Identification

The ADF test [WS18] [S15] is used to evaluate the null hypothesis of a unit root against the
alternative of stationarity and is based on the following model:

ΔX_t = (ρ − 1) X_{t−1} + Σ_{i=1}^{p−1} δ_i ΔX_{t−i} + ε_t   (4.12)

where ρ is the parameter of interest, Δ the first difference operator, δ_i are parameters and
p corresponds to the lag order of the auto-regressive process. The lagged differenced variables
are included to account for possible serial correlation that would otherwise appear in the
error term ε_t, which is assumed to be approximately a white noise process.

What is tested is the null hypothesis of a unit root, ρ = 1, meaning the data is non-stationary,
versus the alternative hypothesis |ρ| < 1, i.e. the time series is stationary. The test statistic
that is used is based on the t-type statistic:

DF_τ = (ρ̂ − 1) / SE(ρ̂)   (4.13)

where the estimated value of the test statistic should be compared to the relevant
critical value of the Dickey-Fuller distribution.

Correlograms

         AR(p)                             MA(q)                              ARMA(p,q)
ACF      Exponentially decreasing          Cuts off after lag q               Exponentially decreasing or damped
         or damped sine wave                                                  sine wave after q − p lags
PACF     Cuts off after lag p              Exponentially decreasing           Exponentially decreasing or damped
                                           or damped sine wave                sine wave after p − q lags

Table 4.3: Summary of correlation patterns. Source [ST18]

4.4.1.2 Estimation

AIC is a means to express the distance between the data and the model. Naturally, the
distance should be as small as possible. AIC introduces a term to penalize overly complex
models. It is the criterion that is the most often used as it produces great results with
regards to the behavior of the fit as well as its simplicity [SS06].

The AIC is formulated as

AIC ≡ ln(σ̂_k²) + (n + 2k) / n   (4.14)

Where

• σ̂_k² is the maximum likelihood estimator of the variance (given by Equation 4.15)

• k corresponds to the number of parameters in the model

The maximum likelihood estimator is defined as

σ̂_k² ≡ RSS_k / n   (4.15)

with RSS_k being the Residual Sum of Squares of a model with k coefficients.

The BIC is given by Equation 4.16:

BIC ≡ ln(σ̂_k²) + k ln(n) / n   (4.16)
This method is known to deliver a more consistent estimate to identify the orders of a model.
Asymptotically, it is shown that the BIC would select a more parsimonious model as opposed
to the AIC. However the AIC is more often used so that more significant parameters can
be used for the model [SS06].

The SSE measures the distance between two sets of observations (true samples and predicted
samples) quadratically and returns its sum:

SSE ≡ Σ_{i=1}^{N} (x_i − x̂_i)²   (4.17)

The SSE is very easy to implement and does perform relatively well except for overfitting.
It must be noted that AIC and BIC are very different methods than the SSE. AIC and BIC
penalize the model in a sense to avoid overfitting, unlike SSE.

Variance σ 2 is a means to show how much the samples deviate from their corresponding
mean and can be used to identify the goodness of a fit. The underlying idea is that samples
must more or less follow the same distribution.

4.4.1.3 Diagnostic checking

The Ljung and Box test [NS13] [ST18] is a diagnostic tool used to test the lack of
fit of a time series model; it is applied to the residuals of the time series after fitting the
ARIMA(p,d,q) model to the measurements. This test checks the randomness of the
residuals and thus determines whether the residuals can be considered serially independent.

Generally, the Ljung and Box test is defined as:

• H0 : the data are independently distributed

• Ha : the data are not independently distributed and exhibit serial correlation.

Given a time series of length T, the test statistic is given by:

Q(m) = T (T + 2) Σ_{l=1}^{m} r_l² / (T − l)   (4.18)

Where

• r_l is the estimated autocorrelation of the series at lag l

• m corresponds to the number of lags that are tested

The null hypothesis of randomness is rejected, meaning the model has a significant lack of fit,
if

Q(m) > χ²_α

where χ²_α is the 100(1 − α)-th percentile of the Chi-Squared distribution with m degrees
of freedom and significance level α.

It may be noticed that most software packages actually compute a p-value. We reject the null
hypothesis if the p-value is sufficiently small, in other words if

p < α

where α refers to the significance level. We usually take α = 5%.
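
In R, this test is available through the Box.test function; the sketch below applies it to the
residuals of an ARIMA fit on a simulated series (the data and orders are illustrative, not
those of our sensors).

# Sketch: Ljung-Box test on the residuals of an ARIMA fit (simulated data)
set.seed(5)
x   <- arima.sim(model = list(ar = 0.7, ma = -0.3), n = 400)
fit <- arima(x, order = c(1, 0, 1))

res <- residuals(fit)
# fitdf discounts the number of estimated ARMA coefficients (here p + q = 2)
Box.test(res, lag = 10, type = "Ljung-Box", fitdf = 2)
# A p-value above 0.05 means we cannot reject the hypothesis of independent residuals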

4.5 Experimental results

Now that we have introduced the stages and associated tests to determine the appropriate
orders p, d, q of an ARIMA model, we can reproduce this procedure for a specific sensor, here
sensor 5, which covers Marché aux Poissons. Afterwards, we will display the results obtained
for other sensors, based on the same method. All experiments are carried out with the R
software. R is a well-developed, simple and effective programming language, and it has
been extended by many packages which are suited to time series analysis.

4.5.1 Model identification

The first step in identifying a well-fitted model is to plot the time series, as it can directly
reveal features, such as a trend, that are present in the data. For each sensor, we split the
data into two sets: the training set used for determining the ARIMA model comes from one
day, and we then test the model on the raw measurements of another day. Before the model is
built, we downsample the raw measurements to a time resolution of 3 minutes.

Figure 4.4: Number of unique detected MAC addresses at Marché aux Poissons (sensor 5)
on December 27

Mean Variance Std. dev. Minimum Maximum Median


393.3 12750.8 112.919 74 581 411

Table 4.4: Descriptive statistics of the raw measurements at Marché aux Poissons on De-
cember 27

Figure 4.4 shows a time series plot of sensor 5 on December 27, 2017. The horizontal axis
represents time (in minutes) and corresponds to approximately 12 hours, from 11:30 until
23:30. The vertical axis is the number of unique MAC addresses detected at each time step.
This is obviously a non-stationary process, since we can clearly see a trend where the data
first increases and then decreases. We therefore apply a logarithm transformation to the data
to obtain stationarity in variance and difference it once to obtain stationarity in mean
(remove the trend). The differencing order is also assessed with the ndiffs function in R.
Those transformations allow us to satisfy the conditions for a weakly stationary time series.
The final result is shown in Figure 4.5, where the vertical axis is the data after
transformation; we can see that there are no trends nor periodic fluctuations in the data
anymore.

Figure 4.5: Result of differentiated logarithm transformed data

We then analyze the ACF and PACF plots of the differenced data. As explained in Chapter
3, the ACF plot helps to visualize the autocorrelation coefficients and gives an insight into
the MA order q. Similarly, by inspecting the PACF plot, we can determine the AR order p. In
Figure 4.6, the ACF cuts off at lag 1 and the coefficients of the remaining lags are below the
white noise level; we therefore assume a MA(1) process. Regarding the PACF, we observe one
significant lag, which suggests an AR(1) component.

Figure 4.6: ACF and PACF of the transformed data. The blue dashed lines represent the
confidence interval of white noise.

To verify stationarity, we run the ADF test on the transformed data; the detailed result is
shown in Table 4.5. As can be seen, the null hypothesis of a unit root is rejected
(p-value < 0.01) and our transformed data can be considered stationary.

Augmented Dickey-Fuller
Dickey-Fuller Lag order p-value
-7.766 5 <0.01

Table 4.5: ADF test results for the transformed data at Marché aux Poissons (sensor 5) on
December 27

4.5.2 Estimation

As advised in the previous section, the AIC is computed to select the candidate orders of the
model. We use the forecast package in R to automatically fit our model with parameters p, d
and q. The auto.arima function runs combinations of order parameters and selects the one
with the optimum model fit according to the AIC value. Table 4.6 summarizes possible p
and q combinations. Finally, the ARIMA(0,1,1) model is selected.

p d q AIC SSE σ2
0 1 1 -222.02 4.26 0.0202
0 1 2 -220.47 4.25 0.0201
1 1 1 -220.49 4.25 0.0201
1 1 2 -218.52 4.25 0.0200
2 1 1 -218.50 4.25 0.0200

Table 4.6: ARIMA model estimation results. Model ARIMA (0,1,1) minimizing AIC is
selected and highlighted in bold.
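
A minimal sketch of this estimation step is given below; it assumes the 3-minute counts of
sensor 5 are available in a hypothetical numeric vector counts5.

# Sketch of the estimation step with the forecast package; 'counts5' is a
# hypothetical vector holding the 3-minute counts of sensor 5
library(forecast)

y <- log(counts5)                     # logarithm transform, as in Section 4.5.1
ndiffs(y)                             # suggested number of differences d

# auto.arima searches over (p, d, q) and keeps the model minimizing the chosen criterion
fit <- auto.arima(y, d = 1, seasonal = FALSE, ic = "aic")
summary(fit)                          # in our case this search selected ARIMA(0,1,1)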

The ARIMA model of the data can be written in its final form, given by Equation (4.19):

(1 − B) y_t = ε_t − 0.5688 ε_{t−1}   (4.19)

Where

• y_t = log(x_t) and x_t corresponds to the number of unique detected MAC addresses

• ε_t ~ Normal(0, 0.02)

Figure 4.7 shows the fit provided by the model. The prediction horizon is set to one
time step, so that the value at time step t is predicted from the values up to step t − 1 plus
some random noise. The red line is the result of the model fitting, the blue polygon is the 80%
confidence bound, and the gray polygon is the 95% confidence bound. The MAE
of the model is 42.016. As can be seen from the plot, our model fits the
measurements well; the following plots and tests help to confirm this point.

Figure 4.7: Result by applying ARIMA (0,1,1) on measurements at Marché aux Poissons
on December 27.

4.5.3 Diagnostic checking

To evaluate our model, we first check the ACF and PACF plots of the model residuals. As
can be seen in Figure 4.8, the autocorrelation and partial autocorrelation coefficients are
all below the noise level and there is no significant lag in either plot, which confirms that
our model is a good fit to the data. Additionally, we run the ADF test and the Ljung-Box test
and observe their p-values. The results are shown in Table 4.7. For the Ljung-Box test, the
null hypothesis is: the data are independently distributed. If the p-value is smaller than
0.05, we can reject the null hypothesis with a 5% chance of making a mistake; in that case, we
would say that the residuals are dependent on each other. In the other case, where the p-value
is greater than 0.05, we do not have enough statistical evidence to reject the null hypothesis:
we cannot conclude anything regarding independence, as the residuals can be either dependent
or independent. This is nevertheless a good sign; since our p-value is 0.88, which is greater
than 0.05, we cannot reject the null hypothesis.
For the ADF test, as the p-value is relatively high, we cannot reject the null hypothesis (the
residual series is non-stationary).

Figure 4.8: ACF and PACF of the residuals. The blue dashed lines represent the confidence
interval of white noise.

Augmented Dickey-Fuller
Dickey-Fuller Lag order p-value
-1.0911 5 0.9217

Ljung Box
X-squared df p-value
0.06762 1 0.8833

Table 4.7: Results of the ADF and Ljung-Box tests on the residuals of the model fitted at
Marché aux Poissons on December 27

The procedure for performing ARIMA modelling, with the goal of forecasting future data, can
be summarized by the flowchart shown in Figure 4.9.

Figure 4.9: Summarized proceeding to achieve forecasting based on ARIMA

4.6 Prediction

Based on the ARIMA model, we made predictions on the measurements of the same sensor on
another day. Figure 4.10 shows the prediction results for different time horizons, from 6
to 21 minutes. As can be seen from the plots, the prediction result looks better when
the prediction horizon is shorter; this is due to the fact that the prediction error at time
step t accumulates the previous prediction errors. Since longer forecasts carry more
uncertainty than shorter ones, it is more difficult to improve the performance for long
prediction horizons. This is also reflected in the confidence intervals: both the 95% and 80%
confidence bounds get larger when the forecast duration increases.
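
With the forecast package, predictions over several horizons can be obtained from the fitted
model; the sketch below assumes the fitted object fit of Section 4.5 and a 3-minute time step,
so that a 21-minute horizon corresponds to h = 7 steps.

# Sketch: multi-step forecasts with 80% and 95% confidence intervals
library(forecast)

fc_6min  <- forecast(fit, h = 2, level = c(80, 95))   # 6-minute horizon
fc_21min <- forecast(fit, h = 7, level = c(80, 95))   # 21-minute horizon

plot(fc_21min)
exp(fc_21min$mean)   # inverse of the logarithm transform, back to MAC address counts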

The above describes the main process that we went through for analyzing and predicting the
real measurements from the sensors. By following these steps, we built ARIMA models for the
other sensors, located at Place de la Bourse, Place de la Monnaie and Grand-Place.
Additionally, several predictions were made based on these models with respect to different
prediction horizons. The parameters of the ARIMA models are shown in Table 4.8.

Sensor Number 1 2 7
Date December 30 December 27 December 29
ARIMA(p,d,q) (2,1,4) (2,2,5) (2,1,4)
Coefficients ar1 -1.0115 ar1 -0.1143 ar1 0.2185
ar2 -0.8469 ar2 -0.8394 ar2 0.7479
ma1 0.4924 ma1 -0.5287 ma1 -0.6454
ma2 0.2316 ma2 -0.2643 ma2 -0.9272
ma3 -0.5247 ma3 -0.7494 ma3 0.3565
ma4 -0.1567 ma4 0.3799 ma4 0.2434
ma5 0.1885
AIC -222.22 -301.30 -104.04
BIC -194.13 -277.53 -75.61
σ2 0.0333 0.0302 0.0446
MAE 33.3764 23.2191 49.7755

Table 4.8: The ARIMA models for the different event areas covered by sensors 1, 2 and 7. For
each area, the ARIMA model is built on a particular date. The coefficients, AIC, BIC, noise
variance and MAE of each model are listed. The MAE is computed based on a prediction
horizon of 9 minutes.

(a) Prediction horizon 6 minutes

(b) Prediction horizon 21 minutes

Figure 4.10: Forecast results at Marché aux Poissons (sensor 5) on December 28 with two
prediction horizons: 6 minutes and 21 minutes. The forecast is obtained by applying ARIMA
(0,1,1). Gray and blue zones are 95% and 80% confidence areas.

Apart from the prediction results above, we generated predictions for other sensors by using
the ARIMA(0,1,1) model built from the measurements of sensor 5 gathered on December 27.
The results are shown in Figure 4.11. The first graph shows the prediction results of sensor 1
(Place de la Bourse) on December 31, while the second one represents the prediction results
of sensor 2 (Place de la Monnaie) on December 27. We may notice that December 27 corresponds
to the day on which the ARIMA(0,1,1) model was built. The prediction horizon is 9 minutes in
both figures. As can be observed, predictions can be made for different areas and on a
different day, but the uncertainty on the accuracy is higher. If the characteristics of two
areas are similar, then using the model of one area on the other is applicable. However, if
they have totally different characteristics, then it is unlikely to obtain good prediction
performance.

(a) Result at Place de la Bourse on December 31

(b) Result at Place de la Monnaie on December 27

Figure 4.11: Prediction results on Place de la Bourse and Place de la Monnaie using ARIMA
model developed from Marché aux Poissons on December 27. The prediction horizon is 9
minutes in both cases.

5. Seasonal ARIMA model
In the previous chapter, we restricted our attention to non-seasonal data and a uni-
variate non-seasonal forecasting model, ARIMA. This technique was able to forecast the
number of attendees within short time periods. However, the accuracy of the forecast results
degraded when choosing larger prediction horizons. In Chapter 3, we introduced the idea that a
time series can be decomposed into four components, including a seasonal pattern, and that
each component's behavior can be individually analyzed with the target of improving the
accuracy of the final forecast. Since ARIMA models are capable of learning both closeness
(short-term) and periodic dependencies, the estimated model can also take into account the
seasonal character of the time series. [ZQL17]

In this chapter, we detail the concept of seasonality in a time series and explain how to
include seasonality in ARIMA models. Afterwards, we describe the seasonal ARIMA mod-
elling procedure; the developed model is then used to forecast the number of visitors within
the next minutes. The prediction results are analyzed with respect to the defined prediction
time horizons. A discussion summarizes the whole process and points out the specificity of
the seasonal forecast model in the context of crowd movements - the interest of seasonality in
this specific application, its advantages and limitations.

5.1 Seasonal time series

A seasonal pattern exists when a series is influenced by seasonal factors, such as the month or
the day of the week. Seasonality is always of a fixed and known period and can be viewed as a
periodic pattern. [HA14] In general, we say that a series exhibits periodic behavior with
period s when similarities in the series occur after s basic time intervals. For example,
retail sales reach peaks during the Christmas season and then decline after the holidays. So
a time series of retail sales typically shows increasing sales from September through December
and declining sales in January and February. In this example, the basic time interval is one
month and the period s is 12 months. However, there are examples where s takes other
values. For instance, s = 4 for quarterly data showing seasonal effects within years. [BJ70]

5.2 Mathematical model

The seasonal model incorporates additional seasonal factors into the ARIMA model introduced
in the previous chapter. One shorthand notation for the model [WS18] is given by 5.1:

ARIMA (p, d, q) × (P, D, Q)_s   (5.1)

where p corresponds to the non-seasonal AR order, d to the non-seasonal differencing, q to the
non-seasonal MA order, P to the seasonal AR order, D to the seasonal differencing, Q to the
seasonal MA order and s to the time span of the repeating seasonal pattern.

Besides, a seasonal difference refers to the difference between an observation and the previous
observation from the same season. So

Δx_t = x_t − x_{t−s}   (5.2)

where s corresponds to the period of the season. These are also called "lag-s differences",
as we subtract the observation after a lag of s periods. Seasonal ARIMA models are thus
more complex models with an adjustment for seasonality, which is taken care of by performing
seasonal differencing. [AS16] Just as, for non-seasonal time series, first differencing elimi-
nates linear trends and second differencing eliminates quadratic trends, seasonal differencing
eliminates additive seasonality.

The seasonal ARIMA model applied to the time series X_t [WS18] can be formally written
as:

Φ_P(B^s) φ_p(B) (1 − B^s)^D (1 − B)^d X_t = Θ_Q(B^s) θ_q(B) Z_t   (5.3)

where the non-seasonal components are:

AR:  φ_p(B) = 1 − φ_1 B − φ_2 B² − ... − φ_p B^p

MA:  θ_q(B) = 1 + θ_1 B + θ_2 B² + ... + θ_q B^q

and the seasonal components are:

Seasonal AR:  Φ_P(B^s) = 1 − Φ_1 B^s − Φ_2 B^{2s} − ... − Φ_P B^{Ps}

Seasonal MA:  Θ_Q(B^s) = 1 + Θ_1 B^s + Θ_2 B^{2s} + ... + Θ_Q B^{Qs}

We can also notice that the left part of Equation 5.3 multiplies together the seasonal and non-
seasonal AR components, while the right side multiplies the seasonal and non-seasonal MA
components.
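
A model of this form can be fitted in R with the seasonal argument of the arima function (or
with auto.arima); the sketch below assumes a hypothetical vector counts7 holding the 30-minute
counts of sensor 7, with a periodicity of s = 28 samples per day, and uses placeholder orders
rather than the final model.

# Sketch: fitting a seasonal ARIMA(p,d,q)x(P,D,Q)_28 on a hypothetical vector 'counts7'
y <- ts(log(counts7), frequency = 28)

fit <- arima(y, order = c(1, 1, 1),
             seasonal = list(order = c(0, 1, 1), period = 28))
fit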

5.3 Additive and multiplicative seasonality

As already explained, the time series xt can be seen as comprising four components: trend,
seasonal component, cycle changes and a remainder component (containing anything else
in the time series).

If we assume an additive model [DDV15], then we can write the time series xt as :

xt = St + Tt + Et (5.4)

where xt is the data at period t, St is the seasonal component at period t, Tt is the trend-
cycle component at period t and Et is the remainder component at period t. Alternatively,
a multiplicative model [DDV15] would be written as :

xt = St × Tt × Et (5.5)

Seasonality can thus appear in two forms, additive and multiplicative, which are illus-
trated in Figure 5.1. Additive seasonality is assumed when the amplitude of
the seasonal fluctuations is independent of the time series level. On the other hand, when
the variation in the seasonal pattern appears to be proportional to the overall level of the
time series, a multiplicative seasonality is more appropriate. [K14] Note that in the
example of multiplicative seasonality, the seasonal swings become "wider". Obviously, if the
level were decreasing, the seasonal amplitude of the multiplicative case would decrease as
well. To select the appropriate model for producing the forecasts, we need to know the
type of seasonality we are dealing with.

Figure 5.1: Respectively additive seasonality and multiplicative seasonality. Source [K14]

5.4 Identifying a seasonal model

We previously introduced the Box and Jenkins approach to determine the appropriate orders
of an ARIMA model. We follow an identical methodology, with the target of including the
seasonal component; some specifics compared with the non-seasonal ARIMA model
can therefore be highlighted. The detection of seasonality is first done visually through the
analysis of the time series plot of the data: we examine it for features such as trend and
seasonality and, since we have gathered seasonal data, we look at the pattern across the time
units to see if there is indeed a seasonal pattern. There also exist toolboxes in R to
decompose the time series into the different patterns. Finally, the ACF and PACF plots can
help identify seasonality; for instance, if there is significant seasonality, the
autocorrelation plot should show spikes at lags equal to multiples of the period. In the case
of monthly data with a seasonal effect, we would expect to see significant peaks at lags 12,
24, 36, and so on (although the intensity may decrease the further out we go). Table 5.1
summarizes the seasonal correlation patterns.

         Seasonal AR(P)_s                   Seasonal MA(Q)_s                   Seasonal ARMA(P,Q)_s
ACF      Exponentially decreasing or        Spikes at lag Qs, then zero        Exponentially decreasing or damped
         damped sine wave at all lags                                          sine wave at all lags multiple of s,
         multiple of s                                                         after (Q − P)s lags
PACF     Spikes at lags Ps, then zero       Exponentially decreasing or        Exponentially decreasing or damped
                                            damped sine wave at all lags       sine wave at all lags multiple of s,
                                            multiple of s                      after (P − Q)s lags

Table 5.1: Summary of seasonal correlation patterns. Source [S15]

5.5 Experimental results

In order to design the seasonal forecasting model, the raw measurements are taken from sensor
7, placed at Grand-Place. In the previous chapter, we trained the ARIMA model on an entire
day while testing the forecast on another day. However, as mentioned in the introduction of
this chapter, seasonal ARIMA exploits the fact that the data contain a seasonal periodic
component in addition to the correlation with recent lags. Therefore, we have to train the
model on several aggregated days to take into account the repetition that occurs every s
observations. The days were then split into two sets: Friday 22/12, Wednesday 27/12 and
Friday 29/12 are used for the model development, while Saturday 30/12 represents the test set
for the forecasting. We thereby made the assumption that each day has a repetitive behavior.

The next step focuses on the time interval that should be selected, since this value fixes the
periodicity of the time series and thus affects the design of the forecasting model. The main
requirement is to predict the number of attendees for the next 10 to 30 minutes, such that it
gives the organizers enough time to anticipate and react to overcrowding efficiently; for
instance, they can redirect people to less crowded streets or adapt the security measures.
Therefore, we decided to design a forecasting model for time intervals of 30, 20, 15 and 10
minutes. Table 5.2 summarizes, for each time interval, the corresponding periodicity, knowing
that each day contains the crowd measurements from 9:30 to 23:30.

Time interval (min) 30 20 15 10
Periodicity 28 42 56 84

Table 5.2: Time intervals and their corresponding periodicity for sensor 7 (Grand-Place).
Measurements are made from 9:30 to 23:30 on December 22, 27, 29 and 30. Those four time
intervals will be used to design seasonal forecast models.

5.5.1 Model identification

The first step in model identification is to plot the time series data and examine features
like trend and seasonality. The time series may possibly have the following components,
which will affect the choice of the orders d and D [WS18]:

• upward/downward linear trend and no obvious seasonality ⇒ a first-order difference
d = 1 needs to be applied to make the series stationary

• no trend but seasonality ⇒ a seasonal differencing D at the lag specified by the seasonal
period s

• trend and seasonality ⇒ both the non-seasonal d and the seasonal D differencing need to be
applied

• no obvious trend nor seasonality ⇒ the series may be modelled by AR, MA or ARMA
models, which corresponds to d = 0 and D = 0

However, we should not go beyond two differencing operations, as over-differencing may cause
unnecessary levels of dependency in the time series. [KV15] Figure 5.2 displays the time
series plot corresponding to the number of unique detected MAC addresses with a time interval
of 30 minutes. We can visually state that there exists a seasonal pattern with a seasonality
of 28 samples, meaning the time series data could be modelled using seasonal ARIMA.

Figure 5.2: Number of unique detected MAC addresses at Grand-Place for measurements
made from 9:30 to 23:30 on December 22, 27, 29 and 30 (vertical axis: number of unique
detected MAC addresses; horizontal axis: time index, 30 minutes per step).

Mean       Variance   Std. dev.   Minimum   Maximum   Median
574.6232   2688.51    163.9772    100       904       594.5

Table 5.3: Descriptive statistics of the raw measurements made at Grand-Place from 9:30
to 23:30 on December 22, 27, 29 and 30.

The Seasonal Trend Decomposition using Loess (STL) is an algorithm developed to divide a
time series into three components, namely the trend, the seasonality and the remainder. The
details of the decomposition can be found in [CCMT90]. We made use of this algorithm via
the stl function included in R to decompose the time series into the different patterns. The
components are shown in the three panels of Figure 5.3. These three components can be added
together to reconstruct the data shown in Figure 5.2. The seasonal pattern clearly shows there
exists periodicity in the days and corresponds to an additive seasonality. Besides, evening peak
hours show a repetitive pattern as well as similar variation across the days. By examining the
trend pattern plot, we suggest that a cyclic trend exists in the data. The remainder component
shown in the bottom panel is what is left over when the seasonal and trend-cycle components
have been subtracted from the data. We are now going to determine an appropriate seasonal
ARIMA model based on the training set, composed of the three first days shown in Figure 5.2.
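The decomposition itself was performed with the stl function in R; purely as an illustration,
the sketch below shows the analogous call with the STL class of the Python statsmodels
library, assuming a hypothetical array y holding the counts of Figure 5.2.

```python
# Minimal sketch of an STL decomposition with a seasonal period of 28 samples.
import pandas as pd
from statsmodels.tsa.seasonal import STL

series = pd.Series(y)                       # y: hypothetical array of unique MAC counts
result = STL(series, period=28, robust=True).fit()
trend, seasonal, remainder = result.trend, result.seasonal, result.resid
```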


Figure 5.3: Decomposition of data shown on Figure 5.2 into the seasonal, trend and remainder
patterns. The seasonal pattern plot proves there exists periodicity in the raw measurements,
with some repetitive evening peaks.
The ACF of the logarithm-transformed measurements of the three training days, shown in
Figure 5.4, has very high positive values up to large orders and then assumes significantly high
negative values. It is also clear that, at every multiple of the periodicity (28), the ACF swings
higher than the neighboring values. This sine wave pattern is consistent with the fact that the
series contains a seasonal effect. [BBPS14] For the PACF, the sine wave pattern is weaker, but
there are significant lags before it tails off. Based on the two plots, it is difficult to tell which
orders should be fitted, but the process is clearly a mix of both AR and MA orders.

Figure 5.4: ACF and PACF of logarithm transformed training measurements (composed of
three days). ACF contains a consistent sine wave pattern, which confirms the time series
contains a seasonal effect. Blue dashed lines represent the white noise confidence interval.

The next step focuses on performing the necessary differencing such that we get a stationary
time series. Figure 5.3 shows trend and seasonal effects in the data; therefore we chose one
differencing in the non-seasonal part (d = 1) and one differencing at the lag specified by the
seasonal period (D = 1). We then used the ADF test to determine whether those two
differencings made the time series stationary. For a seasonal series, the lag order of the
ADF test is important to control and must always be equal to or higher than the seasonal
periodicity. [AS16] The values in Table 5.4 confirm that the time series becomes stationary,
with the p-value of the ADF test being 2%.

Augmented Dickey-Fuller
Dickey-Fuller   Lag order   p-value
-3.7813         28          0.02379

Table 5.4: Augmented Dickey-Fuller test results for the transformed data (d = 1 and D = 1).
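As a hedged illustration of this check (the exact tooling used in this work is not reproduced
here), the sketch below applies the d = 1 and D = 1 differencing at lag 28 and runs an ADF
test with the lag order fixed to the seasonal periodicity, assuming y is the log-transformed
training series.

```python
# Minimal sketch: difference the series (d = 1, then D = 1 at lag s = 28) and test stationarity.
import numpy as np
from statsmodels.tsa.stattools import adfuller

diff = np.diff(y, n=1)               # non-seasonal differencing, d = 1
diff = diff[28:] - diff[:-28]        # seasonal differencing at lag 28, D = 1
result = adfuller(diff, maxlag=28, autolag=None)
print("ADF statistic:", result[0], "p-value:", result[1])   # compare with Table 5.4
```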

The ACF and PACF of the differenced logarithm-transformed data are plotted in Figure 5.5
to identify suitable orders for the seasonal ARIMA model. The PACF gives a first insight into
the orders p and P of the AR part; there are a few significant non-zero autocorrelations at
early lags in the PACF, which indicates we could have up to p = 3 for the non-seasonal part,
while it suggests P = 1 for the seasonal part since lag 28 reaches a higher value (even if it
remains below the noise level due to the low number of samples). On the other hand, we
investigated the ACF plot to determine the orders q and Q of the MA part; the only significant
lag is 1, which gives the possibility of q = 1 for the non-seasonal part. Besides, there is an
almost significant lag at 28, just at the level of the noise, which suggests Q = 1 for the
seasonal part.


Figure 5.5: ACF and PACF of the differenced logarithm-transformed data. The plots are
examined in order to suggest appropriate orders for p, q, P and Q.

Hence, the possible combinations of models that can be tried are:

• ACF → q = 0, 1 and Q = 0, 1

• PACF → p = 0, 1, 2, 3 and P = 0, 1

5.5.2 Model estimation

This part involves the estimation of the model parameters φ, Φ, θ and Θ. In this work, the
maximum likelihood method is used to identify the best combination of p, P, q and Q
lags. Another consideration when modeling time series is the principle of parsimony, which
implies selecting the model with the fewest parameters that can adequately describe the
process. Therefore, if two different models fit a series equally well, the model with fewer
parameters should be preferred, because its parameter estimates will be more precise and it
will be less prone to overfitting. [ST18] The model minimizing the AIC of the residuals is
finally selected.

The results of the model estimation are shown in Table 5.5. Since the seasonal ARIMA (1,1,1)×(0,1,1)28
model provides the smallest AIC of the residuals and has few parameters, the
seasonal ARIMA (1,1,1)×(0,1,1)28 is chosen.

p d q P D Q AIC SSE σ2
1 1 1 0 1 1 20.6387 3.8559 0.07009273
1 1 2 0 1 1 22.5192 3.7885 0.06886727
2 1 2 0 1 1 23.2770 3.6808 0.0669096
2 1 1 0 1 1 22.4426 3.7385 0.06795751
2 1 0 1 1 1 25.4968 4.2245 0.07675692
3 1 1 0 1 0 24.4204 4.3197 0.07852457
3 1 1 0 1 1 23.5498 3.5955 0.0653571

Table 5.5: Results of seasonal ARIMA model estimation. The model with the lowest AIC
is selected and highlighted in bold.
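For illustration, a sketch of such an AIC-driven search is given below; it assumes the Python
statsmodels SARIMAX class and a hypothetical log-transformed training array y_train, and
only mirrors the idea behind Table 5.5 rather than the exact procedure used in this work.

```python
# Minimal sketch: fit the candidate orders suggested by the ACF/PACF and keep the lowest AIC.
import itertools
from statsmodels.tsa.statespace.sarimax import SARIMAX

best = None
for p, q, P, Q in itertools.product(range(4), range(2), range(2), range(2)):
    try:
        res = SARIMAX(y_train, order=(p, 1, q),
                      seasonal_order=(P, 1, Q, 28)).fit(disp=False)
    except Exception:
        continue                      # some combinations may fail to converge
    if best is None or res.aic < best[0]:
        best = (res.aic, (p, 1, q), (P, 1, Q, 28))
print("Lowest AIC:", best)
```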

ar1       ma1       sma1
0.5179    -0.8146   -0.3872

Table 5.6: Coefficients of the seasonal ARIMA model (1,1,1)×(0,1,1)28 for the training
measurements, made from 9:30 to 23:30 on December 22, 27 and 29.

Table 5.6 contains the computed coefficients for the specific orders. The seasonal ARIMA
model can finally be written as Equations (5.6) and (5.7), where the first equation corre-
sponds to the generic representation for a seasonal ARIMA model with ar1, ma1, sma1
and s = 28, while the second equation gives the final expression (with the corresponding
coefficients).
$(1 - \phi B)(1 - B)(1 - B^{28})\,y_t = (1 + \theta B)(1 + \Theta B^{28})\,\epsilon_t$    (5.6)

$y_t = 1.5179\,y_{t-1} - 0.5179\,y_{t-2} + y_{t-28} - 1.5179\,y_{t-29} + 0.5179\,y_{t-30}
      + \epsilon_t - 0.8146\,\epsilon_{t-1} - 0.3872\,\epsilon_{t-28} + 0.3145\,\epsilon_{t-29}$    (5.7)

where

• $y_t = \log(x_t)$ and $x_t$ corresponds to the number of unique detected MAC addresses

• $\epsilon_t \sim \mathcal{N}(0,\, 0.07)$

5.5.3 Diagnostic checking

We finally had to validate the developed model with the use of residuals. As already
explained in Chapter 4, residuals correspond to what is left over after fitting the seasonal
ARIMA model and are therefore equal to the difference between the observations and the
corresponding fitted values:
$e_t = y_t - \hat{y}_t$    (5.8)

Residuals are useful in checking whether a model has adequately captured the information
in the data. A good forecasting method will yield residuals with the following properties
[HA14] :

• The residuals are uncorrelated. If there are correlations between residuals, then
there is information left in the residuals which should be used in computing forecasts.

• The residuals have zero mean. If the residuals have a mean other than zero, then
the forecasts are biased.

Besides, if the residuals are normally distributed, it is another insight the selected model
fits well (even if that condition is not necessary). [HA14]

Therefore, we first observed the ACF and PACF of the residuals, plotted in Figure 5.6.
Since all the spikes are within the significance limits, the residuals appear to be white noise
and the proposed model fits the data well. It should still be noted that there is an almost
significant lag in both plots, which could indicate some additional non-seasonal terms need
to be included in the model. However, we saw in the previous stage that increasing p and
q gave larger AIC values and made the model more complex.


Figure 5.6: ACF and PACF of residuals for training measurements made at Grand-Place.
The blue dashed lines represent the white noise confidence interval.

Additionally, we ran the Ljung-Box and Augmented Dickey-Fuller tests. As already
mentioned, the null hypothesis of the Ljung-Box test is that the data are independently
distributed, while the null hypothesis of the Augmented Dickey-Fuller test is that the data
are non-stationary. The results of both tests are shown in Table 5.7.

Ljung-Box
X-squared   df   p-value
27.649      28   0.4831

Augmented Dickey-Fuller
Dickey-Fuller   Lag order   p-value
-3.1142         5           0.03539

Table 5.7: Ljung-Box and ADF test results for the residuals of the seasonal ARIMA
model (1,1,1)×(0,1,1)28. The p-value of the Ljung-Box test shows the null hypothesis cannot
be rejected, although this does not prove the residuals have no remaining autocorrelations.
Besides, the ADF test shows the residuals are stationary for lag order 5, with a p-value of 4%.

Since the p-value of the Ljung-Box test is larger than 0.05, we cannot reject the null hypothesis
of independently distributed residuals, which is already a good sign, even if we cannot explicitly
say the residuals have no remaining autocorrelations. On the other hand, the ADF test rejects
the null hypothesis of non-stationarity with a p-value of 4%. Besides, the mean of the residuals
is ≈ -0.024, which is close to zero.

The Q-Q plot in Figure 5.7 shows that normality of the residuals is probably a reasonably
good approximation. Indeed, if the residuals are normally distributed, the points in the Q-Q
normal plot lie approximately on a straight line. [ST18]

Figure 5.7: Q-Q plot used to check whether the residuals satisfy the (non-necessary) normality
condition. Assuming normality for the residuals seems reasonable, as most of the samples
lie close to a straight line.

We can finally state that the two necessary conditions for a good forecasting model are (almost)
satisfied. In addition, the non-necessary condition of normally distributed residuals is also satisfied.

5.5.4 Prediction for sensor 7

After performing the residual analysis and the diagnostic tests, the ARIMA (1,1,1)×(0,1,1)28
model proposed based on measurements made from 9:30 to 23:30 on December 22, 27 and 29
2017 is correctly specified. Subsequently, the estimated model can be tested on the fourth day,
with the goal of comparing the predicted crowd flows of Saturday, December 30 with the crowd
numbers observed on that day. The plot of the predicted and observed values with a time
interval of 30 minutes is shown in Figure 5.8.

Figure 5.8: Forecast result on Saturday 30/12 from 9:30 to 23:30 at Grand-Place, with a
prediction horizon of 30 minutes: based on the actual sample, we predict the next sample.
The forecast is obtained by applying ARIMA (1,1,1)×(0,1,1)28. Gray and blue zones represent
respectively 95% and 80% confidence intervals.
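As an illustration of how such a one-step-ahead forecast with confidence intervals can be
produced, the sketch below uses the Python statsmodels SARIMAX class; y_train and y_test
are hypothetical log-transformed arrays for the training days and for Saturday 30/12, and this
is not the exact workflow used in this work.

```python
# Minimal sketch: fit on the training days, then produce one-step-ahead predictions
# for the test day without re-estimating the coefficients.
import numpy as np
from statsmodels.tsa.statespace.sarimax import SARIMAX

res = SARIMAX(y_train, order=(1, 1, 1), seasonal_order=(0, 1, 1, 28)).fit(disp=False)
res_all = res.append(y_test, refit=False)           # feed the test day to the fitted model
pred = res_all.get_prediction(start=len(y_train),
                              end=len(y_train) + len(y_test) - 1)
forecast = np.exp(pred.predicted_mean)               # back-transform from the log scale
ci_80 = np.exp(pred.conf_int(alpha=0.20))            # 80% confidence interval
```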

From this figure, we can state that the forecast model globally manages to predict the behavior
of the observed measurements curve (increases and decreases). We are now going to compare
the accuracy of the forecast models as a function of the time intervals mentioned in Table 5.2.
Therefore, we introduce error measures to quantify the differences between the observations
and the predicted numbers of attendees; these accuracy measures are compared to determine
which model gives the smallest errors. They can be easily derived and allow us to assess model
performance, giving us an insight into which model is best suited for forecasting. The error
measures are the MSE, MAE, MAPE and RMSE, whose empirical formulas are given below. [LR12]

$\mathrm{MAPE} = \frac{100\%}{n}\sum_{t=1}^{n}\left|\frac{x_t - \hat{x}_t}{x_t}\right|$    (5.9)

$\mathrm{MSE} = \frac{1}{n}\sum_{t=1}^{n}(x_t - \hat{x}_t)^2$    (5.10)

$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{t=1}^{n}(x_t - \hat{x}_t)^2} = \sqrt{\mathrm{MSE}}$    (5.11)

$\mathrm{MAE} = \frac{1}{n}\sum_{t=1}^{n}\left|x_t - \hat{x}_t\right|$    (5.12)

where

• xt corresponds to the real value

• x̂t refers to the forecasted value

• n is the forecast horizon

For all the error measures, a result of zero would imply a perfect forecast while it is not
possible to obtain a negative value. Besides, MSE places relatively greater penalty on large
forecast errors. Finally, it is also important to notice MAPE measure may not be defined if
xt = 0 for any t. The MAPE scales the errors, that is to say it puts relatively more penalty
to the forecast error if the true value of the observation is small.
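A small sketch of these four error measures, written as Python functions over hypothetical
arrays x (observations) and x_hat (forecasts), is given below.

```python
# Minimal sketch of Equations (5.9)-(5.12).
import numpy as np

def mape(x, x_hat):
    return 100.0 * np.mean(np.abs((x - x_hat) / x))   # undefined if any x equals 0

def mse(x, x_hat):
    return np.mean((x - x_hat) ** 2)

def rmse(x, x_hat):
    return np.sqrt(mse(x, x_hat))

def mae(x, x_hat):
    return np.mean(np.abs(x - x_hat))
```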

We firstly introduced the coefficients for the seasonal ARIMA model identified for each time
interval in Table 5.8. Then the forecasting graphs are displayed in Figures 5.9, 5.10 and
5.11, while the forecast errors are shown in Table 5.9.

Time interval (min)        20                   15                   10
ARIMA (p,d,q)×(P,D,Q)s     (0,1,1)×(0,1,0)42    (1,1,1)×(0,1,1)56    (0,1,1)×(0,1,1)84
Coefficients               ma1 -0.3846          ar1 0.3766           ma1 -0.4883
                                                ma1 -0.7632          sma1 -0.3861
                                                sma1 -0.3131
AIC                        -0.1298              26.7142              9.5458

Table 5.8: Seasonal ARIMA models for the 20/15/10-minute time intervals, based on measurements
made from 9:30 to 23:30 on December 22, 27 and 29 at Grand-Place. For each time interval,
the coefficients and AIC value of the seasonal ARIMA model are listed.

Figure 5.9: Forecast result on Saturday 30/12 from 9:30 to 23:30 at Grand-Place, with a
prediction horizon of 20 minutes: based on the actual sample, we predict the next sample.
The forecast is obtained by applying ARIMA (0,1,1)×(0,1,0)42. Gray and blue zones represent
respectively 95% and 80% confidence intervals.


Figure 5.10: Forecast result on Saturday 30/12 from 9:30 to 23:30 at Grand-Place, with a
prediction horizon of 15 minutes: based on the actual sample, we predict the next sample.
The forecast is obtained by applying ARIMA (1,1,1)×(0,1,1)56. Gray and blue zones represent
respectively 95% and 80% confidence intervals.

Figure 5.11: Forecast result on Saturday 30/12 from 9:30 to 23:30 at Grand-Place, with a
prediction horizon of 10 minutes: based on the actual sample, we predict the next sample.
The forecast is obtained by applying ARIMA (0,1,1)×(0,1,1)84. Gray and blue zones represent
respectively 95% and 80% confidence intervals.

Time interval (min) 30 20 15 10
MAPE (%) 13.7707 16.4736 12.6455 12.4045
MSE 8467.9543 13660.9048 7310.5431 8569.5079
MAE 71.5980 87.3504 68.5171 70.5394
RMSE 92.0215 116.8799 85.50171 92.5716

Table 5.9: Errors between the predicted and observed measurements made on Saturday
30/12 from 9:30 to 23:30 at Grand-Place.

The first insight into the forecast accuracy is given by the MAPE. In 1982, Lewis drew
up a table similar to Table 5.10, [AS16] which contains the ranges of MAPE values and their
interpretation, in order to help judge the accuracy of the developed model.

MAPE (%) Interpretation


< 10 Highly accurate forecasting
10 − 20 Good forecasting
20 − 50 Reasonable forecasting
> 50 Inaccurate forecasting

Table 5.10: A scale of judgment for forecast accuracy. Source [AS16]

Therefore, any forecast with a MAPE value smaller than 10% can be considered highly
accurate, 10–20% is good, 20–50% is reasonable and 50% or more is inaccurate. Since the
observed crowd density varies by a few hundred attendees per hour depending on the period,
a MAPE in the range of 10–20% can be seen as globally acceptable, meaning all the time
intervals are suitable for forecasting. Besides, it can be seen in Figures 5.8, 5.10 and 5.11 that
the predicted flow values fit the observed flows quite well during both peak and off-peak hours,
indicating the good performance of a model developed using only three days as input data.
However, the 20-minute time interval has some difficulty following all the peaks and some raw
measurements do not even fall within the confidence intervals, which suggests it is not the
most appropriate forecast model. Based on the error measures and the plots, it can be concluded
that forecasting with disparate days (predicting Saturday based on the previous Friday, Wednesday
and Friday) still provides adequate results in the context of Plaisirs d'Hiver; however, more
accuracy could be achieved by using consecutive days to predict the number of attendees 24 h
ahead. Analyzing the other error measures leads to the same conclusions.

5.6 Comparison with another sensor

We have trained the seasonal ARIMA model for sensor 7 (Grand-Place) on the previous
Friday, Wednesday and Friday to predict the Saturday, based on the hypothesis that each day
exhibits the same seasonal pattern. However, the use of normal week days as input may not
be appropriate if we want to forecast the number of people during the weekend (Saturday and/or
Sunday), as the crowd density during weekends could be quite different from that of the normal
week days. Therefore, in such cases, taking the previous weekend days as input data could be
considered to better capture the crowd density pattern.

The purpose of this section is then to evaluate if there exists an improvement in prediction
results when considering consecutive week days to predict another week day or, similarly,
taking into account weekend days to forecast another day of the weekend. In both cases,
measurements are extracted from 3:00 to 14:00 at Place de la Bourse (sensor 1) with time
intervals of 30 and 10 minutes. In the first scenario, we use data from Monday 25/12,
Wednesday 27/12 and Thursday 28/12, and test the developed forecast model on Friday
29/12. Secondly, measurements made on Saturday 23/12, Sunday 24/12, Saturday 30/12
are considered as training inputs to predict the number of attendees on Sunday 31/12.

We firstly worked on Scenario 1; we introduce the coefficients for the seasonal ARIMA
model identified for each time interval in Table 5.11. The forecasting graphs are then
displayed in Figure 5.12, while the forecast errors are shown in Table 5.12.

Time interval (min)        30                   10
ARIMA (p,d,q)×(P,D,Q)s     (1,1,1)×(0,1,1)22    (0,1,1)×(1,1,0)66
Coefficients               ar1 -0.4949          ma1 -0.6386
                           ma1 0.2660           sar1 -0.5291
                           sma1 -0.997
AIC                        -41.1968             -82.6407

Table 5.11: Scenario 1 (week days) - Seasonal ARIMA models for time intervals of 30 and 10
minutes, based on measurements made from 3:00 to 14:00 on December 25, 27 and 28 at Place
de la Bourse. For each time interval, the coefficients and AIC value of the seasonal ARIMA
model are listed.


(a) Forecast result from 3:00 to 14:00 on Friday 29/12 at Place de la Bourse with prediction horizon
of 30 minutes


(b) Forecast result from 3:00 to 14:00 on Friday 29/12 at Place de la Bourse with prediction horizon
of 10 minutes

Figure 5.12: Scenario 1 (week days) - Forecast results from 3:00 to 14:00 on Friday 29/12 at
Place de la Bourse, respectively with prediction horizon of 30 minutes and 10 minutes. Fore-
cast models built on measurements made from 3:00 to 14:00 on Monday 25/12, Wednesday
27/12 and Thursday 28/12.

Time interval (min) 30 10
MAPE (%) 16.6457 14.5834
MSE 548.4136 303.3901
MAE 17.0386 12.4719
RMSE 23.4182 17.4181

Table 5.12: Scenario 1 (week days) - Errors between the predicted and observed measure-
ments made on Friday 29/12 from 3:00 to 14:00 at Place de la Bourse.

The next step focused on determining seasonal ARIMA model for Scenario 2. As for the
previous case, we firstly introduce the coefficients for the seasonal ARIMA model identified
for each time interval in Table 5.13. The forecasting graphs are then displayed in Figure
5.13, while the forecast errors are shown in Table 5.14.

Time interval (min)        30                   10
ARIMA (p,d,q)×(P,D,Q)s     (1,1,1)×(0,1,1)22    (0,1,1)×(1,1,0)66
Coefficients               ar1 -0.6362          ma1 -0.2629
                           ma1 0.7838           sar1 -0.5075
                           sma1 -0.997
AIC                        2.6549               10.3267

Table 5.13: Scenario 2 (weekend days) - Seasonal ARIMA models for time intervals of 30
and 10 minutes, based on measurements made from 3:00 to 14:00 on December 23, 24 and
30 at Place de la Bourse. For each time interval, the coefficients and AIC value of the
seasonal ARIMA model are listed.


(a) Forecast result on Sunday 31/12 at Place de la Bourse with prediction horizon of 30 minutes

(b) Forecast result on Sunday 31/12 at Place de la Bourse with prediction horizon of 10 minutes

Figure 5.13: Scenario 2 (weekend days) - Prediction results on Sunday 31/12 at Place de la
Bourse for time intervals of 30 and 10 minutes. Forecast models built on measurements made
from 3:00 to 14:00 on Saturday 23/12, Sunday 24/12 and Saturday 30/12.

Time interval (min)   30         10
MAPE (%)              20.4842    14.1366
MSE                   647.5879   516.01235
MAE                   19.8002    15.5193
RMSE                  25.4478    22.7159

Table 5.14: Scenario 2 (weekend days) - Errors between the predicted and observed measure-
ments made on Sunday 31/12 from 3:00 to 14:00 at Place de la Bourse.

Firstly, we can see that, in both cases (sensors 1 and 7), shorter time intervals give smaller
error measures between the observed and the predicted data. Globally, all MAPE values
are in the range 10-20%, meaning the models are good for forecasting. Those results suggest
there is no improvement in forecasting accuracy when dissociating week days from weekend
days, but in practice nothing can actually be concluded: to derive a seasonal ARIMA model
from a time series, we had to concatenate several days of the week (respectively of the weekend).
Unfortunately, due to the limited amount of measurements, it was only possible to aggregate
days from 3:00 to 14:00, whose hours are not representative of the crowd density: the event is
held from 12:00 to 22:00, except on December 24, 31 and days off, for which the event is held
from 12:00 to 18:00. Those moments are quiet and we were therefore not able to capture enough
significant peaks. On the other hand, we noticed the gathered measurements did not exhibit
the same pattern. Therefore, we decided to smooth the input data by taking the average of the
samples in each time interval, instead of keeping only the sample at the time interval and
dropping the samples in between. To assess this approach, we computed the forecasts both with
and without averaging the input data. We obtained smaller error measures and narrower
confidence intervals when averaging the input data. Both remarks led us to use the average to
improve the forecast performance; unfortunately, the forecast results were still degraded.
Therefore, the experiment should be redone on another edition of the event in order to derive
strong conclusions regarding the assumed relationships between days.
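The two aggregation strategies compared above can be sketched as follows, assuming a
hypothetical pandas DataFrame df indexed by timestamp with a column 'count' of unique
MAC addresses; this is an illustration, not the exact preprocessing code of this work.

```python
# Minimal sketch: keep one sample per interval versus averaging all samples in the interval.
import pandas as pd

kept_samples = df['count'].resample('10min').last()   # drop the samples in between
averaged     = df['count'].resample('10min').mean()   # smooth the input data
```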

5.7 Comments on seasonal ARIMA models

The purpose of time series models is to identify the pattern in past data by decomposing the
seasonal and trend patterns and extrapolating that pattern into the future. Since we visually
distinguished a seasonal pattern in the crowd graphs, with peaks and off-peak periods repeating
at more or less the same time every day, we assumed seasonal ARIMA models could be
particularly relevant to model crowd behavior. Besides, seasonal ARIMA models are used to
predict traffic flow and, in many studies, it is found that seasonal ARIMA performs better than
simple ARIMA. [KV15] Therefore, the purpose of this chapter was to determine whether
seasonality improves the forecast performance in the particular context of crowd movements:
monitoring crowds during major events organized by the city of Brussels.

We first designed a seasonal ARIMA model for sensor 7 (Grand-Place), based on three
training week days, while we tested the developed seasonal ARIMA model on a weekend day.
The results were promising since the forecast models could globally predict the increases and
decreases and had a good accuracy according to the error measures (and particularly the table
drawn up by Lewis). Besides, a prediction horizon of 30 minutes still provides satisfying
forecast results. Afterwards, we decided to develop another model for sensor 1 (Place de la
Bourse) with the aim of determining whether we could achieve better results by training and
testing exclusively on week days (respectively weekend days). Unfortunately, we were not able
to derive strong conclusions since, due to the limited amount of measurements we gathered at
Plaisirs d'Hiver 2017, we had to take into account data which were not representative of the
crowd density.

From all those points, we can see that seasonal ARIMA models suffer from drawbacks. Firstly,
they require larger historical databases for model development than in the ARIMA case,
since they have to take into account the periodicity over several days. As a result, the use
of seasonal ARIMA may be restricted. For instance, in the case of traffic flow prediction,
[KV15] reported:

• Smith, Williams and Oswald used previous 45 days of 15 minutes flow observations
for the next day traffic flow forecasting.

• Williams and Hoel used more than 2 months of traffic volume observations

• Around 60,000 flow observations aggregated for each 3 minutes intervals spanned over
a period of 106 days were used by Stathopoulos and Karlaftis.

• Ghosh, Basu and Mahony used 20 days of 15 minutes flow data with a total of 1920
observations.

• Mai, Ghosh and Wilson used 15 min aggregated traffic volume observations over a
period of 26 days for fitting the seasonal ARIMA based traffic flow prediction model.

• Lippi, Bertini and Frasconi used 4 months of flow data from loop detectors placed
around nine districts of California for model development using seasonal ARIMA.

Even if limited input data (as for sensor 7) already allow us to achieve globally good results,
this shows that collecting a large amount of data is first required to accurately design a seasonal
ARIMA model. In the case of Plaisirs d'Hiver 2017, we performed disparate measurements
from Friday 22/12 to Sunday 31/12. Therefore, it would be relevant to make raw measurements
for at least two consecutive weeks, which could lead to more robust predictions. Consequently,
the seasonal ARIMA model would be appropriate for longer events during which we have
enough days to assess the crowd behavior, like Plaisirs d'Hiver. Furthermore, with additional
data, it would be possible to determine how many previous days are actually necessary to
accurately forecast crowd movements. Another difficulty we identified is the interpretation of
the seasonality. For now, we assumed the data exhibit the same pattern each day. But there
could also exist weekly seasonality, for instance a Monday being similar to the next Monday.
Having enough data to check which seasonality is more appropriate could also help increase
the forecast accuracy.

A second disadvantage is that the time series generally needs to exhibit a strong seasonal
pattern to reach accurate forecast results with seasonal ARIMA models. Indeed, the results for
sensor 1 (Place de la Bourse) were degraded since the peaks of the test day were not similar
to those of the training days, which harms the forecast performance.

6. Feedforward Neural Networks
This chapter focuses on the basics of neural networks and what they really are. Afterwards,
several activation functions are discussed and their differences explained. Then, multiple
optimization methods are laid out and an argumentation is given to explain why a particular
optimizer is chosen as part of this neural network structure. Next, to construct the most
accurate feedforward network, its hyperparameters are configured by doing a grid search.
The outputs of the neural network are then compared with each other as a function of several
structural network changes. At the end, a final result is shown, which provides the most
accurate results of a feedforward network on our data.

6.1 Introduction to neural networks

A neural network is composed of one or more layers consisting of a number of neurons.
These neurons are units which take inputs (s1 , ..., sN ) and produce a decision output based
on their weights (w1 , ..., wN ). According to [NN1], each weight is assigned to one input value
of the neuron. Such a unit can also be called a perceptron, whose basic schematic is given in
Figure 6.1.

1
Figure 6.1: Perceptron structure.

The assigned weights are used to make a selection of inputs more significant than others.
1
[online] Available at: http://neuralnetworksanddeeplearning.com/images/tikz0.png [Accessed 21 May
2018].

The processing unit is as simple as a weighted sum

$\sum_i w_i s_i$    (6.1)

and depending on whether the result of this sum is above a certain threshold, the output
will either be a 1 or a 0. By combining these perceptrons in layers and stacking them,
a network is made.

2
Figure 6.2: Network of perceptrons.

The bias b of a perceptron is defined as

$b \equiv -\mathrm{threshold}$    (6.2)

which allows the perceptron to check whether the result of the weighted sum (Eq. 6.1) is
positive or negative, rather than checking whether it is above the threshold. In other words,
the higher the bias of a neuron, the more likely it will output a 1. The weights and biases of
neurons are quite interesting as they enable the network to adjust those parameters itself
given any input data.
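A minimal sketch of a single perceptron with a bias term, following Equations (6.1) and (6.2),
is shown below; the weights and inputs are illustrative values only.

```python
import numpy as np

def perceptron(s, w, b):
    # Output 1 if the weighted sum plus the bias is positive, 0 otherwise.
    return 1 if np.dot(w, s) + b > 0 else 0

s = np.array([1.0, 0.0, 1.0])          # example inputs
w = np.array([0.5, -0.2, 0.3])         # example weights
print(perceptron(s, w, b=-0.6))        # fires only if the weighted sum exceeds the threshold
```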

6.2 Activation functions

In order to control the learning process better, sigmoid neurons are used, since the sigmoid
function is differentiable. Similar to normal perceptrons, the difference lies in the fact that the
sigmoid neuron has an input range of [0, 1] (the inputs can also take any value in between)
instead of just 0 or 1. The output of the sigmoid neuron is given by

2
[online] Available at: http://neuralnetworksanddeeplearning.com/images/tikz1.png [Accessed 21 May
2018].

$\sigma(z) \equiv \frac{1}{1 + e^{-z}}$    (6.3)

with $z = \sum_i w_i s_i + b$. A depiction of the sigmoid function σ(z) is presented in Figure 6.3.
The output range is thus any value in the interval [0, 1].


Figure 6.3: Sigmoid function

The reason why the sigmoid activation function is preferable to the perceptron is the
smoothness of the function. Small changes in the input produce correspondingly small changes
in the output ([NN1]), expressed as

$\Delta o \approx \sum_j \frac{\partial o}{\partial w_j}\Delta w_j + \frac{\partial o}{\partial b}\Delta b$    (6.4)

whereby o denotes the output of the neuron. A variation of the sigmoid function is the hard
sigmoid function. Figure 6.4 displays a comparison of the sigmoid, hard sigmoid and ultra
fast sigmoid functions. The major difference is that the sigmoid function is not such a simple
formula to compute, and therefore approximations of the sigmoid function are used. The
ultra fast sigmoid is the sigmoid function interpolated linearly between several points. The
accuracy error made by such an approximation is insignificant, as the graph shows.

Figure 6.4: Hard sigmoid function. Source [hardsigmoid]

However, these accuracy errors increase if further approximations are made. The hard
sigmoid uses only two points from where it linearly interpolates and thus generates more
errors. In practice this will not have a great effect as the network will automatically correct
itself when it is more trained. The great benefit of this is that the hard sigmoid function
is very easy and fast to compute so that the training process will be a lot faster compared
to training with sigmoid activations. Other activation functions can be chosen as well, for
example the tanh function.


Figure 6.5: Hyperbolic tangent function

This can actually be seen as a scaled sigmoid function for which the output range is given by
[−1, 1]. Since this function has a rather steep slope in a short range, the output of a network
will not be improved by adding multiple layers with this activation (since flat derivatives
are not optimal to use in a gradient descent algorithm, causing vanishing gradient issues).
Another interesting type is the Rectified Linear Unit (ReLU) function depicted in Figure
6.6. As can be seen, negative inputs are mapped to 0 and positive inputs keep the same value
after passing through the function. Mathematically this means that

$f(z) = \max(0, z)$    (6.5)

as visualized in Figure 6.6.


Figure 6.6: Rectified Linear Unit function

The ReLU function is a very straightforward function but combined with the use of bias
terms in the network it models non-linearities quite well by changing its slope at different
positions [NN2]. However to obtain acceptable results, multiple hidden layers are needed.
Apart from ReLU, another version, the leaky ReLU function is available for usage. The
only difference between this function and the original is that negative values are mapped to
values corresponding with a (small) constant slope instead of 0.

$f(z) = \max(\alpha z, z)$    (6.6)

Expression (6.6) is the same as Definition (6.5), with the difference of introducing a new
parameter α denoting the slope in the negative part of the function.
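The activation functions discussed in this section can be sketched as follows; the hard-sigmoid
breakpoints follow a common piecewise-linear approximation and are an assumption, not
values taken from this work.

```python
# Minimal sketch of the sigmoid, hard sigmoid, tanh, ReLU and leaky ReLU activations.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def hard_sigmoid(z):                      # linear approximation, clipped to [0, 1]
    return np.clip(0.2 * z + 0.5, 0.0, 1.0)

def relu(z):
    return np.maximum(0.0, z)

def leaky_relu(z, alpha=0.01):            # small constant slope for negative inputs
    return np.where(z > 0, z, alpha * z)

z = np.linspace(-10, 10, 5)
print(sigmoid(z), np.tanh(z), relu(z), leaky_relu(z))
```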
Take note that when a neuron unit does not have any input, the weighted sum of its inputs
will be 0 and it can only output the bias term (which will be passed through an activation
function). The first layer of a neural network is called the input layer, while the last layer is
termed output layer. Any layers in between those are called hidden layers as the inputs or
outputs of such layers are not clearly visible. Of course, adding more neurons and hidden
layers will improve the accuracy of the neural network model (if properly trained) but it
will also significantly increase the time needed to train or test the network.

6.3 Optimization methods

Neural networks can be classified based on whether they only forward the outputs of the
current layer to the next, or whether those outputs are fed back to previous layers, thus
introducing loops. Feedforward networks only pass outputs to the next layer, while recurrent
networks contain feedback loops (more details in Chapter 7). A network is called a deep
network when it consists of at least one hidden layer.
To train a neural network, training data is used. This usually consists in taking 60-80%
of the total amount of data. The rest can be used as test data. To visualize the error the
network makes with regard to the real data, a cost (loss) function must be defined.

$C(w, b) \equiv \frac{1}{2n}\sum_x \|y(s) - o\|^2$    (6.7)

As previously defined, w and b are the weights and biases respectively. The output of each
neuron, where s was the input, is given by o, and n is the total number of training points.
The real data is given by y(s). The network output depends on many factors but is denoted
by o for the sake of simplicity. Minimizing this cost function will result in a lower overall error
since the outputs will lie closer to the exact data. One of the ways to minimize it and get the
optimal weight and bias values is to use gradient descent. The gradient operator of a function
C with respect to (parameters) x1 and x2 is defined as

$\nabla C \equiv \left(\frac{\partial C}{\partial x_1}, \frac{\partial C}{\partial x_2}\right)^T$    (6.8)

To optimize the parameters at each iteration, they can be corrected using

$x_i^{k+1} = x_i^k - \eta \nabla C$    (6.9)

where the parameter x_i at iteration k + 1 is updated (corrected) using the value of the
previous iteration k, from which the gradient multiplied by η is subtracted. Basically, we are
moving in the direction opposite to the gradient, eventually leading towards convergence. The
learning rate η should be chosen appropriately since wrong values can lead to worse performance
or even overshooting problems, so that no convergence can be found. The smaller the learning
rate, the smaller the steps taken to optimize the parameter (but we will be sure to reach
convergence), while on the other hand large values induce big gradient steps, which could lead
to fast convergence but also have the potential to jump over the optimal solution and thus
never converge.
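As an illustration of the update rule of Equation (6.9), the sketch below minimizes a toy
quadratic cost; the cost function and the learning rate are assumptions chosen for the example,
not quantities from the network.

```python
# Minimal sketch of gradient descent: x^{k+1} = x^k - eta * grad C(x^k).
import numpy as np

def grad_C(x):                 # gradient of C(x) = ||x - 3||^2 / 2
    return x - 3.0

x = np.array([0.0, 10.0])
eta = 0.1
for _ in range(100):
    x = x - eta * grad_C(x)
print(x)                       # converges towards [3, 3]
```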
To identify the most suitable value for the learning rate, the network can be trained with
different learning rate values. A parameter sweep of the learning rate (increased in an
exponential fashion per iteration) is given in Figure 6.7.

3
Figure 6.7: Learning rate in function of iterations.

Here, each iteration corresponds to one tested value of the learning rate.

4
Figure 6.8: Loss in function of learning rate.

Visualizing the loss as a function of the learning rate for every iteration of the sweep gives a
result like the one in Figure 6.8. After every iteration, the loss function will decrease as errors
are minimized (and, consequently, the learning rate is increased). At first this is indeed the
case, but after a certain iteration the learning rate value becomes too high to even converge.
The optimal learning rate can be deduced by selecting the value that corresponds to the lowest
loss value.
Selecting the learning rate value is of high importance for stochastic gradient descent, but
there also exist other optimizer methods that allow the learning rate to be dynamically
adapted as the network is training.

6.3.1 Configuring the hyperparameters

Several experiments have been conducted in this type of simple feedforward neural network.
A feedforward network can be designed by stacking layers of neurons on top of each other
and ensuring that the connections between layers only go in one direction (towards the
output). A choice of hyperparameter values must first be made, such as:
3
[online] Available at: https://cdn-images-1.medium.com/max/1600/1*zgm3iy7aD4ZsXLiva0xtFg.png
[Accessed 21 May 2018].
4
[online] Available at: https://cdn-images-1.medium.com/max/1600/1*HVj_4LWemjvOWv-
cQO9y9g.png [Accessed 21 May 2018].

• learning rate

• amount of hidden layers

• amount of neurons per layer

• activation function type per layer

• include bias vector per layer

• train/test/validation data ratio

• loss function

• optimization method

For neural networks, the training data that was used consists of the data gathered by
sensors 1, 2, 4, 5 and 7 from days 22 to 29. The test data is the data from these sensors
gathered on day 31 and the validation data from day 30. Regarding optimization techniques,
Stochastic Gradient Descent (SGD) is often used and will be explained in more detail later.
An extension to SGD is the momentum method. To correct parameters, momentum also
takes the value of past updates into account, which helps to speed up the process in every
dimension [ADADELTA]:

$\Delta x_i = \rho \Delta x_{i-1} - \eta g_i$    (6.10)

$x_i^{k+1} = x_i^k + \Delta x_i$    (6.11)

whereby g_i stands for the gradient of parameter x_i and Δx_i is its correction. A constant
ρ is used, which serves as a decay factor to take past updates into account. Another parameter
update technique is the Adagrad method. This method corrects parameters x_i (at iteration
k + 1) by using

$x_i^{k+1} = x_i^k - \frac{\eta}{\sqrt{\sum_{j=1}^{k} g_j^2}}\, g_i$    (6.12)

The Adadelta method is derived from the Adagrad method, but it tackles two problems that
the latter faces: there is a constant decay of the learning rate during training, and a learning
rate has to be chosen. Ideally, we want the network to automatically configure its own learning
rate. Based on this (as presented in [ADADELTA]), the Adadelta method first computes the
parameter gradient, uses a running average (E[·]) of the squared gradient (Eq. 6.13), and
eventually a ratio of the RMS value of the past updates divided by the RMS value of the
gradient to update the parameters:

$E[g^2]_i = \rho E[g^2]_{i-1} + (1 - \rho) g_i^2$    (6.13)

with ρ being the decay constant. Then, we define

$RMS[g]_i = \sqrt{E[g^2]_i + \epsilon}$    (6.14)

with ε being another constant, to produce effects similar to those of Equation (6.12). To update
the parameter, Adadelta uses the following formula:

$x_i^{k+1} = x_i^k - \frac{RMS[\Delta x]_{k-1}}{RMS[g]_k}\, g_k$    (6.15)
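The sketch below applies the Adadelta recursions of Equations (6.13)-(6.15) to a single
parameter of a toy cost; the cost function and the number of iterations are illustrative
assumptions.

```python
# Minimal sketch of the Adadelta update for one parameter.
import numpy as np

rho, eps = 0.95, 1e-6
Eg2, Edx2, x = 0.0, 0.0, 5.0          # running averages and the parameter

def grad(x):                           # gradient of the toy cost (x - 3)^2 / 2
    return x - 3.0

for _ in range(2000):
    g = grad(x)
    Eg2 = rho * Eg2 + (1 - rho) * g ** 2                  # Eq. (6.13)
    dx = -np.sqrt(Edx2 + eps) / np.sqrt(Eg2 + eps) * g    # Eq. (6.15): RMS ratio times gradient
    Edx2 = rho * Edx2 + (1 - rho) * dx ** 2               # running average of the updates
    x += dx
print(x)                               # gradually approaches the minimizer 3.0
```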

Figure 6.9 depicts how well Adadelta performs compared to the other optimization methods.

Figure 6.9: Comparison of different optimizer algorithms based on the MNIST digit classi-
fication for 50 epochs. Source [ADADELTA]

For all the feedforward neural network results, the loss function used is the MAE and the
optimizer is Adadelta. This is due to the fact that this method ensures the learning rate
dynamically adapts itself. The learning rate starts off with a high value but reduces consistently
as the iterations increase. This technique produces acceptable results in practice ([ADADELTA]),
as it utilizes second-order derivative information through the decaying RMS ratio; this is
actually an approximation of the diagonal of the Hessian matrix (which contains second-derivative
information). In the following sections we will make use of some terminology like epoch and
batch size. One epoch denotes that all the training samples have passed through the network
with one forward pass and one backward pass. The batch size is the number of training samples
processed in one such pass. The more epochs the network undergoes, the more trained it will
be and hence the better the accuracy of the model. However, many epochs entail long training
times. The batch size, on the other hand, is a hyperparameter and thus has to be configured
(by doing a grid search).

6.4 Experiments on time series data

6.4.1 Keras

All the neural networks that have been investigated in this thesis make use of the Keras deep
learning Python library 5 . This ensures that networks can be implemented very fast using
high-level commands. It has vast extensions with regards to recurrent neural networks and
is very well documented. The way to build neural networks in Keras is to define a sequential
model and then add several neural layers to it. Before compiling the network structure, the
optimizer has to be configured, and a loss function has to be specified. After that, the model
will be trained with the training and validation data for a number of epochs. When this is
over, the model can be used to predict the data. Based on how accurate the prediction is,
we can evaluate the performance of our model.
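The workflow just described can be sketched in a few lines; the layer sizes and the arrays
X_train, y_train, X_val, y_val and X_test are placeholders, so this is only an outline of the
Keras calls, not the exact model of this work.

```python
# Minimal sketch of the Keras workflow: define, compile, train, predict.
from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import Adadelta

model = Sequential()
model.add(Dense(50, activation='relu', input_dim=15))   # hidden layer
model.add(Dense(5))                                     # output layer
model.compile(optimizer=Adadelta(), loss='mae')

model.fit(X_train, y_train, epochs=70, batch_size=32,
          validation_data=(X_val, y_val))
predictions = model.predict(X_test)
```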

6.4.2 Accuracy

In this section, the measure of how well the output of a neural network fits the data is the
accuracy. The accuracy is measured through the calculation of the R-squared
5
[KERAS] [online]https://keras.io/ [Accessed 21 May 2018]

method:

$R^2 = \frac{\sigma_E^2}{\sigma_T^2}$    (6.16)

In Equation (6.16), σ_E² stands for the explained variance while σ_T² denotes the total variance
[SSPSK13]. The output is a percentage visualizing how close the data is to the model of the
network. The R² score is also called the coefficient of determination.

6.4.3 Results

Before building a neural network structure to forecast data, a grid search is defined to
configure/tune the hyperparameters. This helps identify which model is most suitable for the
specific data we have gathered in this thesis. The accuracy of the model is compared for
different parameters. Several tests were made to find the optimal number of neurons (in the
hidden layer) with respect to different activation functions. This is plotted in Figure 6.10.
Note that the grid search has been made using one hidden layer and by changing the
parameters of that one layer.

Figure 6.10: Accuracy of different activation functions in function of neurons

The first thing to notice is that the hard sigmoid activation function returns an accuracy
of 1 regardless of how many neurons are present in the hidden layer. This seems strange
since it does not follow the behavior of the other activation curves, and therefore more
attention will be given to the other activations (this can be due to the fact that all the
layers have been set to the same activation function). Next, we can notice that the accuracy
is higher with few neurons in the hidden layer for all activation functions except for the
ReLU function. With ReLU, the accuracy increases and then decreases beyond approximately
50 neurons.

6.4.4 Batch size

Larger batch sizes tend to produce better results but have a negative impact on the rate of
convergence. If, for example, the SGD method converges after T iterations, then mini-batch
SGD of batch size b will not converge after T/b iterations but will need more. Thus, a trade-off
has to be made. According to [minisgd], SGD reaches convergence at a rate governed by
Formula (6.17):

$\frac{1}{\sqrt{T}}$    (6.17)

while for mini-batch SGD this is given by Formula (6.18):

$\frac{1}{\sqrt{bT}} + \frac{1}{T}$    (6.18)

We can thus see that the speed of convergence decreases as the batch size b increases. A
grid search applied on different batch sizes revealed that a batch size of 32 presented us
with the best accuracy and a good convergence.

6.4.5 Dropout

As is often the case with neural networks, overfitting can occur due to having too many
parameters. One method to counteract this issue is to drop a percentage of the parameters.
In other words, the impact of several random neurons is discarded in the forward pass and
their corresponding weights are not updated during the backward pass [DRPWGT]. One way
to see this is that, by dropping some neurons, other neurons have to take over their role. This
ensures that the network generalizes better and counteracts the overfitting phenomenon
[DRPWGT]. To make the dropout method more effective, constraining the norm of the weights
has proven to be very beneficial in practice [SHK14]. This means that the weights are limited
to a certain maximum value, also known as max-norm regularization. To see the effects of this
dropout method, a grid search has been performed to identify the curve with the highest
accuracy.

Figure 6.11: Accuracy of different dropout rates corresponding to a weight constraint

In Figure 6.11 the accuracy is given for several dropout percentages constrained by different
weight factors. For our time series data, a dropout rate of 20% with a weight constraint of
4 gives the best accuracy.
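A sketch of a hidden layer using the selected dropout rate and max-norm constraint is given
below; the surrounding layer sizes are placeholders and the exact position of the Dropout
layer is an assumption for illustration.

```python
# Minimal sketch: 20% dropout combined with a max-norm weight constraint of 4.
from keras.models import Sequential
from keras.layers import Dense, Dropout
from keras.constraints import max_norm

model = Sequential()
model.add(Dense(50, activation='relu', input_dim=15,
                kernel_constraint=max_norm(4)))
model.add(Dropout(0.2))            # randomly drop 20% of the units during training
model.add(Dense(5))
model.compile(optimizer='adadelta', loss='mae')
```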

Using this range of hyperparameters, several experiments were run. Figure 6.12 depicts the
output of a simple feedforward neural network consisting of an input layer of 15 neurons, a
hidden layer of 50 neurons and an output layer of 5 neurons (all using a ReLU activation
function). The Adadelta optimizer is used, with the MAE as loss function. It is important to
note that the Adadelta optimizer is initialized with the following values to achieve the best
convergence:

• learning rate η: 1

• decay constant ρ: 0.95

• ε: None

• learning rate decay after every update: 0

Figure 6.12: Feedforward network with a prediction horizon of 10 minutes, with 1 hidden
layer and 50 neurons

At first it looks like the network fits the data accurately, but when we zoom in it is clear that
the network exhibits a lagging behavior. When spikes occur in the data (sudden increases
of visitors), the feedforward neural network is not able to mimic (predict) these steep spikes
and therefore presents errors of up to 20%. However, it is not necessarily unsatisfactory that
the network does not keep up with steep spikes, as these variations are mainly caused by a
difference in beacon time interval signals.

Figure 6.13: Feedforward network with a prediction horizon of 10 minutes, with 1 hidden
layer and 50 neurons, zoomed in

The cost function used is the MAE. Its losses are given in Figure 6.14, where it is visible
that the loss values are unsatisfactory. Of course the losses decrease as there are more
iterations/epochs, but at some point the loss does not decrease any further and stays constant.
The chosen optimizer is Adadelta, due to its better performance compared to SGD and the
other optimization techniques explained previously.

Figure 6.14: Losses of feedforward NN with a prediction horizon of 10 minutes and with 1
hidden layer

Figure 6.14 displays the behavior we anticipated from the theory: the losses decrease as the
epochs increase. From a certain epoch value, the errors remain constant and there is no need
to further optimize the network by increasing the number of iterations. What is to be noted
is that there are small oscillations present in the loss curve with respect to the validation data.

Layer Neurons Parameters


Dense input 15 15
Hidden layer 50 750
Dense output 5 3750
Total amount of parameters 4515

Table 6.1: Network Structure of feedforward network with 1 hidden layer

Table 6.1 describes the feedforward neural network structure corresponding to Figures 6.12,
6.13 and 6.14. Another plot (Fig. 6.15) has been made where the impact of multiple hidden
layers is analyzed. The network structure with 2 and 3 hidden layers is displayed in Tables
6.2 and 6.3.

Layer Neurons Parameters


Dense input 15 15
Hidden layer 50 750
Hidden layer 50 2500
Dense output 5 3750
Total amount of parameters 7015

Table 6.2: Network Structure of feedforward network with 2 hidden layers

Layer Neurons Parameters


Dense input 15 15
Hidden layer 50 750
Hidden layer 50 2500
Hidden layer 50 2500
Dense output 5 3750
Total amount of parameters 9515

Table 6.3: Network Structure of feedforward network with 3 hidden layers

The loss rate of the network with 1 hidden layer reaches higher loss values compared to
the network implemented with 2 hidden layers, as can be seen in Table 6.4. In the case of
the network operating with 3 hidden layers, the losses increase explosively compared to the
other cases. These simulations have been conducted for 70 epochs.

Hidden layers 1 2 3
MAE 36 29 148

Table 6.4: MAE losses of a feedforward neural network for different amount of hidden layers

These tests have been run for the same number of epochs in order to compare the evolution
of the loss functions with regard to each other (Figure 6.15).

Figure 6.15: Comparison of feedforward neural network with a prediction horizon of 10


minutes and with multiple hidden layers

Interestingly, the accuracy of the forecast increases as the number of hidden layers also
increase (from layer 1 to layer 2). However adding too many layers (a third hidden layer)
seems to increase the loss function and hence drastically decrease the accuracy. The impact
of adding more neurons has been analyzed as well. Several network structures have been
made. Figure 6.16 displays the loss function of each structure with a different amount of
neurons in their hidden layer.

Figure 6.16: Losses of FNN with increasing neurons

It is clear that increasing the number of neurons also causes a decrease in the loss function
and thus better forecasts. For a feedforward neural network with 100 neurons in its hidden
layer, the forecast of one day is given in Figure 6.17.

Figure 6.17: FNN forecast with 100 neurons in the hidden layer and a prediction horizon of
10 minutes

Zooming in, we can better see the error the model is making with regards to the real raw
(test) data.

Figure 6.18: FNN forecast with 100 neurons in the hidden layer and a prediction horizon of
10 minutes zoomed in

Looking at the plots and the losses, we can see that this network achieves a smaller loss
compared to the network with fewer neurons (as selected by the grid search). Compared to
Figure 6.13, using 100 neurons in the hidden layer delivers a better result: the lagging behavior
has been diminished (although it is still present to a certain extent).

Number of neurons 50 100


MAE 29 21

Table 6.5: MAE losses of a feedforward neural network with 2 hidden layers, a prediction
horizon of 10 minutes for different amounts of neurons (in the hidden layer)

This effect is made manifest through the MAE loss function in Table 6.5, where different
losses are shown corresponding to a certain amount of neurons in the hidden layers. Two
hidden layers are used with the same number of neurons. We can see that the losses are
lower with an increasing amount of neurons present in the hidden layers.

6.5 Discussion

The results obtained by building a feedforward neural network to forecast time series data are
quite satisfactory when the hyperparameters are tuned correctly. The model can lead to
large errors when those configurations are not correct, and therefore preparing/tuning the
network before training is of utmost importance. With a prediction horizon of 10 minutes,
the first results achieved an MAE value that was higher than in the other simulations, even
though the network was constructed with the configured hyperparameters. Increasing the
number of hidden layers to two helped lower this error. By deviating slightly from the
hyperparameters that the grid search provided and adding a higher number of neurons, we
obtained a better result.

7. Recurrent Neural Networks
Recurrent Neural Networks (RNNs) have proven to be very useful in recent years.
They are commonly used in prediction, forecasting and translation domains (such as speech
recognition [GAH13], music generation [DJ02], sentiment classification [LFC14], earthquake
prediction and weather forecasting). In this chapter, instead of using feedforward neural
networks, we study RNNs, wherein the hidden units form a cycle. They are a good technique
for analyzing sequential data such as financial stocks [HHY11], machine translation [KBC14],
company profits, etc.

As explained in Chapter 3, a time series is a sequence of values x1 , x2 , ..., xt−1 , xt . In
Chapter 4, we dealt with a set of time series, namely the measurements collected by sensors at
different locations during Plaisirs d'Hiver. From the work of Hochreiter and Schmidhuber
[SJ97], we see that plain RNNs are not good at maintaining state over long periods. The main
problem RNNs face is vanishing/exploding gradients, and the widely used solutions are Long
Short-Term Memory (LSTM) and Gated Recurrent Units (GRU). In this chapter, based on the
received measurements, recurrent neural networks with LSTM and GRU are implemented
to maintain important information through a long period. Once that information is used, the
network updates its state. We aim to forecast the number of visitors in the next 10 minutes.
Prediction results are analyzed with respect to different neural network models, followed by
a discussion at the end.

7.1 Recurrent Neural Network Models

Unlike the other neural network models, where the activation flows only from the input layer
to the output layer, RNNs also have connections between the units within a network layer.
Figure 7.1 illustrates the basic structure of RNNs. They receive inputs at each time step and
produce prediction outputs. The cyclic arrow indicates that hidden layers are connected to
each other. RNNs have input-to-hidden connections parametrized by a weight matrix Wax,
hidden-to-hidden connections parametrized by a weight matrix Waa, and hidden-to-output
connections parametrized by a weight matrix Wya.

Figure 7.1: RNNs basic structure

Besides, another representation of RNNs is shown in Figure 7.2: it is the RNN structure
unrolled through time. The weight matrices Wax, Waa and Wya are initialized before
propagation and remain unchanged during one forward propagation, from the first time step
to the last. The forward propagation starts with an initial state a0. Suppose the input and
output sequences are both of size T. At each time step t, the neuron receives input xt as well
as the activation at−1 from the previous step. The loss is then computed based on the predicted
output ŷt and the target yt. The backward propagation uses the sum of the losses from all
time steps to compute gradients with regard to the parameters; it then propagates back from
right to left in Figure 7.2. So we need to iterate through all the time steps from the end back
to the starting point. At each time step, we increment the overall gradients of Wax, Waa and
Wya, and eventually the weight matrices get updated. In this structure, the input and output
sequences have the same length, while it is also possible to have models for which the length
of the output sequence differs from that of the input.

Figure 7.2: RNNs unrolled structure

As mentioned above, there exist other types of RNNs, which are shown in Figure 7.3. The
first type takes only one input at the first time step and then continuously produces a
sequence of outputs; this is a vector-to-sequence network. This type of network can be used
for music generation, where a musical style is given and the RNN generates a new melody
[DJ02].

The second type takes a sequence of inputs but gives only one single output, ignoring the
outputs at the previous time steps; in other words, this is a sequence-to-vector network.
This type of network is used in a variety of circumstances. For instance, we can feed in a
sequence of words corresponding to tweets on Twitter, and the RNN performs a sentiment
classification [LFC14].

Another example is feeding in a sequence of time series data up to the current time, and
letting the network predict data for the future time steps. The third type of RNN is named
many-to-many.

Apart from the one in Figure 7.2, we can also combine a sequence-to-vector and a
vector-to-sequence network. This combination is mostly used for language translation, where
the sequence-to-vector network is considered as an encoder, and the vector-to-sequence
network is viewed as a decoder. When performing machine translation, the encoder encodes a
sentence into a single vector, and the decoder then decodes that vector into another
language [KBC14].


Figure 7.3: Three types of RNNs structure, i.e., (a) One-to-many, (b) Many-to-one, (c)
Many-to-many. T represents the input and/or output sequence(s) length.

7.2 RNNs training process

The training process of a neural network starts with an initialization of the weights and
biases; each epoch then consists of a forward propagation, the computation of the loss, and
a backward propagation.

7.2.1 Initialization

The initialization step is important, as different initialization methods generate different
outcomes for RNNs. As can be seen from Figure 7.4, the total loss J(Θ) is highly non-linear
for a large number of input features and neurons, and it contains several local minima.
Therefore, it is very unlikely to reach the global minimum of the loss surface, and the result
tends to be one of the local minima. The goal is to lead the RNN to the best reachable
minimum by iterative training.

Figure 7.4: An example of cost function in 3D space. Source [DM18]

There are several ways to initialize the weights. The first method is to initialize all weights
to zeros. This is the simplest method, but also the least effective one. When all weights are
initialized to zero, the neurons give the same output at the last layer, so the same gradients
are computed after backward propagation and the same updated weights are generated. This
kind of symmetry is a problem that needs to be avoided when designing RNNs: what we need
are asymmetric values in the weights. The biases, however, can be initialized to zeros. Since
the weights are already asymmetric, the outputs at the last layer are also asymmetric
regardless of the biases, so this does not create any trouble.

The second method is to initialize all weights randomly. By starting with random weights, we
break the network symmetry and let every neuron play its own role. This method is also easy
to apply, but it may create a problem for deep neural networks: vanishing and exploding
gradients (explained later in Section 7.4.1). If the back-propagated gradients get smaller
and smaller, the optimization of the loss function becomes so slow that the training process
can come to a halt. On the contrary, if the propagated gradients are too large, the loss may
oscillate around the minimum or even diverge. In that case, training the RNN becomes
impossible.

Xavier Glorot and Yoshua Bengio proposed another initialization method, named glorot
uniform, which is also the default initializer of Keras [XY10]. With this method, the weights
are uniformly initialized within the interval [−b, b], where b is computed by Equation (7.1)
and ul and ul−1 are the numbers of units at layers l and l − 1.

b = √(6 / (ul + ul−1))    (7.1)
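
As an illustration, the following NumPy sketch samples a weight matrix according to Equation (7.1); the function name and the random seed are our own and are not part of the thesis code.

import numpy as np

def glorot_uniform(n_in, n_out, seed=0):
    """Sample an (n_out, n_in) weight matrix uniformly in [-b, b],
    with b = sqrt(6 / (n_in + n_out)) as in Equation (7.1)."""
    rng = np.random.default_rng(seed)
    b = np.sqrt(6.0 / (n_in + n_out))
    return rng.uniform(low=-b, high=b, size=(n_out, n_in))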

7.2.2 Forward propagation

We will now derive the forward propagation equations for the basic RNN of Figure 7.2. The
hidden layers of RNNs can be considered as RNN cells, because they preserve a state at each
time step. The output of a recurrent neuron at time step t is a function of all the inputs
from the previous time steps. If we denote the state of the hidden layer at time step t
as ht, then from t = 1 to t = τ, it can be written as a function of the previous state ht−1
and the current input xt:
ht = f (ht−1 , xt )    (7.2)

Figure 7.5 shows an example of the RNN cell structure. Here we assume that the activation
function used for the recurrent step is the hyperbolic tangent, and that the activation
function used for the output is a linear function. Before we start the forward propagation,
we assign an initial state a0 to the network. From t = 1 to t = τ, we have the following
equations for each time step t:

ht = Wax xt + Waa at−1 + ba (7.3)

at = tanh(ht ) (7.4)

ôt = Wya at + by (7.5)

ŷt = linear(ôt ) (7.6)

The parameters are the bias vectors ba, by and the weight matrices Wax, Waa and Wya. When
implementing the forward pass, we use a vectorized form of the previous equations over m
samples, with the shapes listed below (a NumPy sketch of this vectorized step is given after
the list).

• Suppose input sequence xt has a dimension of (nx , m), where nx is the number of
input features, m is the number of training samples.

• Suppose previous hidden state at−1 has dimension of (na , m), where na is the number
of neurons in the hidden layer, m is the number of training samples.

• Wax is a matrix with shape (na , nx ), containing weights for the inputs of current time
step t.

• Waa is a matrix with shape (na , na ), containing weights for the hidden state of previous
time step t − 1.

• Suppose ŷt is a matrix with shape (ny , m), where ny is the length of output sequence.
It is the prediction output at current time step t.

• Wya is a matrix with shape (ny , na ), containing weights for the hidden state of current
time step t.

• ba is a bias matrix of shape (na , 1).

• by is a bias matrix relating the hidden state to the output, of shape (ny , 1).

Figure 7.5: A RNN cell structure
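
As announced above, here is a minimal NumPy sketch of one vectorized forward step, following Equations (7.3) to (7.6); the function name is ours and the parameters are assumed to be already initialized.

import numpy as np

def rnn_cell_forward(x_t, a_prev, Wax, Waa, Wya, ba, by):
    """One time step of the basic RNN cell, vectorized over m samples.
    Shapes: x_t (n_x, m), a_prev (n_a, m), Wax (n_a, n_x), Waa (n_a, n_a),
    Wya (n_y, n_a), ba (n_a, 1), by (n_y, 1)."""
    h_t = Wax @ x_t + Waa @ a_prev + ba   # Equation (7.3)
    a_t = np.tanh(h_t)                    # Equation (7.4)
    y_hat = Wya @ a_t + by                # Equations (7.5) and (7.6), linear output
    return a_t, y_hat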

7.2.3 Cost function calculation

After each forward propagation, the RNN produces several predicted outputs. The loss is
computed by a cost function using the current outputs and the targets. During the training
phase, the goal is to minimize the cost function through iterations. Depending on the
circumstances, we have different types of cost functions, such as the mean squared error and
the cross-entropy loss. The cross-entropy loss is mainly used for classification problems,
such as image recognition, where we need to output a zero or a one to indicate whether the
target object is present in the image. In this project, we want to predict the number of
visitors in an event area for the next few minutes, which is better viewed as a regression
problem. The loss function is therefore chosen to be the MAE (mean absolute error). For
t = 1 to t = τ, the loss is computed by:

Ji = (1/m) · Σj=1,...,m ‖yij − ŷij‖    (7.7)

The total loss of the RNN is the sum of the losses over all time steps: J = Σi=1,...,τ Ji. It
describes quantitatively the error between the targets and the RNN outputs. The loss is then
passed back through the hidden layers, as explained in the next section.
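
A minimal sketch of this loss computation, assuming the targets and predictions of one batch are stored as (τ, m) NumPy arrays:

import numpy as np

def total_mae_loss(Y_true, Y_pred):
    """MAE per time step (Equation 7.7), summed over all time steps to give J.
    Y_true, Y_pred: arrays of shape (tau, m), i.e. time steps x samples."""
    per_step = np.mean(np.abs(Y_true - Y_pred), axis=1)  # J_i for each time step i
    return per_step.sum()                                # J = sum over i of J_i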

7.2.4 Backward propagation

The objective of backward propagation is to minimize the loss function while avoiding
overfitting. As the loss function is non-convex and differentiable, by starting at a random
point we can gradually find a local minimum of the total loss by updating the weights and
biases. The main process is as follows:

1. Using the cost function, we compute the gradients with respect to every weight parameter,
   then back-propagate them through the RNN by using the chain rule for derivatives.

2. After the gradients have propagated back to the starting point, we update the weight
   parameters (including the weights and biases) by subtracting the product of the
   corresponding gradients with a learning rate.

The backward propagation through time is done in the opposite direction, from right to left,
as shown in Figure 7.6. At each time step, the gradients with respect to the weight parameters
are computed using the chain rule. We first calculate the partial derivative of the total
loss with respect to the current activation by using Equations (7.3) and (7.4), then
calculate the gradients of the activation with respect to the weight matrices by using
Equations (7.5) and (7.6), and finally we calculate the gradient of the total loss with
respect to the previous activation, so that it can be propagated back and used as an input to
the cell at time step t − 1. Based on the forward pass equations, the backward propagation
equations in an RNN cell are as follows:

∂at/∂Wax = (∂at/∂ht) · (∂ht/∂Wax) = (1 − tanh(ht)²) · xtᵀ    (7.8)

∂at/∂Waa = (∂at/∂ht) · (∂ht/∂Waa) = (1 − tanh(ht)²) · at−1ᵀ    (7.9)

∂at/∂ba = (∂at/∂ht) · (∂ht/∂ba) = Σ (1 − tanh(ht)²)    (7.10)

∂at/∂at−1 = (∂at/∂ht) · (∂ht/∂at−1) = Waaᵀ · (1 − tanh(ht)²)    (7.11)

Figure 7.6: The backward propagation at time step t for a RNN cell

7.3 Optimization Algorithms

In the previous sections, we explained the training process of RNNs. During backward
propagation, the weights and biases are updated using different techniques. We first analyze
commonly used techniques, such as gradient descent and mini-batch gradient descent, and
then introduce two more advanced techniques, Root Mean Square propagation and Adaptive
moment estimation. These optimization algorithms have better performance and are used
in this project for RNN training.

7.3.1 Gradient Descent

Depending on the amount of training data, we can choose between batch gradient descent,
mini-batch gradient descent and stochastic gradient descent to optimize neural networks.
When applying batch gradient descent, we use all the training data (of size m) to update
the weight parameters during one iteration. This process is specified in Equation (7.12),
where α is the learning rate.

w := w − α · ∂J(w)/∂w    (7.12)

After some iterations, the cost function gradually reaches one of the local minima. However,
this process usually takes a long time when the training set is large. What is more, a huge
number of training samples results in a large memory cost, while the available memory is
usually limited. When applying stochastic gradient descent (SGD), the weight parameters are
updated for each training sample, so the total loss decreases much faster, but the cost
function oscillates heavily.

w := w − α · ∂J(w, x(i), y(i))/∂w    (7.13)

The problems mentioned above are depicted in Figure 7.7. If the cost function is
characterized by J(w0, w1), it can be approximated by a quadratic bowl; the ellipses are
horizontal cross-sections of the cost function, where the horizontal and vertical axes
correspond to the weights w0 and w1. If the cost function is of higher dimension, it is still
locally quadratic, and we therefore face the same issues.


Figure 7.7: Gradient descent convergence diagram under two gradient-based algorithms.
Left: batch gradient descent. Right: stochastic gradient descent.

As a matter of fact, mini-batch gradient descent is a compromise between the two and is
widely used in many cases. It updates the weight parameters for every mini-batch of b
(1 ≤ b ≤ m) training samples. It reduces the variance of the parameter updates, which
ensures a good convergence speed and also makes the convergence process more stable.

w := w − α · ∂J(w, x(i), x(i+1), ..., x(i+b), y(i), y(i+1), ..., y(i+b)) / ∂w    (7.14)
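
A sketch of one epoch of mini-batch updates following Equation (7.14); grad_fn is a placeholder for the gradient computation and is not part of the thesis code.

def minibatch_gradient_descent(w, X, Y, grad_fn, alpha=0.01, batch_size=32):
    """One epoch of mini-batch gradient descent (Equation 7.14).
    grad_fn(w, X_batch, Y_batch) is assumed to return dJ/dw for that mini-batch."""
    m = len(X)
    for start in range(0, m, batch_size):
        X_b = X[start:start + batch_size]
        Y_b = Y[start:start + batch_size]
        w = w - alpha * grad_fn(w, X_b, Y_b)   # update on one mini-batch
    return w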

When tuning the speed and performance of gradient descent, the learning rate is a difficult
choice to make, as it strongly affects the model performance. Different learning rates may
give different outcomes, because the total cost of the RNN ends up in different local minima.
To start the process, an initial value of the learning rate is set. If we observe that the
total loss keeps growing or oscillates a lot, we reduce the learning rate. If we observe that
the total loss decreases consistently but does not converge in the end, we have to increase
the learning rate. By adjusting the learning rate step by step, we can reach a good minimal
cost.

However, in most cases, the cost function is characterized by a large set of weights. If we
want the model to approach the minimal cost quickly, it is not enough to simply apply the
mini-batch gradient descent algorithm. Therefore, two more advanced techniques, frequently
used in neural networks, are introduced in the following sections.

7.3.2 RMSprop

Root Mean Square propagation (RMSprop) is nowadays one of the techniques used to achieve a
fast convergence of the cost function. It was proposed by Geoffrey Hinton and Tijmen Tieleman
in 2012 [RMSprop]. As mentioned above, when using gradient descent algorithms, even if the
convergence process makes progress in the horizontal direction towards the minimum, it may
exhibit big oscillations in the vertical direction, so we use RMSprop to speed up the process
in both directions, or at least in the vertical direction. During each iteration, the
gradient with respect to each weight parameter is computed as usual. Here we assume that we
have two weights w0 and w1 with gradients dw0 and dw1, where w0 represents the weight on the
horizontal axis and w1 that on the vertical axis. In addition, two new variables are
computed:

Sdw0 = β · Sdw0 + (1 − β) · dw0²    (7.15)

Sdw1 = β · Sdw1 + (1 − β) · dw1²    (7.16)

The squaring operation works element-wise. Sdw0 and Sdw1 keep a moving average of the
squared gradient for each weight. β is a coefficient set to 0.9 by default [RMSprop].
The weights w0 and w1 are then updated as follows:

w0 = w0 − α · dw0 / √(Sdw0 + ε)    (7.17)

w1 = w1 − α · dw1 / √(Sdw1 + ε)    (7.18)

So instead of subtracting the product of the learning rate with the gradient of a weight
parameter, we first divide the gradient by the square root of Sdw + ε, where ε is a
coefficient with a very small value. It is needed to ensure numerical stability, i.e. to
prevent the denominator from being too close to zero. In this way, we not only reduce the
variations in the vertical direction, but also speed up the movement in the horizontal
direction. The reason is the following: in Figure 7.8, the convergence process represented by
the solid line corresponds to the mini-batch gradient descent algorithm, and the dashed line
represents the RMSprop method. In mini-batch gradient descent, dw1 tends to be larger than
dw0; after applying Equations (7.17) and (7.18), since dw1 is relatively big and dw0 is
small, the scaled update of w1 becomes smaller and that of w0 becomes larger. Consequently,
during the next iteration, we take a smaller step in the vertical direction and a larger step
in the horizontal direction.
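
A sketch of one RMSprop update for a single weight array, following Equations (7.15) to (7.18); the running average S must be kept between iterations and the function name is ours.

import numpy as np

def rmsprop_update(w, dw, S, alpha=0.001, beta=0.9, eps=1e-8):
    """One RMSprop step (Equations 7.15-7.18) for one parameter array w.
    S is the moving average of the squared gradients, initialized to zeros."""
    S = beta * S + (1 - beta) * dw ** 2        # moving average of the squared gradient
    w = w - alpha * dw / np.sqrt(S + eps)      # update scaled by the root mean square
    return w, S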

Figure 7.8: RMSprop convergence diagram. Solid line: convergence process using mini-
batch gradient descent. Dashed line: convergence process using RMSprop algorithm.

7.3.3 Adam

Adaptive moment estimation (Adam) is another optimization technique [KPB14]; its idea is a
combination of momentum and RMSprop. It not only keeps track of an exponentially decaying
average of past gradients, but also of an exponentially decaying average of past squared
gradients. The algorithm works as follows. First, we initialize Sdw, Sdb, Vdw and Vdb.
Then, at each iteration T, the gradients dw and db are computed by mini-batch gradient
descent as usual. Next, we update Vdw and Vdb using the equations below; the principle is
the same as with momentum, where β1 is the decay hyperparameter for the momentum term, set
to 0.9 by default. After that, Sdw and Sdb are also updated as in RMSprop, with a decay
hyperparameter β2, set to 0.999 by default. These values for the decay parameters are the
default settings in Keras.

Vdw = β1 · Vdw + (1 − β1 ) · dw (7.19)

Vdb = β1 · Vdb + (1 − β1 ) · db (7.20)

The next step is to correct the values of Sdw, Sdb, Vdw and Vdb for their initialization
bias, by applying the equations below:

Vdw = Vdw / (1 − β1^T)    (7.21)

Vdb = Vdb / (1 − β1^T)    (7.22)

Sdw = Sdw / (1 − β2^T)    (7.23)

Sdb = Sdb / (1 − β2^T)    (7.24)

At the end of the iteration, the weight parameters are updated by:

w = w − α · Vdw / √(Sdw + ε)    (7.25)

b = b − α · Vdb / √(Sdb + ε)    (7.26)

For the choice of ε, the authors of [KPB14] suggest 10−8; it is a factor that does not affect
the performance of the RNN much, so we do not need to tune this parameter during training.
As for the learning rate α, since Adam is an adaptive learning rate algorithm, it requires
less tuning of the learning rate, which makes it easier to use than gradient descent.
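
A sketch of one Adam step, following Equations (7.19) to (7.26); the moment estimates V and S must be kept between iterations, t is the iteration counter, and both the function and its exact form are ours rather than the thesis implementation.

import numpy as np

def adam_update(w, dw, V, S, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step (Equations 7.19-7.26) for one parameter array w.
    V and S are the running first and second moments (initialized to zeros);
    t is the iteration number, starting at 1."""
    V = beta1 * V + (1 - beta1) * dw              # momentum term, Equation (7.19)
    S = beta2 * S + (1 - beta2) * dw ** 2         # RMSprop-like term
    V_hat = V / (1 - beta1 ** t)                  # bias correction, Equation (7.21)
    S_hat = S / (1 - beta2 ** t)                  # bias correction, Equation (7.23)
    w = w - alpha * V_hat / np.sqrt(S_hat + eps)  # Equation (7.25)
    return w, V, S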

7.4 Long Short-Term Memory

To predict the number of visitors in the next few minutes, a series of inputs is fed into the
RNN, usually data from the past tens of minutes. As the input sequence becomes longer, the
number of input units grows and the unrolled RNN becomes deeper. As a consequence, it may
suffer from the vanishing or exploding gradient problem, and from memory issues. The Long
Short-Term Memory (LSTM) was proposed by Sepp Hochreiter and Jürgen Schmidhuber [SJ97] in
1997, and further studied by many other researchers. LSTM is designed to keep or forget
information over a long period, so as to improve RNN performance.

7.4.1 Difficulties faced by deep RNNs

As explained in Section 7.2.4, when applying backward propagation through the layers, the
propagation direction is from the upper layers to the lower layers. The weight parameters are
updated once the gradients of the loss function with respect to each weight are computed.
However, as we pass the gradients from layer to layer, they get smaller and smaller, or, on
the contrary, larger and larger. In the former case, since the gradients become so tiny, the
weight matrices in the lower layers remain essentially unchanged through the iterations; this
is the vanishing gradient problem. In the latter case, the weight coefficients increase
drastically and the instability causes the RNN model to diverge, so we can never reach a
local minimum.

We can illustrate the vanishing gradient problem by observing logistic activation functions
like the sigmoid and the hyperbolic tangent (see Figures ?? and 6.5): when the input becomes
large in magnitude, the output tends to saturate at zero or one, thereby indicating that the
gradient is close to zero. This small gradient is difficult to propagate back through a deep
RNN. Xavier Glorot and Yoshua Bengio [XY10] studied how activations and gradients vary
across layers during the training phase, as well as non-linearity and saturation. In their
paper, they proposed an efficient way to initialize the neural network so as to achieve
faster convergence.

The second problem faced when using deep RNNs is that they cannot keep information from the
beginning until the end, which means that at the end of the input sequence, a deep RNN may
have lost the memory of the first inputs. This can strongly degrade RNN performance in some
circumstances. For example, when doing sentiment analysis or sentence translation, the
context of the phrase is important: one has to take the words at the beginning into account
to make the final prediction. Unfortunately, a plain RNN gradually forgets what was said at
the beginning, so it may misinterpret the sentence. In this project, crowd number predictions
depend entirely on the data from the past tens of minutes; thus, we need to prolong the
memory of the neural network by using Long Short-Term Memory cells.

7.4.2 LSTM cell structure

The architecture of the cell is shown in Figure 7.9. As can be seen from the structure, at
each time step t there are two network states, at and ct, where ct is the LSTM cell state
and can be viewed as a long-term state. Apart from that, each LSTM cell contains four gates,
which help the cell decide which information should be kept, which should be forgotten and
which should be read. These four gates are the forget gate, the update gate, the input gate
and the output gate. In our case, the input sequence length is fixed, which means the
prediction outputs can depend on all or some of the input data; that is why the LSTM is an
essential part of our RNNs. The long-term state from the previous time step, ct−1, first goes
through the forget gate, in order to drop memories which are not useful in the future. It is
then updated with new memories coming from the update gate and the input gate. Finally, the
new long-term state ct is passed on to the next time step. The hidden state of the previous
time step, at−1, is an input to the four gates. The current hidden state at is the
combination of the hyperbolic tangent transformation of the current cell state and the output
of the output gate.

Figure 7.9: LSTM Cell structure

Γtf = σ(Wf [at−1 , xt ] + bf ) (7.27)

Γtu = σ(Wu [at−1 , xt ] + bu ) (7.28)

gt = tanh(Wc [at−1 , xt ] + bc ) (7.29)

ct = Γtf × ct−1 + Γtu × gt (7.30)

Γto = σ(Wo [at−1 , xt ] + bo ) (7.31)

at = Γto × tanh(ct ) (7.32)

• Wf, Wu, Wc, Wo are the weight matrices of the four gates; they multiply the concatenation
  of the previous hidden state and the current input.

• bf, bu, bc, bo are the biases of the four gates. They are initialized to ones instead of
  zeros, so as to prevent the cell from forgetting everything at the start of training.

• The forget gate Γtf controls which part of the memory should be discarded for the next
  time step. In Equation 7.27, Wf is the weight matrix that manages the behavior of the
  forget gate, and [at−1, xt] is the concatenation of the two matrices. The result is the
  output of a sigmoid (or hard sigmoid) transformation, whose values belong to [0, 1]. This
  value is then multiplied with the previous long-term state. If a value in Γtf is close to
  zero, the corresponding part of the information in ct−1 is dropped; if it is close to one,
  the information is kept and passed on to the next state.

• The update gate Γtu controls which part of the memory needs to be updated in the current
  long-term state. It is computed by using Equation 7.28. Like the forget gate, the update
  gate is the output of a sigmoid or hard sigmoid transformation and contains values in
  [0, 1].

• In Equation 7.29, gt analyzes both the input and the previous hidden state; gt is then used
  to generate the current cell state. This part differs from a plain RNN, where this output
  directly becomes the current hidden state and is used to compute yt.

• The output gate Γto controls which part of the current cell state should be read and output
  at this time step, both for the new hidden state and for yt. Γto is computed by
  Equation 7.31. From Equation 7.32, we see that the current hidden state is the element-wise
  multiplication of the output gate with the hyperbolic tangent transformation of the current
  cell state (a NumPy sketch of Equations 7.27 to 7.32 is given below).

7.5 Experimental results

All the simulations are made using the Keras library, which was introduced in Section 6.4.1.
We downsampled our data to a time resolution of 2 minutes. The input sequence length of the
model is set to 15, which means that 30 minutes of data are used as prediction inputs.
Similarly, an output sequence of length 5 is selected, so the LSTM model produces 10 minutes
of prediction data (a sketch of how the corresponding windows are built is given below).
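
A minimal sketch of the windowing, assuming the downsampled counts of one sensor are stored in a 1-D NumPy array; the function name is ours.

import numpy as np

def make_windows(series, n_in=15, n_out=5):
    """Slice a 1-D series sampled every 2 minutes into (X, Y) pairs:
    n_in = 15 steps (30 minutes) of inputs, n_out = 5 steps (10 minutes) of targets."""
    X, Y = [], []
    for i in range(len(series) - n_in - n_out + 1):
        X.append(series[i:i + n_in])
        Y.append(series[i + n_in:i + n_in + n_out])
    return np.array(X), np.array(Y)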

7.5.1 Training and Testing Data

In this project, the training, validation and test data come from the time series that we
generated in Chapter 4. The details are listed in Table 7.1.

Training set Data from 2017/12/22 to 2017/12/29

Validation set Data from 2017/12/30

Test set Data from 2017/12/31

Table 7.1: Training, validation and test sets

7.5.2 Parameter setting

The hyperparameters that can be tuned in the LSTM model are:

• Learning rate

• Number of hidden layers, number of units per hidden layer

• Batch-size

• Dropout rate

• Number of iterations

• Activation functions used for LSTM and output layer

• Optimizers

To simplify the process, the learning rate is set to 0.001, which is a sufficient value in
this case for the RMSprop and Adam optimizers (see Sections 7.3.2 and 7.3.3). The batch size
is set to 128. This parameter affects the run time of the training process: a smaller batch
size means that a longer training time is needed. Dropout is applied in the model to prevent
it from overfitting (as explained in Section 6.4.5); the dropout rate is set to 0.5. The
activation functions for the LSTM layer are the hyperbolic tangent and the hard sigmoid,
whereas the output layer uses a simple linear activation. The number of iterations is set so
that the training loss converges at the end of training, which means we stop the training
process when the training loss saturates at a certain level.

The next decision concerns the architecture of the RNN. The RNN is composed of an input
layer, one or several hidden layer(s) and an output layer. The number of hidden layers
depends on the complexity of the learning task.

If more hidden layers and neurons are chosen, we can analyze a more complex data system.
However, we cannot put too many neurons in each layer, because this leads to overfitting
problems and a more difficult training procedure. As mentioned before, overfitting happens
when the system no longer pays attention to the basic relationships in the data, but focuses
on its noise pattern. As a consequence, overfitting causes an increase of the loss. There are
two main solutions to prevent overfitting. The first one is to reduce the number of neurons,
so as to reduce the network complexity. In this way, the minimum loss is easily reached, but,
because some information is lost, the minimum is not as low as we would expect, and the
overall performance is degraded. The second solution is regularization, such as L2
regularization and dropout [SHK14]. Regularization techniques usually give better performance
than diminishing the network size.

If fewer hidden layers and neurons are chosen, we may face the opposite problem, which is
underfitting. In that case, the model cannot figure out the underlying structure of the input
data. This indicates that more model parameters should be added, or a more powerful model
selected, so as to analyze the input data properly.

7.5.3 Experiments of different LSTM network architectures

The relationship between the input sequence length, the number of hidden layer neurons and
the output sequence length has been studied by some researchers, but no definitive conclusion
has been reached. The decision has to be made based on the model itself. Thus, we experiment
with models having one and two hidden layers.

7.5.3.1 LSTM network with one hidden layer

We will now experiment with models having one hidden layer. The idea is to use a grid search
to explore a list of parameter combinations and then select the most appropriate one. We fix
the number of hidden layers to one, and compare the model performance for a list of numbers
of neurons; the result is shown in Figure 7.10.
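
A sketch of such a grid search loop; build_lstm_model is defined here for illustration, and the arrays X_train, Y_train, X_val and Y_val are placeholders for the windows built from the Chapter 4 data.

from keras.models import Sequential
from keras.layers import LSTM, Dense

def build_lstm_model(n_neurons):
    """One hidden LSTM layer with n_neurons units, 15-step input, 5-step output."""
    model = Sequential()
    model.add(LSTM(n_neurons, input_shape=(15, 1)))
    model.add(Dense(5, activation='linear'))
    model.compile(loss='mae', optimizer='rmsprop')
    return model

results = {}
for n_neurons in [10, 15, 20, 40, 60, 80, 100, 120, 140, 160]:
    model = build_lstm_model(n_neurons)
    history = model.fit(X_train, Y_train, epochs=200, batch_size=128,
                        validation_data=(X_val, Y_val), verbose=0)
    results[n_neurons] = min(history.history['val_loss'])  # best validation MAE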

[Plot: validation loss versus training epoch for one-hidden-layer models with 10, 15, 20, 40,
60, 80, 100, 120, 140 and 160 neurons.]

Figure 7.10: Validation loss of different models with one hidden layer

As can be seen from Figure 7.10, when the number of neurons is relatively small, the
validation loss is high; this is due to the model underfitting. In this situation, the model
performance can be improved by increasing the capacity of the model, such as the number of
neurons per hidden layer or the number of hidden layers. By increasing the number of neurons,
more information can be captured and the network gradually learns more interconnections
between inputs and outputs. In addition, we found that the validation error is smaller when
the network size becomes larger, and that the convergence is faster when the number of
neurons increases, which means fewer epochs are needed to train the network.

The mean absolute error is calculated for each model on the validation set, and the results
are shown in Table 7.2. Here we display only the models that performed better than the rest.

Number of neurons 80 100 120 140 160

MAE on validation set 39.6799 39.7203 39.0102 38.8084 38.6526

Table 7.2: Result of LSTM model structure with one hidden layer

7.5.3.2 LSTM network with two hidden layers

In the second experiment, the idea stays the same: the grid search method is applied, a list
of parameter combinations is selected, and the optimal one(s) is identified at the end. The
number of hidden layers is set to two, and a list of numbers of neurons for the two hidden
layers is tested. The result is shown in Figure 7.11.


Figure 7.11: Validation loss of different models with 2 hidden layers

In the figure, the x and y axes are the numbers of neurons in the first and second hidden
layers, while the z axis is the minimum validation loss achieved by the model. It is clear
that when the numbers of neurons in both hidden layers are relatively small, the final
validation cost is high and the model performance is poor: it cannot analyze the current
training data well, and the outcome it generates is naturally unsatisfying.

When the number of neurons per hidden layer gradually increases, the validation cost tends
towards its minimum. In the 3D plot, yellow indicates the lowest validation loss. Most of the
yellow squares are located at higher numbers of neurons in the two hidden layers. It can also
be noticed that, for a fixed total number of neurons, when the number of neurons in the first
hidden layer (n1) is smaller than that of the second hidden layer (n2), the performance is
better than in the opposite situation. This is illustrated by Figure 7.12, where the total
number of neurons is fixed to 120. For the three different combinations, solid lines are the
cases with n1 < n2, while dashed lines are the cases with n1 > n2. We observe that the RNN
learns faster when n1 < n2, so from here on we focus on RNNs with two hidden layers whose
structure satisfies n1 < n2.

[Plot: validation loss versus training epoch for the hidden layer 1/2 neuron combinations
20/100, 100/20, 30/90, 90/30, 40/80 and 80/40.]

Figure 7.12: Validation loss of different models with total number of neurons 120. Three
scenarios are simulated, where solid lines are for the cases n1 < n2 , dashed lines are for the
opposite cases.

Another observation is made by fixing the ratio of the numbers of neurons between the two
hidden layers. The result is shown in Figure 7.13. When the number of neurons is small, the
model learns slowly, whereas when the number of neurons reaches a high value (above 40 in the
first hidden layer), the errors on the training and validation sets go down and saturate at a
certain level; the model learns faster and produces better results.

Based on these observations and the MAE values, the four best models are selected; their MAE
values on the validation set are listed in Table 7.3.

Model (15,90,100,5) (15,60,80,5) (15,50,90,5) (15,90,90,5)

MAE on validation set 36.5884 36.6864 36.8564 36.8726

Table 7.3: Result of model structure with two hidden layers

[Plot: validation loss versus training epoch for the hidden layer 1/2 neuron combinations
20/40, 30/60, 40/80 and 50/100.]

Figure 7.13: Validation loss of different models with a fixed ratio. The ratio between the
number of neurons in the first hidden layer and that in the second hidden layer is 0.5.

7.5.3.3 Model selection

When comparing the results of models with two hidden layers to those with one hidden layer,
it is clear that models with two hidden layers perform better. Usually, as we want to extract
high-level information from raw measurements, we need more intermediate layers to perform
different tasks; for example, the first hidden layer mainly extracts the connections with
past values, while the second hidden layer focuses on the noise pattern that should be added
to future data. If the network contains only one hidden layer, it is hard to generalize all
useful information at once, so to achieve results as good as with multiple layers we would
have to use many more neurons. In fact, a model with more hidden layers can learn to fit more
complex functions or features with fewer neurons; from now on, we therefore select model
(15, 50, 90, 5) to generate forecast results.

The following results are the forecasts made by model (15, 50, 90, 5), where 15 is the input
data length, meaning input data from the past 30 minutes, and the 5 outputs correspond to
prediction horizons of 2, 4, 6, 8 and 10 minutes. Table 7.4 summarizes the chosen model.

Layer Neurons Parameters
Dense input 15 15
Hidden LSTM 50 10400
Hidden LSTM 90 50760
Dense output 5 455

Total amount of parameters 61615

Table 7.4: The architecture of the proposed LSTM Network
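
A minimal Keras sketch of this architecture, assuming a univariate input window of 15 time steps; the dropout, loss and optimizer settings follow Section 7.5.2, and the exact wiring of the thesis implementation may differ slightly.

from keras.models import Sequential
from keras.layers import LSTM, Dense, Dropout

model = Sequential()
model.add(LSTM(50, return_sequences=True, input_shape=(15, 1)))  # first hidden LSTM layer
model.add(Dropout(0.5))
model.add(LSTM(90))                                              # second hidden LSTM layer
model.add(Dropout(0.5))
model.add(Dense(5, activation='linear'))                         # 2, 4, 6, 8 and 10 minute outputs
model.compile(loss='mae', optimizer='rmsprop')
model.summary()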

7.5.4 Weights and biases initialization

The initialization of the model weights and biases is essential; it is the starting point of
the training process. Based on the model that we selected in Section 7.5.3.3, we analyzed the
impact of different initialization methods.

As explained in Section 7.2, by using the default initialization method of Keras, glorot
uniform, the weights are initialized neither too big nor too small, which keeps information
flowing during the forward and backward propagation. Table 7.5 lists the results obtained
with different initialization methods. The glorot uniform initialization gives the best
results, while initializing to zeros or ones is the least effective (a sketch of how the
initializer can be selected in Keras is given after Table 7.5).

Initialization method MAE on validation set

Glorot uniform 36.8303


Random uniform 37.9785
Random normal 38.4420
Zeros 41.2312
Ones 42.8774

Table 7.5: Results of proposed LSTM model using different initialization methods
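
As an illustration, the variants of Table 7.5 can be selected in Keras through the kernel_initializer argument; the loop below is only a sketch, 'glorot_uniform' being the Keras default for the kernel weights.

from keras.layers import LSTM

for init in ['glorot_uniform', 'random_uniform', 'random_normal', 'zeros', 'ones']:
    layer = LSTM(50, kernel_initializer=init, input_shape=(15, 1))  # one Table 7.5 variant per run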

7.5.5 Optimizers

The LSTM model performance can be affected by the choice of optimizer. We now analyze the
impact of different optimizers on our model. The optimizers that we experimented with include
mini-batch gradient descent, RMSprop and Adam. Figure 7.14 illustrates the results obtained
using the proposed model. The learning rate for mini-batch gradient descent is 0.005, while
for RMSprop and Adam it is 0.001. As can be seen, although the learning rate of gradient
descent is five times higher than those of the other two techniques, it has the slowest
convergence speed; RMSprop and Adam are much faster than mini-batch gradient descent. As
explained in the previous sections (see Sections 7.3.2 and 7.3.3), they are designed to speed
up the approach to a local minimum. Since there is no big difference between the results of
the RMSprop and Adam optimizers, we chose RMSprop for the final predictions.


Figure 7.14: The training result of our model under three optimization algorithms, i.e., (a)
RMSprop, (b) Mini-batch gradient descent, (c) Adam, with respect to epochs.

7.5.6 Prediction results

We will now evaluate, quantitatively and qualitatively, the forecasts produced by the
proposed model. The mean absolute errors for different prediction horizons are listed in
Table 7.6. For a small prediction horizon, the error is relatively small, because what will
happen in the next minutes is strongly linked to the previous minutes, and this relationship
is not difficult to learn.

However, in reality, in order to prevent crowd turbulence from happening, the forecast
horizon has to be longer, so as to provide enough time for the security staff to take
measures (for instance, evacuating people or opening more exit doors). Therefore, we aim to
realize a 10-minute prediction.

As can be seen from Table 7.6, the error of the 10-minute forecast is higher than that of the
2-minute forecast. This is because our model is causal: the value at time step t only depends
on values from previous time steps, not on the future. Since the forecast period is longer,
the link between future values and past values is weaker, so it becomes more difficult to
learn a precise relation. Consequently, the error accumulates over time, which means that
errors made on shorter-term predictions are part of the errors on longer-term predictions. As
a matter of fact, to minimize the error over a long forecast period, we first have to
minimize the error over short prediction periods; this is why we select the model with 2
hidden layers.

Prediction horizon (min) 2 4 6 8 10

MAE on test set 23.2202 26.4487 27.9255 29.2444 30.3003

Table 7.6: Results of different prediction horizons of proposed LSTM model on December
31

Figure 7.15 depicts the final results on the test set for several sensors. The prediction
horizon is 10 minutes. The red solid line is the true data gathered by the different sensors,
the blue solid line is the prediction produced by the LSTM network, and the yellow solid line
is the absolute error between the real and predicted values. Figures 7.16 and 7.17 are zoomed
versions: we chose the results of sensor 1 and sensor 2 as representatives, to better
illustrate the difference between the predicted and real values. As can be observed in all
the plots, during peak hours the oscillation of the crowd number is bigger, which makes its
variation harder to predict, and thus the prediction error is bigger. During off-peak hours,
such as early in the morning, when the number of visitors is small, it is easier to learn the
variation pattern.


Figure 7.15: Full prediction results of the LSTM model on December 31 for all sensors. The
prediction horizon is 10 minutes. The absolute error between the predicted and target values
is shown in yellow.


Figure 7.16: Prediction results of the LSTM model on December 31 at Place de la Bourse. The
prediction horizon is 10 minutes. The absolute error between the predicted and target values
is shown in yellow.


Figure 7.17: Prediction results of the LSTM model on December 31 at Place de la Monnaie. The
prediction horizon is 10 minutes. The absolute error between the predicted and target values
is shown in yellow.

The model is trained and validated using a mixed dataset from all sensors, but our figures
show that different sensor areas have different time series patterns. For example, during
peak hours, the numbers of detected MAC addresses at Grand-Place and Marché aux Poissons are
higher than those of the other event areas. Based on this fact, we computed the errors for a
prediction horizon of 10 minutes for some areas; Table 7.7 shows the results. The errors of
sensor 5 (Marché aux Poissons) and sensor 7 (Grand-Place) are considerably higher than the
average level, which indicates that we need more training for those areas, or a more powerful
model to train them individually.

Event area MAE on test set

Place de la Bourse 24.2764


Place de la Monnaie 24.4343
Marché aux Poissons 42.4466
Grand-Place 46.5589
Average MAE 34.4290

Table 7.7: MAE of different Plaisirs d’Hiver event areas. The MAE of different areas are
computed based on a prediction horizon of 10 minutes.

7.5.7 Impact of input size

In the proposed LSTM model, the input length is set to 15, so the amount of information that
can be learned from by the LSTM network is fixed to 30 minutes. During prediction, the inputs
act like a sliding window for the system: the forecast of the next few minutes is entirely
based on the data inside the window. In addition, as illustrated in Figure 7.18, for online
machine learning, as time goes by, newly arriving input data is used to update the model for
future predictions. If the input sequence length is too short, the prediction error is high
and a long training time is needed to obtain improvements. If the input sequence length is
too long, the prediction error may be small at the beginning, but more time is required to
update the model when new data becomes available.

Figure 7.18: Online machine learning with a sliding window for time series forecasts

An experiment is made by prolonging and shortening the input sequence length, which means
that we provide more or less information to the system, and we observe the influence on the
prediction error.

Input size 5 10 15 20

MAE on test set 41.8950 39.5024 36.8303 36.7670


Input size 25 30 35 40

MAE on test set 35.2126 36.5009 36.0057 35.5249

Table 7.8: Results of LSTM models of different input sizes. The MAE is computed based
on a prediction horizon of 10 minutes.

As shown in Table 7.8, the input size has an influence on the output performance. The value
in bold corresponds to the model that we selected; the LSTM model is generated based on this
input length. When the input sequence is shortened, the mean absolute error is relatively
higher: with less information provided, our model cannot learn as much as before. When the
input sequence is prolonged, more information is provided to the system, but the final
forecast error is not reduced much. Therefore, for the LSTM model that we selected, a longer
input sequence is not very helpful but increases the model training time.

7.5.8 Comments on LSTM model

The objective of this chapter is to experiment with different RNNs with LSTM cells, and to
analyze the forecasting results with respect to different factors. The main purpose is to
evaluate whether we can apply RNN models to Plaisirs d’Hiver in Brussels and provide
effective prediction results. The event usually lasts for five weeks. The idea in this
project is to use measurements from a certain period of the event for training and then to
evaluate the performance on a test set, which is an offline machine learning process. By
experimenting with models with different parameters, we found an LSTM model that generates
good forecasting results, which shows that RNNs can be used to forecast crowd densities at
events like Plaisirs d’Hiver.

In the future, we could apply this method in an online machine learning setting. Data from
previous years can be used as training and validation sets to build the LSTM model. During
the event, on the one hand, we use newly arriving data to update the model; on the other
hand, we make predictions to help organizers monitor crowd numbers and movements. We could
also adjust our model by leveraging new features. In this project, we consider all sensor
measurements as independent, but in reality, as visitors walk between different areas, the
data from one sensor influences the data of another sensor. This influence can be considered
as an additional feature of the neural network. Besides, we can take special activities into
consideration; for instance, there are regular lighting shows and parades at Plaisirs
d’Hiver, during which we can foresee an increase of the crowd number in that area. The
timetable of activities can therefore be added as a feature. Apart from that, we can take
weekdays and weekends, holidays and working days into account to increase the training
accuracy.

7.6 Gated Recurrent Units

Neural networks with Gated Recurrent Units (GRU) are a type of RNN that mitigates the effects
of the vanishing gradient problem. Although they are similar to LSTM (Long Short-Term
Memory), there are important differences. Networks featuring GRUs are often used in the field
of machine translation and are very promising. However, they can of course be applied to
other types of data, such as the one presented here. This is due to the fact that an RNN is
able to memorize certain relevant parts of the data and to adapt its previously assigned
weights (between neuron connections) accordingly. As stated before, Recurrent Neural Networks
are used to achieve better results when predicting sequential data. Figure 7.19 displays a
basic schematic of a Long Short-Term Memory cell, while Figure 7.20 depicts the differences
of a Gated Recurrent Unit [GRULSTM].

Figure 7.19: LSTM. Source [GRULSTM]

Figure 7.20: GRU. Source [GRULSTM]

In Figure 7.19, i is the input gate, f the forget gate and o the output gate. The memory cell
is denoted by c and the new memory cell content by c̃. The GRU structure in Figure 7.20 has a
reset gate r and an update gate z; h stands for the activation and h̃ for the candidate
activation.

The activation h of the GRU block is given by

hjt = (1 − ztj )hjt−1 + ztj h̃jt (7.33)

As can be seen from Equation 7.33, hjt is a function of hjt−1 (denoting the activation of
the previous GRU block) and h̃jt (candidate activation). Depending on the update gate,
the activation will be more influenced by the previous activation than by the candidate
activation (or vice versa). The update gate ztj is given by

ztj = σ(Wz xt + Uz ht−1 )j (7.34)

with Wz being the weight matrix of the inputs xt and Uz the weight matrix associated with the
previous hidden states ht−1. In stark contrast with an LSTM memory cell, a GRU memory cell
does not have any control over how much of its memory content it exposes. Moreover, the
candidate activation is given by

h̃t = tanh(W xt + U (rt · ht−1 ))j (7.35)

and rt is a vector containing the reset gates. The computation of the reset gate is analogous
to Equation 7.34 (but with different weights) and is thus:

rtj = σ(Wr xt + Ur ht−1 )j (7.36)

The reset gate, in a way, resets what was previously processed by the preceding block. The
important difference with a GRU is that the full content of the memory is used, rather than
only a selected part as in the LSTM case. With LSTM there is a clear distinction between
adding new data to its memory cells and the forget gate. This is not the case with a GRU: the
previous activation and the candidate activation are computed closely together through the
update gate. According to multiple sources [GRULSTM], RNNs equipped with GRU blocks have a
slightly better performance than RNNs with LSTM implemented, and one of the major benefits is
that GRU-RNNs have a lower computational complexity: LSTM makes use of three gates, while
there are only two gates in the GRU case. Reducing this even further to a single gate can
produce comparably good results (although usually below the performance of LSTM and GRU);
this is called the Minimal Gated Unit (MGU) RNN [GRULSTM].
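
A NumPy sketch of one GRU step, following Equations (7.33) to (7.36); the biases are omitted for brevity and the function name is ours.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_cell_forward(x_t, h_prev, Wz, Uz, Wr, Ur, W, U):
    """One GRU time step (Equations 7.33-7.36), biases omitted."""
    z_t = sigmoid(Wz @ x_t + Uz @ h_prev)            # update gate, Equation (7.34)
    r_t = sigmoid(Wr @ x_t + Ur @ h_prev)            # reset gate, Equation (7.36)
    h_cand = np.tanh(W @ x_t + U @ (r_t * h_prev))   # candidate activation, Equation (7.35)
    h_t = (1 - z_t) * h_prev + z_t * h_cand          # new activation, Equation (7.33)
    return h_t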

7.7 Variants of GRU

By slightly changing the formula for the update and reset gates, ztj and rtj respectively (given
by Equations 7.34 and 7.36), several variants of a GRU can be built [GRUVAR].

• GRU1 is the first variant, for which only the activation from the previous block and a
  bias term are used. The update and reset gates are given by

ztj = σ(Uz ht−1 + bz )j    (7.37)

rtj = σ(Ur ht−1 + br )j    (7.38)

  The test results (Figure 7.21) show that the accuracy of GRU1 is not considerably lower
  than that of the original GRU version, despite having fewer parameters in the network.

• GRU2 is defined exactly as GRU1 but with the bias terms removed from the equations. This
  variant has fewer parameters to optimize than GRU1.

ztj = σ(Uz ht−1 )j (7.39)

rtj = σ(Ur ht−1 )j (7.40)

• GRU3 is the third variant, which only keeps the bias terms and returns simple results.

ztj = σ(bz )j    (7.41)

rtj = σ(br )j    (7.42)

  It goes without saying that even fewer parameters need to be optimized in this version
  compared to GRU2. GRU3 has the lowest performance, but considering its simplicity and low
  number of parameters it still gives a satisfying result.

The application of these methods to the MNIST database is depicted in the following figure,
where the behavior of GRU1, GRU2 and GRU3 is compared with the original GRU. The accuracy of
the model for each variant is given as a function of the epochs, with a constant learning
rate of 0.001. In this particular environment, a dropout of 20% is used and the RMSprop
optimizer method is applied. The implemented cost function is the categorical cross-entropy,
and only ReLU activation functions are used.

Figure 7.21: GRU variant models made based on MNIST row-wise generated sequences
having a batch size of 32 and 100 hidden units. Source [GRUVAR]

The GRU variants are based on the fact that there is redundancy in the network, in the sense
of redirecting information about its state and other signals. From these plots we can
conclude that the three GRU variants perform almost as well as the original GRU network, and
this at a smaller computational cost. The experiments and results obtained with our time
series data are limited to the original GRU network and do not include the variants.

7.8 Results with GRU Networks

To determine the optimal number of neurons in the GRU layers and which activation function
should be used, a grid search has been made. The grid search in Figure 7.22 computes the
accuracy of each model for every parameter value in the grid. The model that has been
analyzed has two hidden GRU layers.

Figure 7.22: Grid search to compare increasing neurons with different activations

We can see that the best result is delivered by a ReLU activation function with 75 neurons in
each of the hidden layers. As more neurons (past 75) are added to the GRU layers, the
accuracy of the neural network decreases regardless of which activation function is used;
before this number, the accuracy increases more or less steadily. When we analyze the
different activation functions, we notice that the curve of the ReLU activation function is
always above the other curves. This means that the best activation function for our data is
the ReLU.

A GRU network has been implemented. It consists of a dense input layer of 15 neurons, 2
hidden GRU layers of 50 neurons each and an output layer of 5 neurons (see Table 7.10 and the
following tables for more details). The bias terms have not been set to zero and are thus
used in all the GRU layers. This neural network makes use of the MAE loss function and of the
Adadelta optimizer. The optimizer uses the same initializations as mentioned for the
Feedforward Neural Networks, namely: learning rate η set to 1, decay constant ρ set to 0.95,
ε set to None, and learning rate decay after every update set to 0. Furthermore, in all the
units we have made use of the ReLU activation function, as identified by the grid search; in
earlier sections it was also explained that this is one of the most used activation
functions. A minimal Keras sketch of such a network is given below.
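
The sketch assumes a univariate input window of 15 time steps and the Keras 2.x API; the exact layer wiring of the thesis implementation (in particular its dense input layer) may differ.

from keras.models import Sequential
from keras.layers import GRU, Dense
from keras.optimizers import Adadelta

model = Sequential()
model.add(GRU(50, activation='relu', return_sequences=True, input_shape=(15, 1)))
model.add(GRU(50, activation='relu'))
model.add(Dense(5))                                   # 10-minute prediction horizon
model.compile(loss='mae', optimizer=Adadelta(lr=1.0, rho=0.95))
model.summary()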

Figure 7.23: GRU Network with 2 hidden layers, each containing 50 neurons and a prediction
horizon of 10 minutes

Figure 7.24: GRU Network with 2 hidden layers, each containing 50 neurons and a prediction
horizon of 10 minutes, zoomed in

Figure 7.25: Losses of GRU Network with 2 hidden layers, each containing 50 neurons

The losses on the validation dataset seem to fluctuate more than for the other types of
network. Another simulation has been made with these exact configurations but with an
increased number of neurons in the GRU layers. The losses per epoch are plotted against each
other (Figure 7.26).

Figure 7.26: Losses of GRU Network with 2 hidden layers, each containing 10, 50 or 100
neurons

We can see that there is not much difference between the loss curves when the number of
neurons is too small (i.e., the 10- and 50-neuron curves). The losses decrease significantly
when the number of neurons reaches 100.

Number of neurons 10 50 100


MAE 76 74 29

Table 7.9: MAE losses of a GRU network for different amount of neurons in its 2 hidden
layers and a prediction horizon of 10 minutes

The losses are displayed in Table 7.9 for several numbers of neurons. Below 100 neurons in
the hidden layers, the losses are unacceptably high in practice.

Layer Neurons Parameters


Dense input 15 15
Hidden GRU 10 1830
Hidden GRU 10 630
Dense output 5 50
Total amount of parameters 2525

Table 7.10: Network Structure of GRU Network with 10 neurons in the hidden layers

Layer Neurons Parameters


Dense input 15 15
Hidden GRU 50 15150
Hidden GRU 50 15150
Dense output 5 250
Total amount of parameters 30565

Table 7.11: Network Structure of GRU Network with 50 neurons in the hidden layers

Layer Neurons Parameters


Dense input 15 15
Hidden GRU 100 45300
Hidden GRU 100 60300
Dense output 5 500
Total amount of parameters 106115

Table 7.12: Network Structure of GRU Network with 100 neurons in the hidden layers

Tables 7.10, 7.11 and 7.12 shed light on how the number of parameters increases when more
neurons are added to the layers. When the number of neurons increases, the losses seem to
fluctuate more. Shuffling the data at every epoch can also improve the accuracy of the model.
The losses of such a network are plotted in Figure 7.27, while the losses of the original GRU
network without data shuffling are shown in Figure 7.28.

Figure 7.27: Losses of GRU Network, data shuffled every epoch

Figure 7.28: Losses of GRU Network, data not shuffled every epoch

We can see that the oscillations decrease when the data is shuffled at every epoch. Zoomed
in, this behavior is also noticeable, as the curve is more sensitive to spikes originating
from the data (Figures 7.29 and 7.30).

Figure 7.29: Zoomed in GRU Network, data shuffled every epoch

Figure 7.30: Zoomed in GRU Network, data not shuffled every epoch

Adding multiple GRU layers to the network can also have an impact. A study has been made
regarding this, and several network structures have been proposed. A fixed number of neurons
has been chosen for all the layers (i.e., 50 neurons for each hidden layer). The original GRU
network structure has already been laid out in Table 7.11; a second structure is given in
Table 7.13 and a third in Table 7.14.

Layer Neurons Parameters


Dense input 15 15
Hidden GRU 50 15150
Hidden GRU 50 15150
Hidden GRU 50 15150
Dense output 5 250
Total amount of parameters 45715

Table 7.13: Network Structure with 3 hidden GRU layers

Layer Neurons Parameters


Dense input 15 15
Hidden GRU 50 15150
Hidden GRU 50 15150
Hidden GRU 50 15150
Hidden GRU 50 15150
Dense output 5 250
Total amount of parameters 60865

Table 7.14: Network Structure with 4 hidden GRU layers

The number of parameters of the network increases significantly, which has a negative
influence on the CPU performance. We only simulate the models for 70 epochs, as the evolution
of the curves is already clearly visible by then (the values stay more or less constant) and
the network does not have to train for unnecessarily long periods of time.

Figure 7.31: Comparison of losses of networks with multiple GRU layers and a prediction
horizon of 10 minutes

7.8.1 Discussion

Significant results are only achieved when the number of neurons in the hidden layers is
high enough. With 100 neurons in each of the two hidden GRU layers, the losses drop to an
MAE that is comparable to the results of the other neural network types. Adding more than
two hidden layers appears to be counterproductive and increases the loss. For this time
series data, two hidden GRU layers with 100 neurons each produce the best results. All
tests were made with a prediction horizon of 10 minutes (5 neurons in the output layer).
The experiments in this section also showed that shuffling the data at every epoch is very
beneficial, as it helps to mitigate lagging effects: the shuffled model lags the data less than
the model trained without per-epoch shuffling. It must be noted, however, that the number
of parameters grows considerably with every added GRU layer, much more than in the
feedforward case, so training a GRU network takes considerably longer.
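The roughly threefold parameter overhead of a GRU layer compared to a plain feedforward layer can be checked with a back-of-the-envelope calculation. The sketch below uses the standard GRU parameter formula (three gates, each with input weights, recurrent weights and a bias); exact counts differ slightly between implementations, and the assumed 50-dimensional input is only an illustration consistent with the 15150-parameter hidden layers in the tables above.

```python
def dense_params(d_in: int, n: int) -> int:
    # Fully connected layer: weight matrix plus biases.
    return n * d_in + n

def gru_params(d_in: int, n: int) -> int:
    # Standard GRU: 3 gates, each with input weights W, recurrent weights U
    # and a bias vector.
    return 3 * (n * d_in + n * n + n)

for n in (10, 50, 100):
    print(n, dense_params(50, n), gru_params(50, n))
# For d_in = 50 and n = 50 this gives 15150 GRU parameters, consistent with
# the hidden-layer counts reported in the tables (assuming a 50-dim input).
```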

8. Analysis of the presented
forecasting techniques
The four forecasting techniques that we experimented with are ARIMA, seasonal ARIMA,
feedforward neural networks, and RNNs.

If the available data set is large enough for training, neural networks generally perform
better. To illustrate: with seasonal ARIMA, the MAE is 70.5 for a prediction horizon of
10 minutes at the Grand-Place, while an RNN with LSTM reaches an MAE of 46.5 for the
same horizon and the same area. However, if the available data set is small, ARIMA and
seasonal ARIMA become the first choice, as they perform well on small data sets. In
Chapter 4, all ARIMA forecasts are generated from a single day of training data, and in
Chapter 5 the seasonal ARIMA forecasts are generated from three or four days of training
data; both give promising results.
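For reference, the MAE values quoted above are the mean absolute difference between the forecast and the held-out ground truth. A minimal sketch with made-up numbers (the counts below are not measured values):

```python
import numpy as np

y_true = np.array([810.0, 842.0, 873.0, 905.0, 940.0])  # observed counts (made up)
y_pred = np.array([760.0, 795.0, 830.0, 870.0, 900.0])  # 10-minute-ahead forecast (made up)

mae = np.mean(np.abs(y_true - y_pred))                      # Mean Absolute Error
mape = 100.0 * np.mean(np.abs((y_true - y_pred) / y_true))  # Mean Absolute Percentage Error
print(f"MAE = {mae:.1f}, MAPE = {mape:.1f}%")
```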

Regarding the model parameters, ARIMA needs (p, d, q) to characterize a model, where p
is the AR order, d is the differencing order needed for stationarity, and q is the MA order.
Seasonal ARIMA needs three more coefficients, (P, D, Q), which represent the seasonal AR
order, the seasonal differencing order and the seasonal MA order. For neural networks we
do not need to specify such a model structure, as the parameters are learned by the model
itself; however, many hyperparameters need to be tuned, and they have a large impact on
the forecast results.
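As an illustration of how these orders are specified in practice, the sketch below fits both model types with statsmodels on synthetic hourly counts. The orders, the daily seasonal period s = 24 and the placeholder data are assumptions chosen for the example, not the values used in Chapters 4 and 5.

```python
import numpy as np
from statsmodels.tsa.statespace.sarimax import SARIMAX

# Two synthetic weeks of hourly device counts (placeholder data).
rng = np.random.default_rng(0)
counts = rng.poisson(200, size=24 * 14).astype(float)

# Non-seasonal ARIMA(p, d, q).
arima = SARIMAX(counts, order=(2, 1, 1)).fit(disp=False)

# Seasonal ARIMA(p, d, q)(P, D, Q)_s with a daily period of s = 24.
sarima = SARIMAX(counts, order=(2, 1, 1),
                 seasonal_order=(1, 1, 1, 24)).fit(disp=False)

print(arima.aic, sarima.aic)     # e.g., compare the two fits via the AIC
print(sarima.forecast(steps=6))  # forecast six hours ahead
```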

ARIMA and seasonal ARIMA require little time to train, whereas neural networks take
longer to train and also need more time to run and generate forecasts.

In the future, even more accurate models could be obtained by no longer treating the
sensors as independent of each other. Chapter 2 shows that crowd movements occur between
sensors, and taking them into account can improve the forecasting. One way to do this is to
use vector ARIMA models (VARIMA), in which the scalar coefficients are replaced by
matrices.
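As a first step in that direction, the sketch below fits a plain vector AR model across several sensors with statsmodels. The generic sensor columns, the lag order and the synthetic data are assumptions for illustration only; a full VARIMA model would extend this with differencing and moving-average terms.

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.api import VAR

# Synthetic hourly counts for three sensors over two weeks (placeholder data).
rng = np.random.default_rng(1)
data = pd.DataFrame(rng.poisson(200, size=(24 * 14, 3)).astype(float),
                    columns=["sensor_1", "sensor_2", "sensor_3"])

results = VAR(data).fit(2)                 # VAR(2); an information criterion
                                           # could also select the lag order
last_obs = data.values[-results.k_ar:]     # most recent k_ar observations
print(results.forecast(last_obs, steps=3)) # joint 3-hour-ahead forecast
```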
The accuracy of our forecasting techniques could be further improved by taking a broader
set of input variables into account, such as external factors that influence attendance, for
instance the weather. BME hosts a range of events of different durations, and accounting
for this may also benefit the forecasting model.
Lastly, it would be interesting to include data from previous editions of the same event, so
that more data is available and an even greater accuracy can likely be achieved.

9. Conclusion
We analyzed four techniques for crowd density forecasting: ARIMA, seasonal ARIMA,
feedforward neural networks and RNNs. These forecasting techniques are applicable to
events like Plaisirs d'Hiver; for ARIMA models, even shorter events yield acceptable results.
The study started by developing a crowd monitoring system with multiple sensors
communicating with a central server. This objective was met, as relevant data was gathered.
The data we collected was compared to the data provided by Proximus; the two are very
similar, and certain (seasonal) trends and shapes were visible in both. This brings us to
the first forecasting method applied to our data. The ARIMA method uses the influence of
past samples to predict future samples through linear relationships; prediction horizons of
up to 21 minutes were achieved, which is not insignificant. Next, even longer prediction
horizons were obtained with seasonal ARIMA, which exploits the seasonality present in the
time series. Seasonal ARIMA models achieved acceptable results with a 30-minute horizon,
the longest of all the forecasting techniques studied in this thesis. Several types of neural
network structures were also designed and applied to our data. The first type is a simple
feedforward neural network, in which data passes in only one direction from layer to layer,
whereas RNNs contain feedback connections that let information from earlier time steps
influence later predictions. With time series data it is generally assumed that RNNs give
more accurate results, although this study has shown that our feedforward networks provide
the most accurate results compared to the types of RNNs discussed here. The two types of
RNNs presented are RNNs with LSTM layers and RNNs with GRU layers. Of the two, the
GRU-based RNNs seemed to provide slightly better results than the LSTM-based RNNs,
even though the results were very similar. It must also be noted that the computational
complexity of a GRU RNN is lower than that of an LSTM RNN, which makes these results
all the more promising.
