Travel Time Prediction Using Random Forest

PRANESH CHAITRA
SCHOOL OF ELECTRICAL AND ELECTRONIC ENGINEERING
2019
A DISSERTATION SUBMITTED IN PARTIAL FULFILMENT OF
THE REQUIREMENTS FOR THE DEGREE OF MASTER OF
SCIENCE IN COMPUTER CONTROL & AUTOMATION
TABLE OF CONTENTS
LIST OF TABLES
LIST OF FIGURES
CHAPTER 1 – INTRODUCTION
1.1 Travel time prediction
1.2 Motivation and Background
1.3 Organization of report
CHAPTER 2 – LITERATURE REVIEW
2.1 Introduction
2.2 Intelligent Transport System (ITS)
2.3 Traffic data collection
2.4 Time series methods
2.5 Machine learning methods
2.6 Deep learning methods
2.7 Implementation
CHAPTER 3 – MACHINE LEARNING ALGORITHMS
3.1 Introduction
3.2 Machine learning algorithms
3.3 Random forest
3.3.1 Bagging
3.3.2 Decision trees
3.3.3 Random forest as a classifier
3.3.4 Random forest in regression
3.3.5 Advantages and disadvantages of random forest algorithm
CHAPTER 4 – METHODOLOGY
4.1 Introduction
4.2 Project pipeline
4.2.1 Traffic data collection
4.2.2 Preprocessing of data
4.2.3 Dataset preparation
4.2.4 Travel time prediction using random forest algorithm
CHAPTER 5 – RESULTS
5.1 Introduction
5.2 Performance Evaluation
5.2.1 Performance evaluation in temporal domain
5.2.2 Performance evaluation in spatial domain
5.2.3 Spatiotemporal error patterns
5.2.4 MAPE variations for all segments
5.3 Comparison of random forest model with other models
CHAPTER 6 – CONCLUSION AND FUTURE WORK
6.1 Conclusion
6.2 Future work
REFERENCES
ABSTRACT
Traffic congestion is considered one of the major problems hindering the growth of a
city. Effective measures to avoid traffic jams therefore support the development of the
city. A system that can accurately predict traffic in different situations, locations and
times should be developed, so that potential traffic jams can be forecast.
Machine learning and deep learning methods are gaining importance in travel time
prediction and can give promising results. Since traffic data is large, the random forest
algorithm is well suited to handling it and giving accurate results. Random forest is a
supervised ensemble learning method which can be used for both classification and
regression. Multiple decision trees are built and merged together to obtain a more stable and
accurate prediction. In this dissertation, the travel time is predicted using the random
forest algorithm. The model performs well, and the predicted travel time is more accurate
than that of other methods such as Support Vector Machine (SVM),
historical average, neural networks and simple linear regression.
Key words: Machine Learning, Deep Learning, Random forest, Support Vector Machine, Neural
networks, linear regression
ACKNOWLEDGEMENT
It has been an immense pleasure to work on this master’s thesis research, which has
enhanced my skills and broadened my knowledge. I would like to express my sincere gratitude to
Professor Justin Dauwels, whose guidance and direction kept the research on the
right track. I am also indebted to the Professor for giving me the freedom to work on
research in addition to academics.
Sincere thanks to Dr. Anil Kumar Bachu and Dr. Saratchandra Nagavarapu for their constant
support, guidance, motivation, and insight throughout this project. They spent ample
time guiding me through every step of the project and helped me bring this study
to a successful conclusion.
I would like to extend special thanks to my friend Rakesh Reddy for proofreading my
dissertation.
Last but not least, I would like to thank my friends and family for their
constant support and love, which helped me finish the project successfully.
LIST OF TABLES
Table 4.1 Length of segments
Table 4.2 Dataset division
LIST OF FIGURES
Fig 3.1 Supervised learning model
Fig 3.2 Unsupervised learning model
Fig 3.3 Reinforcement learning
Fig 3.4 Difference between learning algorithms
Fig 3.5 Bootstrapping
Fig 3.6 Example of a decision tree
Fig 3.7 Classification using random forest
Fig 3.8 Random forest for regression
Fig 4.1 Implementation structure of the project
Fig 4.2 Segments in Westbound and Eastbound line
Fig 4.3 Flow chart of the random forest model
Fig 4.4 Variations in MAPE with the number of trees
Fig 5.1 Actual vs predicted travel time for test day 1 in segment 9
Fig 5.2 Actual vs predicted travel time for test day 1 in segment 12
Fig 5.3 Actual vs predicted travel time for test day 2 in segment 9
Fig 5.4 Actual vs predicted travel time for test day 2 in segment 12
Fig 5.5 Variations in MAPE in segment 9 for different test days
Fig 5.6 Variation in MAPE in segment 12 for different test days
Fig 5.7 Traffic state of day 1 in the interval 12.30 to 12.33 AM
Fig 5.8 Traffic state of day 1 in the interval 11.42 to 11.45 PM
Fig 5.9 MAPE variations across all time slots for test day 1
Fig 5.10 MAPE variations across all time slots for test day 2
Fig 5.11 Heatmap representing spatiotemporal error patterns in terms of MAE
Fig 5.12 Heatmap representing spatiotemporal error patterns in terms of MAPE
Fig 5.13 Variations in MAPE for all segments
CHAPTER 1 – INTRODUCTION
1.1 Travel time prediction
Traffic jams are becoming very common in cities around the world, and Singapore is no
exception. They increase travel times and delays during peak hours, causing chaos. For
the sustainable development of the nation, effective measures must be taken to avoid such
negative impacts of traffic jams.
Traffic jams can potentially be avoided, or at least limited, by appropriately
informing drivers about the traffic situation. This is an important issue in the area of
Intelligent Transport System (ITS) and Advanced Intelligent Transport System (AITS). In
public transportation, the travel time of city buses follows a distinct distribution,
especially in the morning and evening rush hours. Problems such as long
waiting times, uneven bus arrival times and reduced reliability affect the efficiency of
the system, the attractiveness of the service and passengers’ willingness to take public
transportation. To address this, one should be able to track and predict the traffic flow in
real time. Travel time information helps to save travel time and allows travel routes to be
selected pre-trip, which in turn improves the reliability and reduces the operational costs
of the transportation system.
1.2 Motivation and Background
Currently, up to 900 million vehicles are running on our road networks.
Regardless of their type, these vehicles are crucial for human mobility. However, their
number has also led to a drastic increase in pollution and road congestion. Intelligent
transportation systems (ITS) have developed around the world as part of smart cities,
integrating various technologies like cloud computing, the Internet of Things, sensors,
artificial intelligence, geographical information, and social networks.
In developed countries, expanding roadway infrastructure is becoming less of an option for
transportation and government agencies due to environmental, financial and social
constraints.
Under these circumstances, monitoring and disseminating travel time information through
Advanced Traveler Information Systems (ATISs) helps drivers make better travel decisions.
The innovative services provided by ITS can improve transportation mobility and safety by
making road users better informed and more coordinated, which helps in addressing the
transportation issues caused by the significant increase in city traffic in the past few decades.
Traffic prediction is one of the key tasks of ITS. It provides essential information to road
users and traffic management agencies to allow better decision making. It also helps to
improve transport network planning to reduce common problems, such as road accidents,
traffic congestion, and air pollution.
1.3 Organization of report
The current chapter introduces Intelligent Transport Systems (ITS) and explains the
importance of travel time prediction in increasing the efficiency of transportation systems.
Chapter 2 presents the literature review, which analyzes the work done in this
field so far.
Chapter 3 introduces the basics of machine learning and gives a detailed explanation of the
techniques used in the project.
Chapter 4 deals with the methodology and design followed in the project and gives a brief
explanation of the elements included. It also introduces the technical terms often
encountered in the report.
Chapter 5 presents the experimental results produced by the model and compares the
performance of the random forest model with that of various other models.
Chapter 6 gives the conclusion and future work for the problem addressed.
The final section of the report lists the references used throughout the project.
CHAPTER 2 – LITERATURE REVIEW
2.1 Introduction
A large number of researchers have dealt with the prediction of travel time on road networks.
This chapter reviews the approaches that have been taken to implement such systems. It
gives a comprehensive overview of the work done so far in this field, the
techniques currently in use and the method best suited to
the problem addressed in this dissertation.
2.2 Intelligent Transport System (ITS)
The world’s population is increasing at an enormous rate, accompanied by
changes in population density and urbanization. The world economy is
also growing at a rapid pace. There is a greater need for mobility, and road transportation
is easily accessible to people. This has led to an increase in traffic congestion. Congestion
increases air pollution, travel time and fuel consumption. It strains the transportation
infrastructure and significantly reduces its efficiency. Across the world, the number of
accidents is also increasing as road networks expand.
Such problems can be overcome by Intelligent Transport Systems (ITS) [1].
ITS is an advanced application that generates real-time information and provides
innovative services relating to various modes of transport and traffic management. Traffic
problems are minimized in order to achieve higher traffic efficiency. ITS provides users with
prior information regarding travel time, traffic, availability of seats, real-time running
information etc. ITS supports the growth of smart cities and significantly reduces the
travel time of commuters. The safety and comfort of citizens also increase.
The performance of a commonly deployed real-time information generation scheme
is reviewed by Oded Cats and Gerasimos Loutos [2]. The real-time information is determined
by the current network conditions, which should yield more accurate predictions
of the travel time. However, due to the uncertainty in driver behavior, traffic conditions
and dwell time, the accuracy and reliability of real-time information systems decrease.
Control strategies and operational planning have been employed to improve the reliability of
public transportation systems [3]. Operational planning strategies included network
definition, schedule planning, and the definition and assignment of duties. Control strategies
were used to restore normal service when deviations occurred.
With the use of Intelligent Transport System (ITS) and Advanced Traveler Information
System (ATIS), public transport reliability improves. The information
services provided increase passengers’ satisfaction, which encourages the use of public
transport over personal vehicles. As a result, pollution and road congestion
can be reduced.
2.3 Traffic data collection
Traffic data collection is an important step in any traffic-related study. The
information must be transparent, and the data collected must be reliable, precise
and of high quality. ITS captures roadway information from vehicles passing a given
point, including their average speed. The position of vehicles can also be tracked
using satellite-based systems or mobile phone tracking. Traffic data is transmitted to
central units, where it is aggregated and converted into information that can be used
further, for example to ensure the efficient operation of road networks.
In reference [4], the various methods of data collection are discussed in detail. ITS
collects data using three different techniques, namely site-based data collection, floating car
data and wide area data collection. Site-based data collection includes the videographic method
and the infrared-based method. The data collected using these methods is highly accurate, but
maintenance and implementation costs are high and data coverage is limited.
The floating car method is a low-cost, GPS-based method which works in all weather
conditions [5]. The wide area data collection method uses satellite sensors, RFID technology,
mobile telephony and Dedicated Short Range Communication (DSRC). In probe vehicle systems,
vehicles equipped with wireless technology like DSRC report speed and other information
to a central traffic management center. The aggregated probe data identifies congested
locations. Probe vehicle systems collect data continuously and automatically.
The information gathered from Inductive Loop Detectors (ILDs), Dedicated Short Range
Communication (DSRC), Toll Collection System (TCS) and probe vehicles can be fused [6].
This results in a hybrid dataset that overcomes the problem of missing data. This kind of data
can also be used for long-term predictions. In reference [7], the data collected from the
Vehicle Detector System (VDS) and the Automatic Toll Collection (ATC) system are combined.
Since the data is combined, the limitations of both systems are minimized, accuracy is
enhanced, and the issues related to inadequate data samples are avoided.
Once the traffic information is collected from various sources, different methods
and algorithms can be used to predict the travel time of vehicles.
2.4 Time series methods
Various methods have been adopted to predict the travel time of vehicles. Time series
analysis is a popular one among them. Time series data has a temporal ordering. Using
time series analysis, the data can be analyzed and meaningful statistics and other
characteristics can be extracted. Patterns in the past data are used to estimate future values.
Simple moving average (SMA) is one of the easiest forecasting techniques. It is the simple
average of the last N data points. The moving average is used to smooth out irregularities
(peaks and valleys) so that trends can be recognized easily. A classic time series model for
travel time prediction is the autoregressive moving average (ARMA) model. Both the accuracy
and the complexity of these models are high.
In reference [8], the ARMA model is combined with the particle swarm optimization (PSO)
algorithm to optimize the solving process of the ARMA model. When combined with PSO, the
ARMA model performed much better and the Mean Absolute Percentage Error (MAPE)
decreased significantly.
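The simple moving average described above can be sketched in a few lines of Python. This is a minimal illustration only: the function name, the window length and the travel time values are assumptions made for the example and do not come from the cited studies.

```python
def simple_moving_average(values, window):
    """Forecast the next value as the mean of the last `window` observations."""
    if window <= 0 or window > len(values):
        raise ValueError("window must be between 1 and len(values)")
    recent = values[-window:]
    return sum(recent) / window


# Hypothetical travel times (in seconds) for one road segment, oldest first.
travel_times = [62.0, 65.0, 70.0, 90.0, 85.0, 80.0]

# Forecast for the next interval from the last N = 3 observations.
forecast = simple_moving_average(travel_times, window=3)
print(forecast)
```

A larger window gives a smoother forecast but reacts more slowly to a sudden change in traffic conditions.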
Autoregressive integrated moving average (ARIMA) is used to predict future values from time
series data when the data is consistent and outliers are few. ARIMA modeling accounts
for trends, cycles, errors and non-stationary aspects of a data set when
making forecasts. The ARIMA model can also be extended to incorporate seasonality. The
Auto-correlation Function (ACF) and the Partial Auto-correlation Function (PACF) are used to
identify the model [9]. If the series is not stationary, the ARIMA model is used; if the
original data series is stationary, the model reduces to the ARMA model. With the selected
model, forecasts are made for either one period or several periods ahead. The prediction was
based only on historical travel time data; other factors like road and traffic conditions
were not considered. The moving average models obtained yielded the minimum mean absolute
relative error (MARE) and mean absolute percentage prediction error (MAPPE) values.
One of the most popular optimal estimators is the Kalman filter. The parameters of interest are
inferred from uncertain, indirect and inaccurate observations. This method is very
convenient for online real-time processing, and the best estimate is identified by filtering
out the noise. In probe-based traffic information, if there is limited probe data, Kalman
filters combined with a variable aggregation interval scheme are used [10]. Short-term travel
time is predicted using the variable aggregation interval scheme. With this scheme, the
accuracy increased by 40% compared to a fixed aggregation interval under free-flow
conditions. The Kalman filter constantly updates its parameters to predict the required state
variables as new observations are obtained. The performance of the model largely depends on
the consistency between historical and current travel time patterns.
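The predict and update cycle of a Kalman filter can be illustrated with a scalar example. The sketch below assumes a simple random-walk state model and hypothetical noise variances; it is not the variable aggregation interval scheme of reference [10], and all names and numbers are chosen only for illustration.

```python
def kalman_1d(measurements, process_var, measurement_var,
              initial_estimate, initial_var):
    """Scalar Kalman filter with a random-walk state model.

    Returns the filtered estimate after each measurement.
    """
    estimate, variance = initial_estimate, initial_var
    filtered = []
    for z in measurements:
        # Predict: the state is assumed unchanged, but its uncertainty grows.
        variance += process_var
        # Update: blend prediction and measurement using the Kalman gain.
        gain = variance / (variance + measurement_var)
        estimate += gain * (z - estimate)
        variance *= (1.0 - gain)
        filtered.append(estimate)
    return filtered


# Hypothetical noisy travel time observations (seconds) from probe vehicles.
observations = [61.0, 58.0, 64.0, 90.0, 88.0, 92.0]
estimates = kalman_1d(observations, process_var=4.0, measurement_var=25.0,
                      initial_estimate=60.0, initial_var=100.0)
print(estimates)
```

The ratio of the process variance to the measurement variance controls how quickly the estimate follows new observations.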
Time series models mainly depend on the similarity between future information and
historical information. If the average behavior of the historical data changes, the
prediction results will deviate noticeably.
2.5 Machine learning methods
The trend is slowly shifting towards machine learning algorithms to solve travel time
prediction problems. Many machine learning algorithms can be put to use
for such purposes [11], including traditional methods such as k-Nearest Neighbour, Support
Vector Machine, Random Forest and Neural Networks, and deep learning techniques like
Convolutional Neural Networks.
Support Vector Regression (SVR) is a supervised regression algorithm. A kernel maps the
lower-dimensional data to a higher-dimensional space, where a hyperplane
is used to predict the target value. The aim is to keep the error within a threshold. A
boundary is drawn on either side of the hyperplane such that the data points closest to the
hyperplane, the support vectors, fall within that boundary. In [11], a time estimation model
is built considering the variables associated with the vehicles’ movement. These include
segment distance, the hour of the day, date, the day of the week etc. The large number of
variables makes the solution complex, and the estimated time is affected by any small change
in behavior.
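The idea of keeping the error within a threshold can be illustrated with the epsilon-insensitive loss that is commonly used in SVR. The following is a minimal sketch: the function name, the threshold and the numbers are assumptions for illustration and are not taken from reference [11].

```python
def epsilon_insensitive_loss(actual, predicted, epsilon):
    """Errors smaller than epsilon cost nothing; larger errors cost the excess."""
    return max(0.0, abs(actual - predicted) - epsilon)


# Hypothetical travel times in seconds, with a tolerance of 5 seconds.
inside_tube = epsilon_insensitive_loss(actual=100.0, predicted=103.0, epsilon=5.0)
outside_tube = epsilon_insensitive_loss(actual=100.0, predicted=110.0, epsilon=5.0)
print(inside_tube, outside_tube)
```

Predictions that fall within the tolerance around the actual value are not penalized, so only points on or outside this band influence the fitted function.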
Regression models estimate the values of dependent variables from the values of independent
variables. They also indicate which inputs are more or less important. Artificial Neural
Networks (ANNs) give better results than regression methods like SVR in predicting travel
time [12]. ANN is a machine learning algorithm modeled on the human neuron. It
consists of an input layer, hidden layers, and an output layer. An ANN learns from the data
fed to it and can often correctly infer the unseen part of a population, even if the data
contains noisy information. The Multi-Layer Perceptron (MLP) is chosen since it has a very
good capability for arbitrary input-output mapping. There is a risk of overtraining, which
causes memorization and failure to recognize some patterns [13]. The performance of the ANN
model is evaluated using the correlation coefficient, Root Mean Square Error (RMSE),
Mean Absolute Percentage Error (MAPE) and standard deviation.
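Two of the error measures named above, RMSE and MAPE, follow their standard definitions and can be computed as in the sketch below. The function names and sample values are assumptions for illustration.

```python
import math


def mape(actual, predicted):
    """Mean Absolute Percentage Error in percent; actual values must be non-zero."""
    total = sum(abs((a - p) / a) for a, p in zip(actual, predicted))
    return 100.0 * total / len(actual)


def rmse(actual, predicted):
    """Root Mean Square Error, in the same unit as the data."""
    total = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    return math.sqrt(total / len(actual))


# Hypothetical actual and predicted travel times in seconds.
actual_times = [100.0, 200.0]
predicted_times = [110.0, 180.0]
print(mape(actual_times, predicted_times), rmse(actual_times, predicted_times))
```

MAPE is scale-free, which makes it convenient for comparing segments of different lengths, while RMSE penalizes large individual errors more heavily.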
SVM, or Support Vector Machine, is a supervised machine learning algorithm which is
used for regression or classification problems. The data points are plotted in an
n-dimensional space (where n is the number of features), with the value of each
feature being the value of a particular coordinate. Classification is done by finding a
hyperplane that separates the classes. SVM is reported to be superior to the neural network
and can be used to predict the travel time [14]. It is based on statistical learning theory.
It overcomes difficulties such as non-linearity, the curse of dimensionality, overlearning
and local minima.
SVM is combined with Weighted Moving Average (WMA) [15] to eliminate unwanted
fluctuations in the data set. In the weighted moving average method, recent historical data
is weighted more heavily than older data. The combined model has good generalization ability
and strong learning ability. The parameters of SVM govern the training process. In reference
[16], a Genetic Algorithm (GA) is used with SVM for predicting the travel time. This model is
superior to traditional SVM and ANN in terms of accuracy. It takes inputs like the length of
road, weather, bus speed and rate of road usage, and GA is adopted as the search algorithm to
optimize the learning parameters of SVM. This model predicts the bus arrival time
dynamically, with less computation and high accuracy. GA helps in finding the optimal
parameter combination quickly. It is simple to implement, has few parameters to set
and converges quickly.
k-NN, or k-nearest neighbour, is a learning algorithm that does not learn iteratively but
simply classifies an unknown object based on its closest neighbours.
A k-NN model is developed to predict the travel time [7]. The model is
easily transferable, and good results are obtained except for trips with longer travel times.
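For travel time, which is a continuous quantity, the nearest-neighbour idea is typically applied as regression: the prediction is the average travel time of the k most similar historical records. The sketch below assumes Euclidean distance and a single hypothetical feature (hour of the day); it is not the model of reference [7].

```python
def knn_predict(train_features, train_targets, query, k):
    """Predict by averaging the targets of the k training points nearest to `query`."""
    distances = []
    for features, target in zip(train_features, train_targets):
        # Euclidean distance between the stored record and the query.
        dist = sum((f - q) ** 2 for f, q in zip(features, query)) ** 0.5
        distances.append((dist, target))
    distances.sort(key=lambda pair: pair[0])
    nearest = distances[:k]
    return sum(target for _, target in nearest) / len(nearest)


# Hypothetical records: feature = (hour of day,), target = travel time in seconds.
hours = [(8.0,), (9.0,), (17.0,), (18.0,)]
times = [100.0, 120.0, 200.0, 220.0]
print(knn_predict(hours, times, query=(8.5,), k=2))
```

No training step is needed; all computation happens at prediction time, which is why the method transfers easily to new road segments once historical records exist.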
Additive models are also used for travel time prediction [17]. They provide
a framework that allows for flexible modeling of bus travel times. The data usually gives
the relationship between bus movements in time and space. The relationship between travel
time and predictor variables like the day of the week, the hour of the day and traffic
conditions must also be considered. Additive models are easy to interpret and
flexible in modeling complex non-linear relationships. Their
performance is also better.
Clustering is an unsupervised machine learning technique that involves the grouping of data
points. K-means clustering is the most well-known method: a number of classes or groups
is selected, and their center points are randomly initialized. Each data point is
assigned to the group whose center is closest to it. Travel time is predicted using a modified
K-means clustering technique [18]. Historical data is clustered based on travel time, the
frequency of travel time and velocity for a particular road segment and time group. This
method is shown to be better than Naive Bayesian Classification (NBC), Chain Average
(CA) and Successive Moving Average (SMA). The method is very simple and fast.
Regular clustering methods do not produce the same results on every run.
Modified K-means clustering eliminates this shortcoming. Two centroids are fixed
and two clusters are analyzed, which addresses the uncertain situations.
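The assignment and update steps of K-means can be sketched on scalar travel times as follows. To keep the result reproducible, this sketch starts from two fixed initial centroids rather than random ones. It illustrates plain K-means only and is not the modified technique of reference [18]; the names and values are assumptions.

```python
def kmeans_1d(values, centroids, iterations=10):
    """Plain k-means on scalar values, starting from the given centroids."""
    centroids = list(centroids)
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        # Assignment step: each value joins the cluster with the nearest centroid.
        for v in values:
            nearest = min(range(len(centroids)), key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # Update step: move each centroid to the mean of its cluster;
        # a centroid with an empty cluster stays where it is.
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters


# Hypothetical travel times (seconds): free-flow and congested observations.
observed = [60.0, 62.0, 64.0, 118.0, 120.0, 122.0]
centers, groups = kmeans_1d(observed, centroids=[50.0, 150.0])
print(centers, groups)
```

With fixed starting centroids the procedure is deterministic, so repeated runs on the same data give the same clusters.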
Random forest is a supervised ensemble learning method. It can be
used for both classification and regression. Multiple decision trees are built; for
classification, the class with the highest number of votes is chosen, and for regression,
the mean prediction of the individual trees is taken. The complete random forest algorithm is
explained in detail in Chapter 3. Random forests perform well in the prediction of travel time.
The method is used to predict traffic at intersections as well [19]. Traffic prediction at
intersections is quite challenging as it involves various participants like vehicles, cyclists
and pedestrians. The features selected in this model are the day of the week, weekend
or weekday, peak or off-peak and event distance. Spatio-temporal speed measurements are
also used by random forest algorithms to make accurate travel time predictions [20]. The
relationship between the predictors (feature vectors) and travel time is modeled using the
random forest. On congested days, the prediction error was reduced by more than 38 % and
28 % compared to the instantaneous algorithm used in practice and a genetic
programming algorithm for travel time prediction, respectively.
Random forest also gives travel time reliability without any extra processing. It is considered
one of the best machine learning algorithms as it is robust to overfitting.
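The regression case, in which the predictions of many trees are averaged, can be illustrated with a deliberately simplified sketch. It uses bootstrap samples and depth-one trees (stumps) on a single feature. A full random forest grows deeper trees and also selects a random subset of features at each split, which is omitted here. All names and data are assumptions for illustration, not the model developed in this dissertation.

```python
import random


def fit_stump(xs, ys):
    """Fit a depth-1 regression tree: one threshold and one mean per side."""
    best = None
    for threshold in sorted(set(xs)):
        left = [y for x, y in zip(xs, ys) if x <= threshold]
        right = [y for x, y in zip(xs, ys) if x > threshold]
        if not left or not right:
            continue
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        # Sum of squared errors of this split.
        sse = (sum((y - left_mean) ** 2 for y in left)
               + sum((y - right_mean) ** 2 for y in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    if best is None:
        # All x values in the sample are identical: predict the overall mean.
        mean = sum(ys) / len(ys)
        return (float("inf"), mean, mean)
    return best[1:]


def predict_stump(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean


def fit_forest(xs, ys, n_trees, seed=0):
    """Train each stump on a bootstrap sample drawn with replacement."""
    rng = random.Random(seed)
    n = len(xs)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]
        forest.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return forest


def predict_forest(forest, x):
    """Regression output: the mean of the individual tree predictions."""
    return sum(predict_stump(s, x) for s in forest) / len(forest)


# Hypothetical data: feature = time slot index, target = travel time in seconds.
slots = [1, 2, 3, 4, 5, 6, 7, 8]
times = [60.0, 60.0, 60.0, 60.0, 120.0, 120.0, 120.0, 120.0]
forest = fit_forest(slots, times, n_trees=25, seed=0)
print(predict_forest(forest, 1), predict_forest(forest, 8))
```

Because each stump sees a different bootstrap sample, the individual predictions differ slightly, and averaging them gives a more stable estimate than any single tree.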
2.6 Deep learning methods
Deep learning models have gained increased attention within the Artificial Intelligence
community [21]. They have a very high prediction accuracy. Using deep learning methods, a
single model can be built to make accurate predictions for all the segments in the network
instead of an individual model for each segment. Long short-term memory (LSTM) and
Convolutional Neural Networks (CNN) are the deep learning algorithms most widely used for
travel time prediction. LSTM is a special recurrent neural network (RNN) structure. CNN is an
architecture widely applied to image recognition in computer vision. Both of these models
have several hidden layers. A CNN learns traffic as images and then makes predictions. The
results obtained by these deep learning methods have a high level of accuracy.
2.7 Implementation
Alireza Ermagun and David Levinson review studies that forecast traffic conditions using
spatial dependence between links [22]. Two perspectives are considered: methodological
frameworks and methods for capturing spatial information. Spatial information
boosts the accuracy of prediction, especially under congested conditions and for longer
horizons. Machine learning outperforms naïve statistical methods such as historical average
and exponential smoothing. This is not guaranteed with respect to advanced statistical
methods such as spatiotemporal models and ARIMA. The review discusses the spatial components
and their role in traffic forecasting, the capturing and embedding of spatial dependence in
forecasting methods and the extent of dependency between the links.
Machine learning and regression methods are compared with the time series methods with
respect to travel time prediction of vehicles [3]. Regression methods are capable of estimating
the impact that each input variable has on the target variable. Methods like Artificial Neural
Network (ANN), Support Vector Regression (SVR) and k-Nearest Neighbor (kNN) are well suited
to finding complex non-linear relationships between independent and target
variables. The state-based and the time series models rely mostly on the recent data samples
and do not depend on the quantity of the data, so the training period is not large.
It is shown in reference [21] that for both congested and uncongested traffic conditions, deep
learning methods can be used efficiently. These methods offer a promising approach to real-
time prediction of travel times on a network scale. A single model can be built to predict the
travel time of vehicles in various segments. This increases the efficiency of the model as well.
The error percentage obtained is lower than that of the other traditional methods.
In the random forest algorithm, many decision trees are built and merged together to obtain a
more stable and accurate prediction. This algorithm performs well for both classification and
regression problems. The results obtained using random forests are good due to a wide
diversity in features, and the risk of overfitting is also reduced.
In this dissertation, the random forest algorithm is adopted to predict the travel time since the
model is very flexible and can be developed in a short period of time. The prediction results
are also accurate, and the model reports the relative importance of its features.
CHAPTER 3 – MACHINE LEARNING ALGORITHMS
3.1 Introduction
This chapter gives us an introduction to the types of machine learning algorithms and the
classifying techniques that can be used for each kind of algorithm. A detailed understanding
of random forest which is used in this project is also provided so that the further
implementation and the results presented can be well interpreted.
3.2 Machine learning algorithms
Machine learning algorithms can be broadly classified into the following categories, based on
the type of learning:
• Supervised Learning:
In supervised learning, the input data is labeled. A general rule is learned which maps
inputs to outputs. Because the correct outputs are known, the classifier can be corrected
when its predictions are wrong. The training is continued until the classifier
achieves a desired level of accuracy.
The prominent supervised learning algorithms are:
• k-Nearest Neighbor
• Naive Bayes
• Decision Trees
• Linear Regression
• Support Vector Machine
• Neural Networks
• Convolutional Neural Network
• Random Forest
Fig 3.1 Supervised learning model
• Unsupervised Learning:
In this type of learning the input data is unlabeled, so, unlike in supervised learning,
there is no direct way of knowing whether the classifier is training correctly. The system has
to look for patterns or rules to help understand the data better. The most common
unsupervised learning algorithm is the k-means clustering algorithm.
Fig 3.2 Unsupervised learning model
• Semi-Supervised Learning:
When the data is only partly labeled, the problem falls under the category of
semi-supervised learning. The cost of labeling all the available data might be too high, so
only part of the data is labeled. The model is then built from both the labeled and the
unlabeled data.
• Reinforcement Learning:
In this type of machine learning the system makes specific decisions by exposing
itself to the environment. It trains itself continually by trial and error and
tries to capture the best knowledge from past experience. A simple reward
feedback, known as the reinforcement signal, is given to the software agent so that it can
learn from the environment.
The most commonly used algorithms are:
• Q-Learning
• Temporal Difference (TD)
• Deep Adversarial Networks
Fig 3.3 Reinforcement learning
The following figure shows the difference between the learning algorithms:
Fig 3.4 Difference between learning algorithms

In this project, we are using random forest, a machine learning algorithm which is classified
as supervised learning.
3.3 Random forest
Random forest is a flexible and easy-to-use machine learning algorithm. Its
simplicity makes it one of the most used algorithms. Random forest is a supervised machine
learning algorithm. As suggested by its name, it creates a forest with a number of trees. The
robustness of the algorithm generally increases with the number of trees, and accuracy tends
to improve as trees are added until it levels off. An important feature of this algorithm is
that it can be used for both classification and regression.
The forest built by the algorithm is an ensemble of decision trees, usually trained with the
bagging method. Multiple decision trees are built and merged together to obtain a more
accurate and stable prediction.
3.3.1 Bagging
Bagging is an ensemble technique. Several decision tree classifiers are combined to produce
better predictive performance than a single decision tree classifier.
The bootstrap method is a statistical technique that estimates quantities about a population
by averaging estimates from multiple small data samples. Given a large data
sample, multiple samples are built by drawing observations from it. The observations are
drawn one at a time and returned to the sample after they have been chosen. Hence, a
particular observation can be included in a sample more than once. This is also known as
sampling with replacement.
Fig 3.5 Bootstrapping

The above figure illustrates bootstrapping. Each model is trained on a bootstrap sample and
can then be used to predict the observations that were not selected for that sample. The
samples not selected are referred to as “out-of-bag (OOB)” samples.
Bagging, or bootstrap aggregation, is the application of the bootstrap procedure to a
high-variance machine learning algorithm, such as decision trees.
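The sampling with replacement described above can be sketched in a few lines. This is an illustrative Python sketch, not the MATLAB code used in this project; the function name `bootstrap_sample` and the sample size of 10 are assumptions chosen for the example.

```python
import random


def bootstrap_sample(n_observations, rng):
    """Draw n indices with replacement; return (in-bag, out-of-bag) index lists."""
    # Each draw returns the observation to the pool, so repeats are possible.
    in_bag = [rng.randrange(n_observations) for _ in range(n_observations)]
    chosen = set(in_bag)
    # Observations never drawn form the out-of-bag (OOB) set.
    out_of_bag = [i for i in range(n_observations) if i not in chosen]
    return in_bag, out_of_bag


rng = random.Random(0)  # fixed seed (assumption) so the sketch is repeatable
in_bag, out_of_bag = bootstrap_sample(10, rng)
```

In bagging, one tree would be trained on the in-bag observations of each bootstrap sample, and the out-of-bag observations could serve as a held-out check for that tree.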
3.3.2 Decision trees
Decision trees represent decisions and decision making visually and explicitly.
A decision tree is drawn upside down, with its root at the top. Trees used for
classification are called classification trees, and trees used to predict continuous
values are called regression trees. Together, they are known as CART (Classification
and Regression Trees).
All the features are considered for growing the trees. The trees are split at the internal
nodes on the basis of a cost function. At each node the split that costs the least is chosen,
which is why the procedure is known as a greedy algorithm. The Gini score measures how well a
node is split. The objective of growing these trees is to have pure nodes, that is, nodes where
all samples belong to the same class. The worst purity occurs when a node has a 50-50 split of
samples between two classes. The split threshold is chosen so that the reduction in Gini
impurity (Gini before the split minus Gini after the split) is the largest. The tree stops
splitting when it reaches a leaf node.
The Gini score is $G = \sum_k p_k (1 - p_k)$, where $p_k$ is the proportion of inputs of
class $k$ present in the particular group. For the best split, each $p_k$ is either 1 or 0 and
$G = 0$, whereas for the worst two-class split, each $p_k$ is 0.5 and $G = 0.5$.
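The Gini score and the "Gini before split minus Gini after split" criterion can be checked with a small Python sketch. It is illustrative only and not the MATLAB implementation used in this project; the function names and the weighting of the child nodes by their sample share are assumptions, the latter being the usual CART convention.

```python
from collections import Counter


def gini(labels):
    """Gini score G = sum_k p_k * (1 - p_k) for the class labels in one node."""
    n = len(labels)
    if n == 0:
        return 0.0
    return sum((c / n) * (1 - c / n) for c in Counter(labels).values())


def gini_gain(parent, left, right):
    """Gini before the split minus the size-weighted Gini after the split."""
    n = len(parent)
    after = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(parent) - after


pure_node = ["a", "a", "a", "a"]   # all samples in one class
mixed_node = ["a", "a", "b", "b"]  # 50-50 split between two classes
```

Here `gini(pure_node)` is 0 and `gini(mixed_node)` is 0.5, matching the best and worst cases described above; splitting `mixed_node` into its two classes gives a gain of 0.5.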
Pruning is adopted to improve the performance of a tree. The branches that contain
features having low importance are removed. This reduces the complexity of the tree and can
improve its predictive power on unseen data. Pruning also reduces overfitting.
Fig 3.6 Example of a decision tree
The above figure demonstrates a simple decision tree when a display is broken in a unit.
3.3.3 Random forest as a classifier
Random forests behave slightly differently from decision trees. A single decision tree
searches all the features for the most important one when a node splits, whereas a random
forest selects the best feature from a random subset of the features. Random forests also
measure the relative importance of each feature for the prediction. Deep decision trees can
sometimes suffer from overfitting. Because a random forest builds its trees on random subsets
of features and combines them, it is much less prone to overfitting.
Random forest is an ensemble method in which a group of weak learners come together to
form a strong learner, which increases the accuracy of the model. When several decision trees
are combined, they perform better than a single decision tree. Ensemble methods
help reduce sources of error such as variance and bias, which contribute to the
difference between actual and predicted values.
In classification problems, the concept of “majority voting” is used. Each randomly generated
decision tree applies its own rules to the test features and predicts an outcome. The number
of votes for each predicted target is counted, and the target which receives the highest
number of votes is taken as the final prediction of the random forest algorithm.
For example, if 100 decision trees are randomly formed in the random forest, each tree
predicts a target for the same test feature, and the trees may disagree. Suppose 70 decision
trees predict the target value as ‘A’. The target ‘A’ has then received more votes than
targets ‘B’ and ‘C’, so the random forest classifier returns ‘A’ as the predicted target.
The figure shown below illustrates classification using random forests.
Fig 3.7 Classification using random forest
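The majority-voting example with 100 trees can be written as a short Python sketch. It is illustrative only; the split of the remaining 30 votes between ‘B’ and ‘C’ is an assumption, since the text only states that 70 trees vote for ‘A’.

```python
from collections import Counter


def majority_vote(tree_predictions):
    """Return the target predicted by the largest number of trees."""
    return Counter(tree_predictions).most_common(1)[0][0]


# 70 trees predict 'A'; the 20/10 split between 'B' and 'C' is assumed.
votes = ["A"] * 70 + ["B"] * 20 + ["C"] * 10
forest_prediction = majority_vote(votes)
```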

3.3.4 Random forest in regression
In regression trees, the target value is a real-valued number. A regression model is fit to the
target variable using each of the independent variables. For each independent variable,
the data is split at several split points, and the Sum of Squared Errors (SSE) between the
predicted and the actual values is calculated at each split point. The variable and split
point resulting in the minimum SSE are selected for the node. This process is continued
recursively until the entire data is covered.
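The search for the split point with the minimum SSE can be sketched for a single independent variable as follows. This is an illustrative Python sketch rather than the MATLAB routine used in this project; the function names, the use of midpoints between sorted values as candidate thresholds, and the toy data are assumptions.

```python
def sse(values):
    """Sum of squared errors of the values around their own mean."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(x, y):
    """Return (threshold, total SSE) of the best split on one variable."""
    pairs = sorted(zip(x, y))
    best_threshold, best_sse = None, float("inf")
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold can separate identical feature values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [target for _, target in pairs[:i]]
        right = [target for _, target in pairs[i:]]
        total = sse(left) + sse(right)  # each side predicts its own mean
        if total < best_sse:
            best_threshold, best_sse = threshold, total
    return best_threshold, best_sse


# Toy data (assumed): the target jumps from 100 to 300 between x = 3 and x = 10.
threshold, error = best_split([1, 2, 3, 10, 11, 12],
                              [100, 100, 100, 300, 300, 300])
```

A full regression tree would repeat this search over every independent variable at every node, and a regression forest would average the predictions of its individual trees.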
Fig 3.8 Random forest for regression
The above figure is an example of regression using random forest algorithm.
Hence, random forest algorithm can be used for solving both classification and regression
problems.
3.3.5 Advantages and disadvantages of random forest algorithm
There are many advantages and a few disadvantages of using the random forest algorithm.
Advantages:
• Used for both classification and regression problems
• Good results are often obtained with the default hyperparameters
• The relative importance assigned to the input features can be easily viewed
• The model can be developed easily
• Low risk of overfitting compared with a single decision tree
Disadvantages:
• If the number of trees is very large, the algorithm becomes slow and ineffective for
real-time predictions
• Training is fast, but prediction can be slow because every tree must be evaluated
• For higher accuracy, more trees are required, which makes the model slower
In most real-world applications, the random forest algorithm is fast enough. Random forests
are generally used in applications such as stock markets, e-commerce, medicine, banking and
various other sectors.
In this dissertation, the random forest algorithm will be used to predict the travel time.
CHAPTER 4 – METHODOLOGY
4.1 Introduction
This chapter guides us through the approach taken up for implementation of the system. It
gives us a brief understanding of the key steps involved in the project and their functions. It
also introduces many terms and specifications that will be used extensively in the chapters
to come.
4.2 Project pipeline
The main aim of this project is to predict the travel time taken by vehicles to move
from one segment to another. Since there are already many systems designed to predict the
travel time, the primary challenge is to predict the travel time with high
accuracy and little deviation from the expected values using the random forest algorithm.
The entire system can be broken down into four main parts:
• Traffic data collection
• Pre-processing of data
• Dataset preparation
• Application of random forest algorithm to the data set, to predict the travel time
The following figure gives us a structure of the approach selected to implement the project:
[Flowchart: Traffic data collection → Data pre-processing → Data set preparation → Travel time prediction]
Fig 4.1 Implementation structure of the project
4.2.1 Traffic data collection
As discussed earlier in chapter 2, traffic data collection is an important step in any
traffic-related study. To predict the travel time between segments, traffic data of
the segments must be collected for many days so that the model learns the patterns and
variations in the data and makes accurate predictions.
In our study, we use the data collected by LTA Singapore for the Westbound line.
LTA has collected data from 29 segments in the Westbound line.
The figure below represents the segments in the Westbound and the Eastbound line.
Fig 4.2 Segments in Westbound and Eastbound line
The travel time between segments has been collected over a period of sixteen months
(Nov 2008 to Feb 2010). The data is collected every three minutes.
The 29 segments considered along the Westbound line are not of uniform length. The length
of the segments varies from 500 m to 6000 m.
Segment number    Segment name    Length (in meters)
1                 40010           2000
2                 40015           500
3                 40020           2000
4                 40025           500
5                 40030           500
6                 40035           500
7                 40040           1500
8                 40045           2000
9                 40050           1500
10                40055           500
11                40060           2000
12                40065           500
13                40070           4000
14                40075           500
15                40080           1000
16                40085           500
17                40090           1000
18                40095           2500
19                40100           500
20                40105           1000
21                40110           500
22                40115           3500
23                40120           3500
24                40125           500
25                40130           6000
26                40135           500
27                40140           3000
28                40145           1000
29                40150           2000
Table 4.1 Length of segments
The above table gives us the length of the 29 segments considered.
The data collected over the 16 months has a few missing values, and the data is not
collected every three minutes on all days. It is observed that from the 2nd of March
2009 to the 1st of July 2009 there is no missing data, and the frequency of data collection
remains the same throughout this period. Hence, for better analysis we have considered travel
time data only from this period. Since there is no missing data, the traffic pattern in
the segments can be identified easily.
4.2.2 Preprocessing of data
The raw data collected cannot be used directly. The data contains extreme values known as
outliers. Outliers either result from experimental errors or reflect genuine variations in
the data. Outliers due to experimental errors must be removed. Removal of outliers is a very
important step in data analysis, since outliers should not be the basis of the results.
Outliers in the data can be identified using various methods, namely:
• Box and whisker plots
• Scatter plot
• Z-score etc.
Once the outliers are removed, the data has to be trimmed or the gaps must be filled. The
gaps can be filled with:
• Nearest good data
• Mean of the data
• Median of the data etc.
In this project, the entire processing of the data, the creation of data sets and the application
of the random forest algorithm on the data sets are done in MATLAB, which
also provides tools for handling outliers. We have used the ‘filloutliers’
command to identify the outliers and fill the gaps with the ‘mean’ value of the data set.
Each segment is considered separately. All the data points in a segment are
examined, and any point more than three standard deviations from the mean is identified and
replaced. This method is fast and robust in performance.
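The three-standard-deviation rule can be illustrated with a short Python sketch for one segment. It is not the MATLAB command used in this project and does not claim to reproduce its exact behaviour. The function name, the use of the sample standard deviation, and the choice to fill with the mean of the remaining (non-outlier) points are assumptions.

```python
import statistics


def fill_outliers_with_mean(values, n_sigma=3.0):
    """Replace points more than n_sigma standard deviations from the mean.

    Needs at least two values. Outliers are replaced by the mean of the
    remaining points (an assumption made for this sketch).
    """
    mean = statistics.mean(values)
    std = statistics.stdev(values)  # sample standard deviation
    if std == 0:
        return list(values)  # constant series: nothing to replace

    def is_inlier(v):
        return abs(v - mean) <= n_sigma * std

    fill_value = statistics.mean([v for v in values if is_inlier(v)])
    return [v if is_inlier(v) else fill_value for v in values]


# Toy travel times in seconds (assumed): twenty normal readings and one spike.
raw = [100] * 20 + [1000]
cleaned = fill_outliers_with_mean(raw)
```

With very short series the rule cannot fire, because no single point can lie three sample standard deviations from the mean of only a handful of values; the toy series therefore uses 21 readings.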
4.2.3 Dataset preparation
After the outliers are removed and filled with the mean value of the particular
segment, datasets must be prepared. The following datasets must be prepared:
• Training dataset
• Cross-validation dataset
• Testing dataset
The training dataset is used by the random forest model to learn the data. The model
makes predictions based on what it has learned from the training dataset. The cross-validation
dataset is used to tune the hyperparameters of the model to obtain a high accuracy of
prediction. The validation dataset is also known as the development set. Using a
cross-validation dataset helps in comparing the performance of the model with respect to
various parameters, so that the parameters which give the best results can be chosen. The
cross-validation dataset functions as a hybrid: it is part of the training data but is used
for testing during tuning. The test dataset is independent of the training dataset. The chosen
prediction algorithm is applied to the test dataset to check the performance of the model on
unseen data.
From our data, we have chosen 88 days from the period 2nd March 2009 to 1st July 2009 for
the preparation of data sets. Training, cross-validation, and testing datasets have been
created accordingly. The table below shows the division of data into different datasets:
DATASET                         NUMBER OF DAYS
TRAINING (training)             66
TRAINING (cross-validation)     7
TESTING                         15
Table 4.2 Dataset division
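The 66/7/15 division of the 88 days can be sketched in Python as below. The text does not state how the days were assigned to each dataset, so the chronological order used here is an assumption, as are the function name and the day numbering from 1 to 88.

```python
def split_days(days, n_train=66, n_validation=7, n_test=15):
    """Split an ordered list of days into training, validation and test sets."""
    if len(days) != n_train + n_validation + n_test:
        raise ValueError("number of days does not match the requested split")
    train = days[:n_train]
    validation = days[n_train:n_train + n_validation]
    test = days[n_train + n_validation:]
    return train, validation, test


# Days are numbered 1..88 for illustration; chronological order is assumed.
train_days, validation_days, test_days = split_days(list(range(1, 89)))
```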

4.2.4 Travel time prediction using random forest algorithm
A random forest model must be developed and trained with the training datasets. The
model learns from the training datasets and then predicts the travel time. The variables in the
developed random forest model are:
• Number of trees
• Type of bagging
• Number of previous time steps
The number of trees plays an important role in determining the accuracy of the model. We
need to determine the optimum number of trees, which gives the highest accuracy or the least
Mean Absolute Percentage Error (MAPE).
As discussed earlier, random forest algorithms can be used to solve both regression and
classification problems. In our case, we are using the model to predict the travel time, which
is a continuous value. Hence the type of bagging is selected as regression.
The traditional traffic analysis period is generally considered as 15 minutes, and the data is
collected every three minutes. Hence, we consider the previous five steps (5 × 3 = 15 minutes)
to predict the sixth step.
For example, travel times t1, t2, t3, t4 and t5 are considered to predict the travel time t6, and
travel times t2, t3, t4, t5 and t6 are considered to predict the travel time t7, and so on.
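The sliding window of five previous steps can be sketched in Python as follows. This is an illustration of the windowing idea only, not the MATLAB code used in this project; the function name and the toy series are assumptions.

```python
def make_lagged_samples(travel_times, n_lags=5):
    """Build (features, targets): each target is predicted from n_lags earlier steps."""
    features, targets = [], []
    for i in range(n_lags, len(travel_times)):
        features.append(travel_times[i - n_lags:i])  # e.g. t1..t5
        targets.append(travel_times[i])              # e.g. t6
    return features, targets


# The integers 1..7 stand in for the travel times t1..t7.
X, y = make_lagged_samples([1, 2, 3, 4, 5, 6, 7])
```

Here the first row of `X` holds t1 to t5 with target t6, and the second row holds t2 to t6 with target t7, as in the example above.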
The flowchart given below illustrates the steps involved to estimate the travel time using
random forest model.
Fig 4.3 Flow chart of the random forest model

The steps involved can be summarized as follows:
1) The random forest regression tree is built and it is trained with the training dataset
which includes travel time of 66 days.
2) The cross-validation dataset (travel time data of 7 days) is used to determine the
optimum number of trees required for each segment. The number of trees is a
hyperparameter in this algorithm which determines the performance of the model.
The number of trees is varied from 25 to 500 in steps of 25 trees for all the 29
segments. Mean Absolute Percentage Error (MAPE) is calculated. The number of
trees which gives the least MAPE is chosen as the optimum number of trees for that
particular segment and the model is trained with it.
[Plot for Segment 1: MAPE against number of trees (x25)]
Fig 4.4 Variations in MAPE with the number of trees
3) The travel time of the test dataset is predicted.
4) The deviations between the actual travel time and the predicted travel time are noted. Mean
Absolute Error (MAE) and Mean Absolute Percentage Error (MAPE) are calculated to
evaluate the performance of the model.
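The selection of the number of trees in step 2 amounts to a search over the candidates 25, 50, ..., 500 for the lowest validation MAPE. The Python sketch below shows only that search. The callable `validation_mape` is hypothetical and stands in for training a forest with the given number of trees and scoring it on the cross-validation dataset, and `toy_mape` is an invented error curve used purely to exercise the function.

```python
def select_num_trees(validation_mape, candidates=range(25, 501, 25)):
    """Return (number of trees, MAPE) with the lowest validation MAPE."""
    best_n, best_error = None, float("inf")
    for n_trees in candidates:
        error = validation_mape(n_trees)  # train and score one candidate
        if error < best_error:
            best_n, best_error = n_trees, error
    return best_n, best_error


def toy_mape(n_trees):
    """Invented error curve (assumption) with its minimum at 150 trees."""
    return 3.1 + abs(n_trees - 150) / 1000.0


best_n, best_error = select_num_trees(toy_mape)
```

In the procedure described above, this search would be repeated separately for each of the 29 segments.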
CHAPTER 5 – RESULTS
5.1 Introduction
This chapter illustrates the performance of the random forest algorithm in predicting the
travel time between the segments. The performance of the algorithm is measured in terms of:
• Mean Absolute Error (MAE): This error gives the average of the absolute differences
between the actual values and the predicted values.
• Mean Absolute Percentage Error (MAPE): The size of the error between the actual and the
predicted values is measured as a percentage of the actual values. It is one of the most
widely used measures of forecasting error and of model performance.
The lower the MAE and MAPE, the better and more robust the system is. The accuracy of the
model increases as the MAE and MAPE values decrease.
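The two error measures can be written out as a short Python sketch. It is illustrative and not the MATLAB code used in this project; the function names and the toy travel times are assumptions, and the MAPE function assumes that no actual value is zero.

```python
def mae(actual, predicted):
    """Mean Absolute Error: average of |actual - predicted|."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def mape(actual, predicted):
    """Mean Absolute Percentage Error, in percent; actual values must be non-zero."""
    ratios = [abs(a - p) / abs(a) for a, p in zip(actual, predicted)]
    return 100.0 * sum(ratios) / len(ratios)


# Toy travel times in seconds (assumed values).
actual_times = [100, 200, 400]
predicted_times = [110, 190, 400]
```

For these toy values the absolute errors are 10, 10 and 0 seconds, so the MAE is about 6.67 seconds, and the percentage errors are 10, 5 and 0 percent, so the MAPE is 5 percent.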
This chapter also illustrates the comparison of the performances of various models in
predicting travel time between segments. The random forest model is compared with other
models such as:
• Support Vector Machine (SVM)
• Historical average
• Neural networks and
• Simple linear regression
The above four models are traditional models used to predict the travel time, and they
perform quite satisfactorily. Hence, these methods are considered for comparison with the
random forest model.
5.2 Performance Evaluation
In this section, the performance of random forest algorithm in travel time prediction is
evaluated.
5.2.1 Performance evaluation in temporal domain
[Plot: actual and predicted travel time (s) against time index (3 mins each)]
Fig 5.1 Actual vs predicted travel time for test day 1 in segment 9
[Plot: travel time (s) against time index]
The above graphs illustrate the actual and the predicted travel time for a particular segment
and a particular day. The ninth segment, which is 1.5 km in length, and the twelfth segment,
which is 0.5 km in length, are considered, and test day 1 is chosen in these plots.
It is observed from the graphs that the predicted travel times are very close to the actual
travel times, which indicates that the random forest model is performing satisfactorily.
Similarly, the performance of the model can be evaluated considering any particular day or
any particular segment.

[Plot: travel time (s) against time index (3 mins each)]
[Plot: travel time (s) against time index]
In the above figures, the performance of the model is evaluated for test day 2 considering the
same ninth and twelfth segments. As observed in the figures, there are no large
deviations between the actual and the predicted travel times. This suggests that the model is
robust and performs well across the test days.
The actual and predicted travel times are used to calculate MAPE. MAPE is an important
measure to evaluate the performance of the model. In terms of temporal variations, MAPE
values can be calculated for all the test days. The maximum and minimum MAPE values also
reflect the model’s performance. Lower MAPE values indicate smaller errors and better
performance.
[Plot: MAPE against day index]
Fig 5.5 Variations in MAPE in segment 9 for different test days
The above figure represents the variation in MAPE values across the test days when
a particular segment is considered. In this graph, segment 9 is chosen, which is 1.5
km in length. It is observed from the figure that the least MAPE value obtained for segment 9
is around 0.49 percent and the highest MAPE value obtained is around 3.33 percent. Hence
the average deviation of predicted values from actual values does not exceed about 3.33
percent on any test day, indicating high accuracy of the model.
[Plot: MAPE against day index]
Fig 5.6 Variation in MAPE in segment 12 for different test days

The variations of MAPE in segment 12, which is 0.5 km in length, are represented in the
previous figure. The minimum MAPE is 0.68 percent and the maximum MAPE is around 2.43
percent.
From these graphs, it is evident that the random forest model is performing well in
the temporal domain. The system is quite robust and gives good accuracy across all the test
days considered.
5.2.2 Performance evaluation in spatial domain
In the previous section, the behavior of the random forest model in the temporal domain was
discussed. The model is considered to be robust if it performs well in both the temporal and
the spatial domain. The random forest model should perform well not only for variations in
time but also across different segments.
[Plot: actual and predicted travel time (s) against Segment_ID]
Fig 5.7 Traffic state of day 1 in the interval 12.30 to 12.33 AM
The above figure shows the actual travel time and the predicted travel time for a particular
time slot, 12.30 to 12.33 AM on test day 1, across all the segments. It is clearly observed that
the deviation of the predicted travel time from the actual travel time is very small.
The performance can also be evaluated considering another interval. The interval now
considered is 11.42 to 11.45 PM of test day 1.

[Plot: actual and predicted travel time (s) against Segment_ID]
Fig 5.8 Traffic state of day 1 in the interval 11.42 to 11.45 PM

From this plot, it is clear that the system performs well across all segments when
different time slots are considered. The deviations of the predicted travel time from the
actual travel time are negligible.
In the spatial domain, the MAPE is calculated across all the segments for each
time slot. The MAPE variations for different days in the spatial domain are calculated and
plotted.
[Plot: MAPE against time slot]
Fig 5.9 MAPE variations across all time slots for test day 1
The previous figure represents the MAPE variations across all the time slots for test day 1. It
is observed that the minimum MAPE value is 0.54 percent and the maximum MAPE value is
around 6.47 percent. This indicates very small deviations of the predicted travel time from
the actual travel time.
The MAPE variations across test day 2 are also plotted in the spatial domain.
[Plot: MAPE against time slot]
Fig 5.10 MAPE variations across all time slots for test day 2
The above graph indicates that the minimum MAPE value for test day 2 in the spatial domain
is 0.54 percent and the maximum MAPE value obtained is 6.90 percent.
Hence the MAPE variations across the spatial domain are represented in this section. From
this, we can conclude that the proposed random forest model performs well in both the temporal
and the spatial domain. The MAPE obtained in both cases is low, and the model is
efficient and robust.
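The difference between the temporal and the spatial view of MAPE can be made concrete with a Python sketch. It assumes, for illustration only, that the actual and predicted travel times of one test day are stored as a grid with one row per segment and one column per time slot. The function names and the 2 x 2 toy grid are assumptions, not the data layout used in this project.

```python
def mape(actual, predicted):
    """Mean Absolute Percentage Error, in percent; actual values must be non-zero."""
    ratios = [abs(a - p) / abs(a) for a, p in zip(actual, predicted)]
    return 100.0 * sum(ratios) / len(ratios)


def mape_per_segment(actual, predicted):
    """Temporal view: one MAPE per segment, taken over all time slots of the day."""
    return [mape(a_row, p_row) for a_row, p_row in zip(actual, predicted)]


def mape_per_time_slot(actual, predicted):
    """Spatial view: one MAPE per time slot, taken over all segments."""
    # zip(*grid) transposes the grid so that each tuple is one time slot.
    return [mape(a_col, p_col)
            for a_col, p_col in zip(zip(*actual), zip(*predicted))]


# Toy grid (assumed): 2 segments (rows) x 2 time slots (columns), in seconds.
actual_grid = [[100, 200], [50, 400]]
predicted_grid = [[110, 200], [50, 320]]
temporal = mape_per_segment(actual_grid, predicted_grid)
spatial = mape_per_time_slot(actual_grid, predicted_grid)
```

For the toy grid, both views give 5 percent and 10 percent: the first segment and the first time slot each contain one 10 percent error, and the second segment and the second time slot each contain one 20 percent error.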
5.2.3 Spatiotemporal error patterns

Fig 5.11 Heatmap representing spatiotemporal error patterns in terms of MAE
The above heatmap represents the spatiotemporal error patterns in terms of MAE. The MAE
across all the segments is represented considering all the time slots throughout the day. The
grids in blue indicate lower MAE values and the grids in red indicate higher MAE values. It
is observed from the heatmap that the grid is red at only a few time slots, which
correspond to heavy traffic congestion. Those time slots are considered as peak hours. At these
peak hours, the deviations also increase.
Similarly, the spatiotemporal error patterns can be represented in terms of MAPE.

Fig 5.12 Heatmap representing spatiotemporal error patterns in terms of MAPE
The grids in red indicate higher MAPE values thus representing traffic congestion.
5.2.4 MAPE variations for all segments
[Plot: MAPE against Segment_ID]
Fig 5.13 Variations in MAPE for all segments
The above figure represents the MAPE variations for all segments. It is observed that the
minimum MAPE obtained is 0.08 percent for segment 24, which is 0.5 km in length. The
maximum MAPE obtained is 3.53 percent for segment 7, which is 1.5 km in length. Hence the
random forest algorithm performs well, with small deviations from the actual values,
and the accuracy obtained is high.
5.3 Comparison of random forest model with other models
From the previous discussions, we observe that the random forest algorithm performs
satisfactorily in predicting the travel time. In this current section, the performance of random
forest algorithm is compared with other algorithms like SVM, historical average, neural
networks and simple linear regression.
CHAPTER 6 – CONCLUSION AND FUTURE WORK
6.1 Conclusion
In this thesis, we addressed the problem of travel time prediction using the random forest
algorithm. The traffic data collected from LTA, Singapore was utilized to determine the
travel time between the segments in the Westbound line. A random forest model is built to
predict travel time. The number of trees for regression is decided using the cross-validation
datasets. The number of trees which gives the least MAPE for each segment is chosen as the
optimum number of trees and the model is accordingly developed. Once the model learns
from the training data set, it can be used to predict the travel time of the test dataset.
From the previous chapter, one can conclude that the random forest algorithm is well suited
to predicting the travel time. The results obtained using this model have a high
level of accuracy. The maximum MAPE obtained using random forest is around 3.5
percent, which indicates that the accuracy of the model is high. In both the
temporal and the spatial domain, the random forest model gives good results. It is a
robust algorithm.
In comparison with other traditional models, it is observed that the random forest model
outperforms the others. Hence, the Intelligent Transport System (ITS) can adopt this
algorithm in the prediction of travel times in real-time scenarios. Not only does this model
perform well, but it is also easy to develop in a short span of time. It does not require
long hours of training and it is not prone to overfitting. With the accurate prediction of
travel time, the problem of traffic congestion can be reduced and the passengers can plan
their trip accordingly.
6.2 Future work
The accuracy achieved in this implementation is around 96.5% and could be increased
by taking some further steps. The random forest algorithm can be explored
further by increasing the range of the number of trees considered. The number of intervals
considered for predicting the travel time can also be varied.
Deep learning methods are gaining a lot of importance these days. In the random forest
algorithm discussed above, we consider each segment separately and then build the
model. With deep learning methods, we can design a single model which works for
all the segments. High-dimensional data can be handled by deep learning
models like Long Short Term Memory (LSTM) and Convolutional Neural Network (CNN)
models. These methods are scalable and suitable for network-scale travel time prediction as
well. Hence a deep learning approach can be used to predict the travel time between segments.
REFERENCES
[1] Sumit Mallik. Intelligent transportation system. International Journal of Civil
Engineering Research, 5(4):376-372, 2014.
[2] Oded Cats and Gerasimos Loutos. Real-time bus arrival system: an empirical evaluation.
Journal of Intelligent Transportation Systems, 20(2):138-151, 2016.
[3] Luis Moreira-Matias, Joao Mendes-Moreira, Jorge Freire de Sousa, and Joao Gama.
Improving mass transit operations by using avl based systems: A survey. IEEE Transactions
on Intelligent Transportation Systems, 16(4): 1636-1653, 2015.
[4] R Prabha and Mohan G Kabadi. Overview of data collection methods for intelligent
transportation systems. The International Journal Of Engineering And Science (IJES),
5(3):16-20, 2016.
[5] Yang Li, Dimitrios Gunopulos, Cewu Lu, and Leonidas Guibas. Urban travel time
prediction using a small number of gps floating cars. In Proceedings of the 25th ACM
SIGSPATIAL International Conference on Advances in Geographic Information Systems,
page 3. ACM, 2017.
[6] Sehyun Tak, Sunghkoon Kim, Kiate Jang, and Hwasoo Yeo. Real-time travel time
prediction using multi-level k-nearest neighbor algorithm and data fusion method. In
Computing in Civil and Building Engineering (2014), pages 1861-1868, 2014.
[7] Jiwon Myung, Dong-Kyu Kim, Seung-Young Kho, and Chang-Ho Park. Travel time
prediction using k nearest neighbor method with combined data from vehicle detector system
and automatic toll collection system. Transportation Research Record: Journal of the
Transportation Research Board, (2256):51-59, 2011.
[8] Jiandong Zhao, Yuan Guo, and Zhiming Bai. Travel time prediction of expressway based
on multi-dimensional data and the particle swarm optimization-autoregressive moving
average with exogenous input model. Advances in Mechanical Engineering, 10(2):
1687814018760932, 2018.
[9] W Suwardo, Madzlan Napiah, and Ibrahim Kamaruddin. Arima models for bus travel
time prediction. Journal of the institute of engineers Malaysia, pages 49-58, 2010.
[10] Jinhwan JANG. Short-term travel time prediction using the kalman filter combined with
varaiable aggregation interval scheme. Journal of the Eastern Asia Society for Transportation
Studies, 10:1884-1895, 2013.
[11] Leone Pereira Masiero, Marco Antonio Casanova, and Marcelo Tilio M de Carvalho.
Travel time prediction using machine learning. In Proceedings of the 4th ACM SIGSPATIAL
International Workshop on Advances on Computational Transportation Science, pages 34-38.
ACM, 2011.
[12] Johar Amita, SS Jain, and PK Garg. Prediction of bus travel time using ann: a case study
in delhi. Transportation Research Procedia, 17:263-272, 2016.
[13] Zegeye Kebede Gurmu and Wei David Fan. Artificial neural network travel time
prediction model for buses using only gps data. Journal of Public Transportation, 17(2):3,
2014.
[14] Zhang Junyou, Wang Fanyu, and Wang Shufeng. Application of support vector machine
in bus travel time prediction. International Journal of Systems Engineering, 2(1):21, 2018.
[15] Subrina Akter, Lutfun Nahar, Shamima Akter and Tanjil Huda. Travel Time Prediction
using Support Vector Machine (SVM) and Weighted Moving Average (WMA). International
Journal of Engineering Research and Technology, 2278-0181, 2015.
[16] M Yang, C Chen, L Wang, X Yan, and L Zhou. Bus arrival time prediction using support
vector machine with genetic algorithm. Neural Network World, 26(3):205, 2016.
[17] Matthias Kormaksson, Luciano Barbosa, Marcos R Vieira, and Bianca Zadrozny. Bus
travel time predictions using additive models. In 2014 IEEE International Conference on
Data Mining, pages 875-880. IEEE, 2014.
[18] Rudra Pratap Deb Nath, Hyun-Jo Lee, Nihad Karim Chowdhury, and Jae-Woo Chang.
Modified k-means clustering for travel time prediction based on historical travel data. In
International Conference on Knowledge Based and Intelligent Information and Engineering
Systems, pages 511-521. Springer, 2010.
[19] Walaa Alajali, Wei Zhou, Sheng Wen, and Yu Wang. Intersection traffic prediction using
decision tree models. Symmetry, 10(9):386, 2018.
[20] Mohammed Elhenaway, Abdallah A. Hassan, and Hesham Rakha. Travel time modeling
using spatiotemporal speed variation and a mixture of linear regressions. pages 113-120, 01
2018.
[21] Yi Hou and Praveen Edara. Network scale travel time prediction using deep learning.
Transportation Research Record, page 0361198118776139, 2018.
[22] Alireza Ermagun and David Levinson. Spatiotemporal traffic forecasting: review and
proposed directions. Transport Reviews, 38(6):786-814, 2018.
[23] Google Images, google, www.google.co.sg
