
Dissertation- Predicting Time at Door

Konstantinos-Michail Mylonas

September 7, 2017

Contents

I Abstract

II Introduction

1 Purpose
2 Background
3 Acknowledgements
4 Literature Overview

III Methodology

5 Pre-Processing
  5.1 Data
  5.2 Feature Engineering
6 Exploratory Data Analysis
7 Statistical Modelling
  7.1 Multilevel Modelling
  7.2 Neural Networks
  7.3 Gaussian Processes
  7.4 Decision Trees
    7.4.1 Chi-Square
  7.5 Information Gain
    7.5.1 Other Splitting Criteria
    7.5.2 Ensemble Methods

IV Results

  7.6 Predicting Estimated Delivery Time
    7.6.1 Decision Trees
    7.6.2 Deep Neural Networks
    7.6.3 Gaussian Processes
  7.7 Recommendation

V Discussion

List of Figures

1 The graph depicts a clear pattern in the early product codes; as the weight of products increases in the later product types, there is a break from the pattern.
2 The weights of items found in each product category increase as the product code increases; the presence of outliers diminishes in the later product types.
3 The weights of items found in each product category increase as the product code increases; the presence of outliers diminishes in the later product types.
4 As the number of individual pieces increases, so does the estimated delivery time; outliers corresponding to larger items are prevalent in the data.
5 No particular difference in delivery time across the different time slots; however, there are many outliers in the morning slot.
6 Mean delivery time against the number of orders, sorted by date; as the number of orders increases, the fewer the chances of having available data, while most single orders concern larger items.
7 The predicted values follow the actual line, indicating that in most cases the two values are close; the occasional large estimated delivery time also captures the attention.
8 The predicted values follow the actual line, indicating that in most cases the two values are close; the occasional large estimated delivery time also captures the attention.
9 Even though in most cases delivery time is predicted close to the actual value, there is a systematic pattern of inflating the estimated delivery time.
10 The predicted values follow the actual line, indicating that in most cases the two values are close; the occasional large estimated delivery time also captures the attention.
11 Collection item and product type play the most significant part in estimating the duration of delivery, followed by item weight and the number of individual pieces of which an order is comprised.
12 The predicted values follow the actual line, indicating that in most cases the two values are close; the occasional large estimated delivery time also captures the attention.
13 Scatter plot of predicted against actual time at door for the Gaussian process model.
List of Tables

1 Description of the DFS DET data set.
2 Most of the available information was not included in the final data set; other covariates needed to be transformed to facilitate their use by machine learning algorithms.
3 Description of the DFS DET data set.
4 Key information found in AppoloOrderDetails enriches the data, helping to incorporate spatial-temporal information into the data set.
5 Output of the multilevel model: random effects.
6 Output of the multilevel model: fixed effects.

Part I

Abstract
This article considers the application of machine learning algorithms to optimising routing schedules in the delivery industry by tackling a fundamental problem: accurately predicting the time needed to deliver an order. Under the current operational research algorithms, delivery time is predicted in an ad hoc manner without utilising past data on the process. By contrast, the present research uses past data to estimate delivery time, in an attempt to ensure a better customer experience. In this report, the results from several machine learning algorithms are compared. Although the results initially met the project's goal of predicting 60% of orders within a three-minute time window with reasonable error metrics, hardware limitations constrained the potential of the machine learning algorithms.

Part II

Introduction
Operational research is a science dedicated to applying advanced analytical methods to support better decisions in real-world problems. One of its main areas of concern is route optimisation: a family of problems which attempts to find the optimal set of routes for a fleet of vehicles to traverse in order to deliver to a given set of customers. Its origins lie in the 19th century, when Hamilton first formulated the problem mathematically. However, like all branches of operational research, it came into prominence during World War 2 and the subsequent decades. At that time, experts relied on advanced mathematical tools such as combinatorics and integer programming to solve their problems. As time moved forward, new tools either emerged or found application in this particular set of problems. Most recently, efforts have been made to apply data science methods to route planning. In more detail, a key parameter in this problem is the time that delivery services need to finish a delivery. A novel approach uses past data on orders to estimate the time which delivery experts spend with the client, applying machine learning algorithms to find patterns between delivery time and available covariates, such as item characteristics and the spatial information hidden in the delivery address.

1 Purpose

The purpose of this project is to build predictive models that estimate delivery time and to compare the developed models in order to prescribe the best course of action for estimating time at door. Additionally, the project explores the appropriateness of external data resources, such as Google Maps, and how they might be exploited to enrich more traditional approaches.

2 Background

In today's world, globalisation has succeeded in bringing the world together through trade. However, a far less known aspect of it is the struggle to move products in the most efficient manner. Indeed, as the volume of products increases, it becomes more vital to rely on scientific ways of optimising route planning, avoiding delays in the execution plan or missed opportunities to deliver items. In this particular context, a retail company sought the aid of the data science company Satalia, providing data from every stage of the delivery process in order to estimate the delivery time of orders.

3 Acknowledgements

I would like to express my deepest appreciation to my supervisor Professor Chris Nemeth, whose selfless time and care were sometimes all that helped me endure; to Mrs Vega and Satalia for trusting me with this project and providing ample guidance; and to Dr Simon Tomlinson for bringing all these opportunities to Lancaster University. Finally, a thank you to all my professors for instilling in me the pursuit of excellence under the most dire circumstances.

4 Literature Overview

The travelling salesman problem (TSP) is a great chapter in operational research, asking the following question: given a list of cities and the distances between each pair of cities, what is the shortest possible route that visits each city exactly once and returns to the origin city? Expressed more formally: given a length L, the task is to decide whether the graph has any tour shorter than L. Having arisen from the business circuits of travelling merchants in Europe in the first half of the 19th century, and studied mathematically by W.R. Hamilton and Thomas Kirkman, the problem was formulated in its general form in 1930, when it became one of the most intensively studied problems in optimisation. However, it was as late as the 1950s before another touchstone was reached with the seminal paper of George Dantzig, Delbert Ray Fulkerson and Selmer M. Johnson [1], expressing the problem as an integer linear program and developing the cutting plane method for its solution, using an example of 49 cities with a string model. Later, in 1976, Christofides [2] developed a heuristic with a guaranteed worst-case bound. Owing to the speed and simplicity of the algorithm, many hoped it would pave the way to a near-optimal solution method. Further advances followed in the 1990s from Applegate, Bixby, Chvátal, and Cook [3], and most recently Cook and others computed an optimal tour through an 85,900-city instance given by a microchip layout problem.

Part III

Methodology

5 Pre-Processing

5.1 Data

The data was provided by Satalia and consists of four different data sets which contain information for tracking both orders and items at every stage of the delivery procedure. Firstly, the DFS DEL HIST data set holds information about the delivery of products to the delivery branches from which they are distributed to their final destination. In particular, this data set bears the branch id, the location of the delivery branch and the expected date and time of delivery, along with the route and the van id through which the products will be transported to the branch. Secondly, the DFS DEL DET data set gives a description of the characteristics of the products to be delivered. More specifically, DFS DEL DET contains information such as the item's weight, volume, number of individual pieces, product category and a brief description. Moreover, it includes a notice of whether the delivery expert should expect a payment or an item on loan. The aforementioned data sets were merged using the inner join command of the dplyr package. However, working in R, it was deemed necessary to reduce the dimensions of both DFS DEL HIST and DFS DEL DET, due to memory storage restrictions and to the irrelevance of many columns to the given task. Thus, many covariates were dropped, giving rise to the DFS DEL data set. A description of the resulting data set is given below.

Table 1: Description of the DFS DET Data Set

Covariate Name            | Data Type  | Description
order number              | Integer    | Order id used as primary key
product type              | Factor     | Product category code
item volume               | Integer    | Volume of the item
Individual pieces         | Factor     | Number of pieces of a particular order
item weight               | Integer    | Weight of the item
Collection Item           | Logical    | Binary variable indicating whether there is an item on loan to be retrieved
delivery address postcode | String     | Postcode of the delivery address
delivery time             | Time stamp | The date for which the delivery is scheduled

Table 2: Most of the available information was not included in the final data set. Other covariates needed to be transformed to facilitate their use by machine learning algorithms.

Indeed, both Collection Item and delivery address postcode were transformed: the former into a binary variable, and the latter by keeping only its first three digits to facilitate clustering amongst observations. Moreover, it was deemed appropriate to treat the variables Individual pieces and product type as factors rather than integers, because their numeric values label categories.
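A minimal sketch of these pre-processing steps with dplyr is given below. The data frame and column names (dfs_del_hist, dfs_del_det, order_number, and so on) are illustrative assumptions, since the report does not list the exact identifiers.

```r
library(dplyr)

# Merge the two delivery data sets on the shared order id (inner join,
# as described above), keeping only rows present in both.
dfs_del <- inner_join(dfs_del_hist, dfs_del_det, by = "order_number")

# Keep only the covariates relevant to the task and recode the rest:
# Collection Item becomes a 0/1 flag, the postcode is truncated to its
# first three characters, and the count-like codes become factors
# because their numeric values label categories.
dfs_del <- dfs_del %>%
  select(order_number, product_type, item_volume, individual_pieces,
         item_weight, collection_item, delivery_address_postcode,
         delivery_time) %>%
  mutate(collection_item   = as.integer(collection_item),
         postcode_prefix   = substr(delivery_address_postcode, 1, 3),
         individual_pieces = as.factor(individual_pieces),
         product_type      = as.factor(product_type))
```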

Turning to the third data set, ApoloOrders holds data about the deliveries, such as the order id, the time of arrival and departure, the location of the delivery destination (given both as a postcode and as the longitude and latitude of the location), the date and time slot in which the delivery took place, and the time spent with the client. A key piece of information is found in this data set: the timeAtDoor covariate. Similarly, it was decided to join the data sets using the order id as primary key, keeping only the selected features in the final data set. Therefore, the aforementioned features were merged with DFS DET to compose the final data set. A short description of the resulting data set is given below.

Table 3: Description of the DFS DET Data Set

Covariate Name            | Data Type   | Description
order number              | Integer     | Order id used as primary key
product type              | Integer     | Used as a factor
item volume               | Integer     | Volume of the item
Individual pieces         | Factor      | Number of pieces of a particular order
item weight               | Integer     | Weight of the item
Collection Item           | Logical     | Binary variable indicating whether there is an item on loan to be retrieved
delivery address postcode | String      | Postcode of the delivery address
delivery time             | Time stamp  | The date for which the delivery is scheduled
lay                       | Real        | Latitude of the delivery location
lng                       | Real        | Longitude of the delivery location
slotype                   | Categorical | Slot in which the delivery took place
timeAtDoor                | Real        | Time the delivery expert spent with the client

Table 4: A lot of key information was found in AppoloOrderDetails, enriching the data and helping to incorporate spatial-temporal information into the data set.

By enriching the data, it becomes feasible to add more advanced modelling techniques, such as spatial models and Gaussian processes, to the analysis. Finally, HistMasterDrops describes the itinerary of each delivery vehicle, including the number and location of each stop along the route and information pertaining to the proportion of time spent actively during the journey. Although it was not considered necessary to merge this data set with the previous ones, an effort was made to find a sensible way of joining the two. In the absence of common columns that could serve as a primary key, it was decided to use the postcode and date columns to join the merged data set with HistMasterDrops.

5.2 Feature Engineering

Having obtained the final data set, it was decided that extra features should be engineered, firstly to further reduce the dimensions of the data and secondly to facilitate their inclusion in the models. Thus, the attribute TimDiff was created: the time interval between arrivalTime and departureTime. It is believed that this conversion helps incorporate time interval information into the model, something that would be infeasible using the original columns. Secondly, it was deemed necessary to convert the timeAtDoor column from seconds to minutes, to facilitate the assessment of model performance. Finally, it was decided to create a column with the first three or four digits of the postcode, in an attempt to cluster together observations lying adjacent in space for the spatial models.
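A sketch of this feature engineering is shown below, assuming the timestamp columns arrivalTime and departureTime are POSIXct and that the column names match those in Tables 1 and 3; the names are assumptions, not the project's exact code.

```r
library(dplyr)

final_data <- final_data %>%
  mutate(
    # Time interval between arrival and departure, in minutes.
    TimDiff = as.numeric(difftime(departureTime, arrivalTime,
                                  units = "mins")),
    # Convert the response from seconds to minutes so error metrics
    # read directly on the minute scale.
    timeAtDoor = timeAtDoor / 60,
    # First three characters of the postcode, used to group spatially
    # adjacent observations for the spatial models.
    postcode_prefix = substr(delivery_address_postcode, 1, 3)
  )
```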

6 Exploratory Data Analysis

The next stage of the analysis involved an exploratory analysis of the data. Although there is a plethora of covariates that intuitively influence the duration of a delivery, the analysis attempts to retrieve the key relationships in the data. In doing so, a number of graphical tools are used to convey the information. Once the strength of the relationships has been established, the statistically important features will be used as covariates in the machine learning algorithms. Additionally, spatial-temporal correlations will be taken into account through the inclusion of models whose parameters capture dependence amongst observations in the data set. To begin with, one of the features that most characterises an item is its product code. Subsequently, the analysis revolved around how timeAtDoor differs with respect to the product category. If the product category matters, then the mean delivery time of each product category should differ. Thus, to present tangible evidence, the mean estimated delivery time is plotted with respect to product type.
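Such a plot could be produced as follows; a ggplot2 sketch with assumed data frame and column names.

```r
library(dplyr)
library(ggplot2)

final_data %>%
  group_by(product_type) %>%
  summarise(mean_time = mean(timeAtDoor, na.rm = TRUE)) %>%
  ggplot(aes(x = product_type, y = mean_time)) +
  geom_point() +
  labs(x = "Product type", y = "Mean time at door (minutes)",
       title = "Mean Time per Product Type")
```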

Scatter plot of Mean Time per Product Type

Figure 1: The graph depicts a clear pattern in the early product codes; as the weight of products increases in the later product types, there is a break from the pattern.

As expected, the graph shows that similar products need approximately the same amount of time to be delivered. More importantly, major deviations appear between clusters of product categories. Indeed, it is natural for categories 2 and 3 to take a significantly smaller amount of time to be delivered compared to products such as sofas or mirrors. However, displaying the mean of each category is not enough to paint the larger picture. A more thorough treatment of the data is needed to let it fully speak for itself. Thus, to obtain a better picture of how the delivery time is distributed within each product category, a bar plot is produced.

Bar plot of Estimated Delivery Time Per Product Type

Figure 2: The weights of items found in each product category increase as the product code increases. Additionally, the presence of outliers diminishes in the later product types.

As can be observed, outliers are prevalent, especially in the earlier categories. It is of interest to discover the reason behind the differences in delivery time within product categories. At this point, the analysis turns to other covariates to explain the great divergence between delivery times. Initially, it is thought that, irrespective of the category to which items belong, the weight and the number of individual pieces play a significant part in the inflation of delivery time. This holds because heavier items intuitively require more time to be transported from the delivery vehicle's stopping location to the delivery address. Secondly, items comprised of a large number of individual pieces need to be assembled at the delivery location, adding time to the overall delivery. However, this turns out to be a more challenging task than it appeared, since delivery specialists often do not monitor the time, especially for items in certain categories. This results in incomplete information about the actual duration of the delivery.

Bar plot of Item Weights Per Product Type

Figure 3: The weights of items found in each product category increase as the product code increases. Additionally, the presence of outliers diminishes in the later product types.

The resulting plot seems to carry imperfect information for certain product categories related to heavier items. This disquieting habit amongst delivery experts has its impact on the models, whose performance will be hampered by the limitations of the collected data. Turning to the individual pieces, a bar plot shows the distribution of delivery time with respect to the number of individual pieces.

Bar plot of Delivery Time per Number of Individual Pieces

Figure 4: The graph shows that as the number of individual pieces increases, so does the estimated delivery time. Moreover, outliers corresponding to larger items are prevalent in the data.

In the first few categories the median is largely unchanged; after that there is an upward trend in its behaviour. Conversely, as the product type moves to larger item categories, delivery times become homogenised, causing the extinction of outliers. Turning to other factors that might influence the time spent with the client, the delivery time slot is considered. Again, it seems that the number of individual items influences the time spent with the client, since a stable increase of time is observed as the number of pieces increases, with a steeper increase for products comprised of 6 items. Another observation that needs to be stressed is the existence of outlying observations in the first categories, most likely connected with excessive weight.

Bar plot of Estimated Delivery Time With Respect to Item Collection

Figure 5: The plot shows no particular difference in delivery time across the different time slots. However, there are many outliers in the morning slot.

This may result from traffic and stopping restrictions on vehicles during rush hours, forcing the delivery vehicle to stop further away or to search for a suitable parking space that does not fall under parking restrictions. On the whole, however, large deviations in time are not observed in the bar plot, confirming initial expectations.

Bar plot of Mean of Estimated Time At Door per Number of Orders

Figure 6: The graph shows the mean of delivery time against the number of orders, sorted by date. As can be observed, as the number of orders increases, the fewer the chances are of having available data. On the contrary, it seems that most single orders concern larger items.

7 Statistical Modelling

To forecast the TAD (time at door), it was decided to employ Random Forests and deep learning neural network methods, due to their robustness and their capacity to model large-scale problems with complex relationships. Secondly, because the results from the exploratory data analysis corroborate that postcodes carry enough predictive power, multilevel modelling along with spatial-temporal models were considered as modelling options, using postcodes to cluster observations and investigate the spatial pattern they follow. Lastly, Gaussian processes were employed due to the existence of latitude and longitude for each order. Below, the main utilised algorithms are described.

7.1 Multilevel Modelling

Multilevel models implement linear or generalised linear regression models on clustered data [4]. This is achieved by allowing the intercept and coefficients to vary for each cluster in the data. In particular, multilevel models are comprised of levels of hierarchies. At each level there is a regression model which gives rise to the next hierarchy, with the last layer being the level where the observations lie. To achieve this, the model's coefficients contain fixed and random effects, the latter defined at a higher level than the model formula. A two-level model with a random intercept can be written as

$$y_{ij} = \beta_{0j} + \beta_1 x_{1ij} + \dots + \beta_n x_{nij} + \varepsilon_{ij}$$
$$\beta_{0j} = \gamma_{00} + u_{0j}$$

where $j$ indexes the clusters. The global (fixed) coefficients have similar properties to their counterparts in simple linear regression and are obtained by solving the same equations, whereas the random effects $u_{0j}$ are drawn from a Normal distribution. The random part discerns the different clusters in the data, giving a more flexible and expressive model. As for the assumptions [5] governing multilevel models, they are very similar to those of simple linear models.

Linearity: The assumption of linearity states that there is a rectilinear (straight-line, as opposed to non-linear or U-shaped) relationship between the variables.

Normality: The assumption of normality states that the error terms at every level of the model are normally distributed.

Homoscedasticity: The assumption of homoscedasticity assumes equality of the population variances.

Independence of observations: Independence is an assumption of general linear models, which states that cases are random samples from the population and that scores on the dependent variable are independent of each other. However, multilevel models usually deal with cases where the assumption of independence is violated. Thus, multilevel models alter the hypothesis of simple linear regression by assuming independence of the residuals within and between the different levels.

7.2 Neural Networks

Neural networks have vast capacity for modelling complex and large data, such as stock prices [6], [7]; most noticeably, neural nets excel at high-dimensional data [8]. Artificial neural nets are processing units which resemble the neuronal structure of the human cerebral cortex at a smaller scale. Neural networks are organised in layers, which in turn are comprised of interconnected nodes [9]. These layers can be categorised into input, hidden and output layers, where the input and output layers are concerned with receiving and presenting information, while the hidden layers are burdened with transforming the information from the input layer in order to identify patterns in the data. The artificial neural net learns to classify instances by being exposed to patterns linked with the categories found in the data set in question. In doing so, a neural network extends the notion of linear models by making the basis functions $\phi(x)$ and the weights $w$ depend on adjustable parameters [10]. This adjustment gives rise to the basic neural network. Then a learning rule, most frequently the delta method, is employed to adjust the connection weights $w$. This is achieved by implementing the gradient descent algorithm within the solution vector space, seeking a global minimum in order to minimise the error. In particular, to update the weights, the gradient descent algorithm is used:

$$w^{(k+1)} = w^{(k)} - \eta \, \nabla E(w^{(k)}) \qquad (1)$$

where $\eta$ is the learning rate, an indication of how small a step is taken in the direction of decrease of the error function

$$E(w) = \sum_{n=1}^{N} E_n(w) = \sum_{n=1}^{N} (y_n - t_n)^2 \qquad (2)$$

and $w$ is the weight attached to the connection.

This formula adjusts the weights towards the direction of greatest decrease of the error, with a small step to safeguard against the prospect of local minima. However, the gradient descent algorithm is not feasible to apply without the aid of back-propagation. For a single layer of weights the error is

$$E(w) = \sum_{n=1}^{N} \left( y_n - f(w^T x_n) \right)^2 \qquad (3)$$

where, in this particular context, the activation function is the logistic sigmoid

$$f(x) = \frac{1}{1 + e^{-w^T x}} \qquad (4)$$

Taking partial derivatives, the derivative of the activation satisfies

$$\frac{\partial f}{\partial u} = f(u)\,(1 - f(u)) = f(u)\,f(-u)$$

Due to the form of the activation function, each unit computes

$$a = f\!\left( \sum_i w_i x_i \right) \qquad (5)$$

and, using the chain rule,

$$\frac{\partial E_n}{\partial w_i} = \frac{\partial E_n}{\partial u} \frac{\partial u}{\partial w_i} = (y - t)\, y (1 - y)\, x_i \qquad (6)$$

Turning to deep learning networks [11], they work in the same fashion as neural networks; the difference lies in the number of hidden layers involved. In this particular context, the developed net is a feed-forward deep learning network with hidden layers of 10 nodes each (its full architecture is given in the Results).

7.3 Gaussian Processes

Gaussian processes can be seen as a generalisation of linear and polynomial models. In particular, instead of making assumptions about what kind of curve could fit the data, a less parametric approach is taken. This approach enables the data to be seen as incarnations of points coming from a multivariate Normal [12]. In this context, as in many others, the mean function is assumed to be 0, with points correlating to each other according to

$$k(x, x') = \sigma_f^2 \exp\left( -\frac{(x - x')^2}{2\ell^2} \right) \qquad (7)$$

where $\sigma_f^2$, the maximum allowable covariance, should be high for functions which cover a broad range on the y axis. Each observation $y$ can be thought of as related to an underlying function $f(x)$ through a Gaussian noise model:

$$y = f(x) + \varepsilon, \qquad \varepsilon \sim \mathcal{N}(0, \sigma_n^2) \qquad (8)$$

For simplicity of exposition, the noise is folded into $k(x, x')$ by writing

$$k(x, x') = \sigma_f^2 \exp\left( -\frac{(x - x')^2}{2\ell^2} \right) + \sigma_n^2 \, \delta(x, x') \qquad (9)$$

where $\delta(x, x')$ is the Kronecker delta function. Thus, given $n$ observations $y$, our objective is to predict $y_*$, not the actual $f_*$; their expected values are identical, but their variances differ owing to the observational noise process. Calculating the covariance matrices [13]:

$$K = \begin{pmatrix} k(x_1, x_1) & \cdots & k(x_1, x_n) \\ k(x_2, x_1) & \cdots & k(x_2, x_n) \\ \vdots & & \vdots \\ k(x_n, x_1) & \cdots & k(x_n, x_n) \end{pmatrix}, \qquad K_* = \big( k(x_*, x_1), \ldots, k(x_*, x_n) \big), \qquad K_{**} = k(x_*, x_*) \qquad (10)$$

Then, using the assumption that the data come from a multivariate Normal,

$$\begin{pmatrix} \mathbf{y} \\ y_* \end{pmatrix} \sim \mathcal{N}\left( \mathbf{0}, \begin{pmatrix} K & K_*^T \\ K_* & K_{**} \end{pmatrix} \right) \qquad (11)$$

where $T$ indicates matrix transposition. We are of course interested in the conditional probability $p(y_* \mid \mathbf{y})$: given the data, how likely is a certain prediction for $y_*$?

$$y_* \mid \mathbf{y} \sim \mathcal{N}\big( K_* K^{-1} \mathbf{y},\; K_{**} - K_* K^{-1} K_*^T \big)$$

with mean estimate and variance

$$\bar{y}_* = K_* K^{-1} \mathbf{y}, \qquad \operatorname{var}(y_*) = K_{**} - K_* K^{-1} K_*^T \quad [13]$$
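The posterior mean and variance above translate almost directly into R. Below is a self-contained toy sketch for a one-dimensional input with fixed, arbitrarily chosen hyperparameters; a practical implementation would use a Cholesky factorisation rather than a direct matrix inverse.

```r
# Squared-exponential kernel, equation (7): signal variance sf2,
# length-scale l.
se_kernel <- function(a, b, sf2 = 1, l = 1) {
  sf2 * exp(-outer(a, b, function(x, xs) (x - xs)^2) / (2 * l^2))
}

gp_predict <- function(x, y, x_star, sf2 = 1, l = 1, sn2 = 0.1) {
  K      <- se_kernel(x, x, sf2, l) + sn2 * diag(length(x))  # eq. (9)
  K_star <- se_kernel(x_star, x, sf2, l)                     # K_*
  K_ss   <- se_kernel(x_star, x_star, sf2, l)                # K_**
  K_inv  <- solve(K)
  mean_star <- K_star %*% K_inv %*% y                 # K_* K^{-1} y
  var_star  <- K_ss - K_star %*% K_inv %*% t(K_star)  # K_** - K_* K^{-1} K_*^T
  list(mean = as.vector(mean_star), var = diag(var_star))
}

# Toy usage: noisy observations of a sine function.
x   <- seq(0, 2 * pi, length.out = 20)
y   <- sin(x) + rnorm(20, sd = 0.2)
fit <- gp_predict(x, y, x_star = seq(0, 2 * pi, length.out = 100))
```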

7.4 Decision Trees

Decision trees belong to the category of supervised algorithms employed to model both categorical and continuous data. In particular, a decision tree is a flowchart-like tree in which an instance passes through successive tests at each internal node [14]. Branches emerge denoting the outcome of each test, leading either to further testing or to a leaf node which holds the class label of the instance. Given a training instance, it goes through a pattern of questions by which it is decided to which category it should be assigned. However, a common problem is how to organise the sequence of questions. Therefore, a number of splitting criteria have been developed to ensure the homogeneity of the emerging subclasses. Suppose a list of attributes of a data set $A = (S_1, S_2, \ldots, S_n)$. The split is performed on the most informative attribute according to one of the following criteria.

7.4.1 Chi-Square

The chi-square criterion finds the statistical significance of the differences between sub-nodes and the parent node. It is measured as the sum of squared standardised differences between the observed and expected frequencies of the target variable. The criterion works for categorical target variables; the higher its value, the higher the statistical significance of the differences between the sub-node and the parent node.

$$\chi^2 = \sum \frac{(\text{Actual} - \text{Expected})^2}{\text{Expected}} \qquad (12)$$

7.5 Information Gain

Although the concept originated in physics, entropy is a measure of the degree of disorganisation in a system. It allows pure and less pure nodes to be discerned, so that the split is taken according to which feature requires less information to be described. For a binary node, entropy can be calculated using the formula

$$\text{Entropy} = -p \log_2 p - q \log_2 q \qquad (13)$$

where $p$ and $q$ are the probabilities of success and failure respectively in that node. Entropy is used with categorical target variables; the split chosen is the one with the lowest entropy compared to the parent node and the other candidate splits. The lower the entropy, the better the split, and the information gain of a split is the reduction in entropy it achieves relative to the parent node.

7.5.1 Other Splitting Criteria

The Gini index says that if we select two items from a population at random, then they must be of the same class, and the probability of this is 1 if the population is pure. It works with categorical target variables (Success or Failure) and performs only binary splits; the higher the value of Gini, the higher the homogeneity. CART (Classification and Regression Trees) uses the Gini method to create binary splits. The gain ratio criterion is a normalising version of information gain, which otherwise tends to favour tests with many outcomes; it represents the potential information of a split by dividing the training instances into n partitions, one per outcome. After the criterion is selected, the data set is split on all the distinct values of the chosen attribute $S_x$. Subsequently, another splitting attribute is selected and evaluated based on the frequencies of the distinct values. This procedure continues until no attribute is left.

7.5.2 Ensemble Methods

Despite the nice properties of decision trees, the algorithm generally suffers from high variance, even in its more complex incarnations. Thus, the aid of ensemble methods is enlisted to reduce the variance of predictions. Bagging builds several weak learners on different sub-samples of the same data. A random forest is an implementation of the bagging paradigm which grows a number of weak learners whose final predictions are combined using a mean/median or majority voting approach. Turning to boosting: similarly to other ensemble methods, it employs weak learners to create stronger rules. To create such rules, a weak learner must first be defined; this is achieved by applying loose learning rules under different data distributions. Gradient boosted trees encapsulate the boosting paradigm. Typically, a gradient boosted tree model grows a number of decision trees whose predictions are summed; each subsequent decision tree attempts to minimise the observed error by fitting the residuals between the target function and the current ensemble.

Part IV

Results

7.6 Predicting Estimated Delivery Time

At this point, having identified the important relationships in the data set, the analysis turned to predicting delivery time employing a number of machine learning algorithms. In this chapter, models are developed with two objectives: first, to minimise error as portrayed through the mean square error, mean absolute error and root mean square error; and second, to predict 70% of orders' delivery times within a three-minute error window. Initially, the analysis turned to the most ubiquitous algorithm for this task: regression. Indeed, regression is routinely utilised for forecasting and prediction tasks, leading to the belief that this course of action would be a good first step, since it is simple and could potentially be used as a benchmark for more advanced methods. Therefore, a simple linear regression model is developed; its fitting and evaluation are outlined below.
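The report does not reproduce the exact fitting code; the sketch below is one way the baseline and the error metrics used throughout this chapter could be computed, with the formula and column names assumed from Tables 1 and 3. The train/test split is reused by the later sketches.

```r
# Hold out 20% of the data for evaluation.
set.seed(1)
idx   <- sample(seq_len(nrow(final_data)), size = 0.8 * nrow(final_data))
train <- final_data[idx, ]
test  <- final_data[-idx, ]

# Baseline: ordinary least squares on the engineered covariates.
fit_lm <- lm(timeAtDoor ~ item_weight + product_type +
               individual_pieces + collection_item,
             data = train)

pred <- predict(fit_lm, newdata = test)

# Error metrics used throughout this chapter.
rmse   <- sqrt(mean((test$timeAtDoor - pred)^2))
mae    <- mean(abs(test$timeAtDoor - pred))
within <- mean(abs(test$timeAtDoor - pred) <= 3)  # share within 3 minutes
```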

Line Plot of Predicted against Actual Time at Door

Figure 8: The plot shows that the predicted values follow the actual line, indicating that in most cases the two values are close. However, another feature that captures the attention is the occasional large estimated delivery time.

Turning to the performance of the model, the aforementioned linear regression model succeeded in predicting 68.76% of observations within a three-minute time window, with an RMSE of 7.03. However, it is becoming more popular to exploit the spatial structure of the observations to enhance the performance of a model. Indeed, spatial information such as address postcodes can capture hidden relationships in the data [15]. This information might add another piece to the puzzle, since it allows observations to be clustered according to spatial information (postcodes). This is of importance because observations lying proximate on the map might have similar durations and be affected differently by the covariates.
Table 5: Output of the Multilevel Model - Random Effects

Name                 | Variance | Std. Dev.
new PCod (Intercept) | 26328    | 162.3
Residual             | 161291   | 401.6

Table 6: Output of the Multilevel Model - Fixed Effects

Covariates        | Estimate  | Std. Error | t value
(Intercept)       | 1463.09   | 21.60996   | 67.70
item weight       | -0.26362  | 0.04454    | -5.92
product type      | 4.62219   | 0.15899    | 29.07
individual pieces | -36.98786 | 2.79370    | -13.24
collection item   | -94.08842 | 9.84325    | -9.56

Thus, a multilevel model is employed to develop a different regression model for each cluster of observations by letting the intercept vary across postcode clusters. Below is a description of the model and its hierarchical structure:

$$y_{ij} = \beta_{0j} + \beta_1 x_{\text{item weight}} + \beta_2 x_{\text{product type}} + \beta_3 x_{\text{individual pieces}} + \beta_4 x_{\text{collection item}} + \varepsilon_{ij}$$
$$\beta_{0j} = \gamma_{00} + u_{0j}$$
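Assuming the lme4 package and the truncated-postcode grouping factor reported as new PCod in Table 5 (written new_PCod below), the model could be fitted as follows; a sketch, not the project's exact code.

```r
library(lme4)

# Random intercept per postcode cluster; fixed effects as in Table 6.
fit_mlm <- lmer(timeAtDoor ~ item_weight + product_type +
                  individual_pieces + collection_item +
                  (1 | new_PCod),
                data = train)

summary(fit_mlm)  # fixed and random effect estimates

# Postcodes unseen in training fall back on the population intercept.
pred_mlm <- predict(fit_mlm, newdata = test, allow.new.levels = TRUE)
```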

Having identified the model parameters, focus shifts to the performance of the model. The model predicted 72.51% of the observations within a three-minute time interval, with root mean square and mean absolute errors of 6.13 and 3.72 respectively. Below, a plot of both predicted and actual values is depicted for graphical inspection.

Line Plot of Predicted against Actual Time at Door

Figure 9: The plot shows that the predicted values follow the actual line, indicating that in most cases the two values are close. However, another feature that captures the attention is the occasional large estimated delivery time.

From the plot, it is apparent that there is an alarming pattern in the predicted durations of delivery. Turning to another popular method, Support Vector Machines are employed to predict the delivery time. In this particular case, initial expectations were verified: the method does not perform as well as the others, due to the complexity of the data, without a smart choice of kernel. Support Vector Machines managed to predict 69% of observations in the desired interval, with an inflated mean absolute error of 7.20 and a root mean square error of 4.24.
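For reference, an SVM regression of this kind could be fitted with the e1071 package; the radial kernel shown is the package default, not necessarily the kernel actually used in the project.

```r
library(e1071)

# eps-regression with a radial basis kernel (the e1071 default),
# reusing the train/test split from the regression sketch.
fit_svm <- svm(timeAtDoor ~ item_weight + product_type +
                 individual_pieces + collection_item,
               data = train, kernel = "radial")

pred_svm <- predict(fit_svm, newdata = test)
```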

Line Plot of Predicted against Actual Time at Door

Figure 10: The plot shows that even though in most cases delivery time is predicted close to the actual value, there is a systematic pattern of inflating the estimated delivery time.

7.6.1 Decision Trees

Leaving the subpar performance of Support Vector Machines behind, it is time to attempt to predict delivery time utilising more advanced methods: decision tree algorithms. However, due to some limitations, it is considered more beneficial to employ ensemble methods built on decision trees, such as Random Forests and Gradient Boosted Trees. Turning first to Random Forests, the algorithm succeeds in predicting 70% of observations within a 3-minute time interval, and the performance is reasonable in terms of mean square error, mean absolute error and root mean square error. Below, a plot illustrating the performance of the Random Forest can be found.

Figure 11: The plot shows that the predicted values follow the actual line, indicating that in most cases the two values are close. However, another feature that captures the attention is the occasional large estimated delivery time.

Furthermore, the rf package used to develop the random forest supports graphical assessment of the importance of the variables in the model. Below, a plot depicting the key variables in the model is shown.
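A sketch of this fit with the randomForest package follows (assuming that is the implementation behind the "rf" package mentioned above, and reusing the assumed column names and train/test split).

```r
library(randomForest)

# Bagged ensemble of 500 trees (the package default), with variable
# importance recorded for plotting.
fit_rf <- randomForest(timeAtDoor ~ item_weight + product_type +
                         individual_pieces + collection_item,
                       data = train, ntree = 500, importance = TRUE)

pred_rf <- predict(fit_rf, newdata = test)

# Graphical assessment of variable importance, as in Figure 12.
varImpPlot(fit_rf)
```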

Importance Plot of Variables in the Random Forest

Figure 12: The plots show that collection item and product type play the most significant part in estimating the duration of delivery, followed by item weight and the number of individual pieces of which an order is comprised.

From the plot, it is concluded that the variable importances are somewhat counter-intuitive compared to what was expected based on the exploratory data analysis. As for the Gradient Boosted Trees, the performance is similar to the Random Forest, with a slight increase to 73% in the percentage of correctly predicted observations within a 3-minute error. In particular, the mean absolute error and root mean square error are estimated at 3.73 and 6.22 respectively.
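Gradient boosted trees could be fitted, for example, with the gbm package; the hyperparameters below are illustrative, as the report does not state the ones used.

```r
library(gbm)

# Boosted regression trees: each tree fits the residuals of the
# current ensemble, and the predictions are summed.
fit_gbm <- gbm(timeAtDoor ~ item_weight + product_type +
                 individual_pieces + collection_item,
               data = train, distribution = "gaussian",
               n.trees = 1000, interaction.depth = 3, shrinkage = 0.05)

pred_gbm <- predict(fit_gbm, newdata = test, n.trees = 1000)
```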

7.6.2 Deep Neural Networks

Finally, no effort would be complete without trying to develop a deep learning network, given its capacity for dealing with complex and large data such as stock prices [6], [7]; most noticeably, neural nets excel at high-dimensional data [8]. In particular, a feed-forward deep learning network was developed with 3 hidden layers comprised of 10 nodes each, trained for 500 epochs, which succeeded in correctly predicting 77% of observations within a three-minute error. More specifically, the mean absolute error is 2.75 and the root mean square error decreased to 3.012 compared to the decision trees.
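The report does not name the framework used; as one concrete possibility, the architecture described (three hidden layers of 10 units, 500 epochs) can be expressed with the h2o package, as sketched below with the assumed column names.

```r
library(h2o)
h2o.init()

train_h2o  <- as.h2o(train)
predictors <- c("item_weight", "product_type",
                "individual_pieces", "collection_item")

# Feed-forward network: three hidden layers of 10 units, 500 epochs.
fit_dl <- h2o.deeplearning(x = predictors, y = "timeAtDoor",
                           training_frame = train_h2o,
                           hidden = c(10, 10, 10), epochs = 500)

pred_dl <- h2o.predict(fit_dl, as.h2o(test))
```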
Line Plot of Predicted against Actual Time at Door

Figure 13: The plot shows that the predicted values follow the actual line, indicating that in most cases the two values are close. However, another feature that captures the attention is the occasional large estimated delivery time.

It is apparent that the rewarding results certainly justified the time and resources needed to develop a deep learning approach for this problem.

7.6.3 Gaussian Processes

Although the industry is moving towards deep learning for the reasons discussed above, it is not the only option available. Gaussian processes have found particular application in modelling complex systems [16], including those with spatial-temporal information [17]. Thus, a Gaussian process model is employed to predict delivery time using both item characteristics and spatial information. In contrast to the other methods, it was only feasible to implement it on Amazon Web Services, since its computational complexity outweighed the local computer's resources. As for its performance, the algorithm put those previously discussed into the shade.
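One way to fit such a model in R is kernlab's gausspr with a radial basis kernel over both the item covariates and the coordinates (lay, lng from Table 3); this is a sketch under that assumption, not the project's actual code. Note that the kernel matrix inversion scales as O(n^3), which is consistent with the need for cloud resources on a data set of this size.

```r
library(kernlab)

# GP regression with a radial basis (squared-exponential) kernel over
# item characteristics and the delivery coordinates.
fit_gp <- gausspr(timeAtDoor ~ item_weight + product_type +
                    individual_pieces + collection_item + lay + lng,
                  data = train, kernel = "rbfdot")

pred_gp <- predict(fit_gp, newdata = test)
```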

Scatter Plot of Predicted against Actual Time at Door

Figure 14: The scatter plot shows the predicted against the actual time at door for the Gaussian process model.

7.7 Recommendation

No report would be complete without providing a list of recommendations that have emerged from scrutiny of the project. It is hoped that this piece of work will provide a useful starting point for future workers.

- The data sets DFS DET DEL, DFS DET HIST and AppoloOrdersDetails provide useful insight into the problem. DFS DET DEL and DFS DET HIST should be combined using an inner join, and the resulting data set should be merged with AppoloOrdersDetails using a left join.

- The usefulness of the HistMasterDrops data set has been limited by the lack of a primary key. It is proposed to join it using a combination of existing columns, such as the postcode and date columns. It is recommended not to pay particular attention to this data set when no external data is considered.

- The exploratory data analysis revealed that individual pieces, item weight and product type play the most important roles in predicting time at door.

- Deep learning, Gaussian processes and Random Forests were amongst the machine learning algorithms that bore the most fruitful results. Future workers should pay more attention to optimising the performance of those models, or start by considering those techniques first.

Part V

Discussion
In an attempt to rethink the routing optimisation problem as part of this project, a plethora of machine learning algorithms were employed to predict time at door. In particular, the analysis showed that although a number of factors influence the time spent with the client, the type of product, whether there is an item for collection, and the number of individual pieces comprising an item play the most significant part in estimating time at door. Secondly, all the machine learning algorithms utilised performed comparably well, with the percentage of estimated times lying within the three-minute error window reaching almost 74%. However, Gaussian processes and deep learning performed significantly better when the process was repeated on cloud computing platforms. This is the reason behind the industry's shift towards computationally intensive methods such as deep learning and, in particular, Gaussian processes. On the whole, machine learning algorithms successfully tackled the problem of predicting the time that a delivery expert needs to stay with the client, with the majority of observations predicted within a three-minute interval of the correct time. Most importantly, this project stands on the same side as the few which endeavour to bring operational research and data science closer together under the umbrella of business intelligence. However, even though the developed models bore tangible results, in the future a primary key could be constructed to facilitate merging, or a combination of existing columns could be used. In addition, it was not feasible to enrich the data with external resources. In summary, the present study enlisted the aid of machine learning algorithms to predict time at door. In doing so, the analysis bore tangible results, qualifying state-of-the-art algorithms such as Gaussian processes and deep learning for tackling this task.

References

[1] V. Chvatal, W. Cook, G. B. Dantzig, D. R. Fulkerson, and S. M. Johnson, "Solution of a large-scale traveling-salesman problem," 50 Years of Integer Programming 1958-2008, pp. 7-28, 2010.

[2] N. Christofides, "Worst-case analysis of a new heuristic for the travelling salesman problem," tech. rep., Carnegie-Mellon Univ., Pittsburgh, PA, Management Sciences Research Group, 1976.

[3] D. L. Applegate, R. E. Bixby, V. Chvatal, and W. J. Cook, The Traveling Salesman Problem: A Computational Study. Princeton University Press, 2011.

[4] M. Kuhn and K. Johnson, Applied Predictive Modeling, vol. 810. Springer, 2013.

[5] J. J. Faraway, Extending the Linear Model with R: Generalized Linear, Mixed Effects and Nonparametric Regression Models, vol. 124. CRC Press, 2016.

[6] B. Mandelbrot, "Forecasts of future prices, unbiased markets, and martingale models," The Journal of Business, vol. 39, no. 1, pp. 242-255, 1966.

[7] B. Alipanahi, A. Delong, M. T. Weirauch, and B. J. Frey, "Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning," Nature Biotechnology, vol. 33, no. 8, pp. 831-838, 2015.

[8] I. Arel, D. C. Rose, and R. Coop, "DeSTIN: A scalable deep learning architecture with application to high-dimensional robust pattern recognition," in AAAI Fall Symposium: Biologically Inspired Cognitive Architectures, 2009.

[9] P. Flach, Machine Learning: The Art and Science of Algorithms that Make Sense of Data. Cambridge University Press, 2012.

[10] C. M. Bishop, Pattern Recognition and Machine Learning. Springer, 2006.

[11] Y. LeCun, Y. Bengio, and G. Hinton, "Deep learning," Nature, vol. 521, no. 7553, pp. 436-444, 2015.

[12] D. Koller and N. Friedman, Probabilistic Graphical Models: Principles and Techniques. MIT Press, 2009.

[13] M. Ebden et al., "Gaussian processes for regression: A quick introduction," The Website of the Robotics Research Group, Department of Engineering Science, University of Oxford, 2008.

[14] L. Rokach and O. Maimon, Data Mining with Decision Trees: Theory and Applications. World Scientific, 2014.

[15] K. E. Pickett and M. Pearl, "Multilevel analyses of neighbourhood socioeconomic context and health outcomes: a critical review," Journal of Epidemiology & Community Health, vol. 55, no. 2, pp. 111-122, 2001.

[16] N. Chen, Z. Qian, I. T. Nabney, and X. Meng, "Wind power forecasts using Gaussian processes and numerical weather prediction," IEEE Transactions on Power Systems, vol. 29, no. 2, pp. 656-665, 2014.

[17] Y. Xie, K. Zhao, Y. Sun, and D. Chen, "Gaussian processes for short-term traffic volume forecasting," Transportation Research Record: Journal of the Transportation Research Board, no. 2165, pp. 69-78, 2010.
