
Dissertation- Predicting Time at Door

Konstantinos-Michail Mylonas

September 7, 2017

Contents

I Abstract

II Introduction

1 Purpose
2 Background
3 Acknowledgements
4 Literature Overview

III Methodology

5 Pre-Processing
  5.1 Data
  5.2 Feature Engineering
6 Exploratory Data Analysis
7 Statistical Modelling
  7.1 Multilevel Modelling
  7.2 Neural Networks
  7.3 Gaussian Processes
  7.4 Decision Trees
    7.4.1 Chi-Square
  7.5 Information Gain
    7.5.1 Other Splitting Criteria
    7.5.2 Ensemble Methods

IV Results

  7.6 Predicting Estimated Delivery Time
    7.6.1 Decision Trees
    7.6.2 Deep Neural Networks
    7.6.3 Gaussian Processes
  7.7 Recommendation

V Discussion

List of Figures

1 The graph depicts a clear pattern in the early product codes; as the weight of products increases in the later product types, there is a break from the pattern.
2 The weights of items found in each product category increase as the product code increases; the presence of outliers diminishes in the later product types.
3 The weights of items found in each product category increase as the product code increases; the presence of outliers diminishes in the later product types.
4 As the number of individual pieces increases, so does the estimated delivery time; outliers corresponding to larger items are prevalent in the data.
5 No particular difference in delivery time across the different time slots; however, there are many outliers in the morning slot.
6 Mean delivery time against the number of orders, sorted by date; as the number of orders increases, the fewer the chances of having available data, while most single orders concern larger items.
7 The predicted values follow the actual line, indicating that in most cases the two values are close; the occasional large estimated delivery time also captures the attention.
8 The predicted values follow the actual line, indicating that in most cases the two values are close; the occasional large estimated delivery time also captures the attention.
9 Even though in most cases delivery time is predicted close to the actual value, there is a systematic pattern of inflating the estimated delivery time.
10 The predicted values follow the actual line, indicating that in most cases the two values are close; the occasional large estimated delivery time also captures the attention.
11 Collection item and product type play the most significant part in estimating the duration of delivery, followed by item weight and the number of individual pieces of which an order is comprised.
12 The predicted values follow the actual line, indicating that in most cases the two values are close; the occasional large estimated delivery time also captures the attention.
13 Scatter plot of predicted against actual time at door for the Gaussian process model.
List of Tables

1 Description of the DFS DET data set.
2 Most of the available information was not included in the final data set; other covariates needed to be transformed to facilitate their use by machine learning algorithms.
3 Description of the DFS DET data set.
4 Key information found in AppoloOrderDetails enriches the data, helping to incorporate spatial-temporal information into the data set.
5 Output of the multilevel model: random effects.
6 Output of the multilevel model: fixed effects.

Part I

Abstract
This article considers the application of machine learning algorithms to optimising routing schedules in the delivery industry by tackling a fundamental problem: accurately predicting the time needed to deliver an order. Under the current operational research algorithms, delivery time is predicted in an ad hoc manner without utilising past data on the process. By contrast, the present research uses past data to estimate delivery time, in an attempt to ensure a better customer experience. In this report, the results from several machine learning algorithms are compared. Although the results initially met the project's goal of predicting 60% of orders within a three-minute time window with reasonable error metrics, hardware limitations constrained the potential of the machine learning algorithms.

Part II

Introduction
Operational research is a science dedicated to applying advanced analytical methods to support better decisions in real-world problems. One of its main areas of concern is route optimisation: a family of problems which attempts to find the optimal set of routes for a fleet of vehicles to traverse in order to deliver to a given set of customers. Its origins lie in the 19th century, when Hamilton first formulated the problem mathematically. However, like all branches of operational research, it came into prominence during World War 2 and the subsequent decades. At that time, experts relied on advanced mathematical tools such as combinatorics and integer programming to solve their problems. As time moved forward, new tools either emerged or found application in this particular set of problems. Most recently, efforts have been made to apply data science methods to route planning. In more detail, a key parameter in this problem is the time that delivery services need to finish a delivery. A novel approach uses past data on orders to estimate the time which delivery experts spend with the client, applying machine learning algorithms to find patterns between delivery time and available covariates, such as item characteristics and the spatial information hidden in the delivery address.

1 Purpose

The purpose of this project is to build predictive models that estimate delivery time and to compare the developed models in order to prescribe the best course of action for estimating time at door. Additionally, the project explores the appropriateness of external data resources, such as Google Maps, and how they might be exploited to enrich more traditional approaches.

2 Background

In today's world, globalisation has succeeded in bringing the world together through trade. However, a far less known aspect of it is the struggle to move products in the most efficient manner. Indeed, as the volume of products increases, it becomes more vital to rely on scientific ways of optimising route planning, avoiding delays in the execution plan or missed opportunities to deliver items. In this particular context, a retail company sought the aid of the data science company Satalia, providing data from every stage of the delivery process in order to estimate the delivery time of orders.

3 Acknowledgements

I would like to express my deepest appreciation to my supervisor Professor Chris Nemeth, whose selfless time and care were sometimes all that helped me endure; to Mrs Vega and Satalia for trusting me with this project and providing ample guidance; and to Dr Simon Tomlinson for bringing all these opportunities to Lancaster University. Finally, a thank you to all my professors for instilling in me the pursuit of excellence under the most dire circumstances.

4 Literature Overview

The travelling salesman problem (TSP) is a great chapter in operational research, asking the following question: given a list of cities and the distances between each pair of cities, what is the shortest possible route that visits each city exactly once and returns to the origin city? Expressed more formally: given a length L, the task is to decide whether the graph has any tour shorter than L. Having arisen from the business circuits of travelling merchants in Europe in the first half of the 19th century, and studied mathematically by W.R. Hamilton and Thomas Kirkman, the problem was formulated in its general form in 1930, when it became one of the most intensively studied problems in optimisation. However, it was as late as the 1950s before another touchstone was reached with the seminal paper of George Dantzig, Delbert Ray Fulkerson and Selmer M. Johnson [1], expressing the problem as an integer linear program and developing the cutting plane method for its solution, using an example of 49 cities with a string model. Later, in 1976, Christofides [2] developed a heuristic with a guaranteed worst-case bound. Owing to the speed and simplicity of the algorithm, many hoped it would pave the way to a near-optimal solution method. Further advances followed in the 1990s from Applegate, Bixby, Chvátal, and Cook [3], and most recently Cook and others computed an optimal tour through an 85,900-city instance given by a microchip layout problem.

Part III

Methodology

5 Pre-Processing

5.1 Data

The data was provided by Satalia and consists of four different data sets which contain information for tracking both orders and items at every stage of the delivery procedure. Firstly, the DFS DEL HIST data set holds information about the delivery of products to the delivery branches from which they are distributed to their final destination. In particular, this data set bears the branch id, the location of the delivery branch and the expected date and time of delivery, along with the route and the van id through which the products will be transported to the branch. Secondly, the DFS DEL DET data set gives a description of the characteristics of the products to be delivered. More specifically, DFS DEL DET contains information such as the item's weight, volume, number of individual pieces, product category and a brief description. Moreover, it includes a notice of whether the delivery expert should expect a payment or an item on loan. The aforementioned data sets were merged using the inner join command of the dplyr package. However, working in R, it was deemed necessary to reduce the dimensions of both DFS DEL HIST and DFS DEL DET, due to memory storage restrictions and to the irrelevance of many columns to the given task. Thus, many covariates were dropped, giving rise to the DFS DEL data set. A description of the resulting data set is given below.

Table 1: Description of the DFS DET Data Set

Covariate Name            | Data Type  | Description
order number              | Integer    | Order id used as primary key
product type              | Factor     | Product category code
item volume               | Integer    | Volume of the item
Individual pieces         | Factor     | Number of pieces of a particular order
item weight               | Integer    | Weight of the item
Collection Item           | Logical    | Binary variable indicating whether there is an item on loan to be retrieved
delivery address postcode | String     | Postcode of the delivery address
delivery time             | Time stamp | The date for which the delivery is scheduled

Table 2: Most of the available information was not included in the final data set. Other covariates needed to be transformed to facilitate their use by machine learning algorithms.

Indeed, both Collection Item and delivery address postcode were transformed: the former into a binary variable, and the latter by keeping only its first three digits to facilitate clustering amongst observations. Moreover, it was deemed appropriate to treat the variables Individual pieces and product type as factors rather than integers, because their numeric values label categories.
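A minimal sketch of these pre-processing steps with dplyr is given below. The data frame and column names (dfs_del_hist, dfs_del_det, order_number, and so on) are illustrative assumptions, since the report does not list the exact identifiers.

```r
library(dplyr)

# Merge the two delivery data sets on the shared order id (inner join,
# as described above), keeping only rows present in both.
dfs_del <- inner_join(dfs_del_hist, dfs_del_det, by = "order_number")

# Keep only the covariates relevant to the task and recode the rest:
# Collection Item becomes a 0/1 flag, the postcode is truncated to its
# first three characters, and the count-like codes become factors
# because their numeric values label categories.
dfs_del <- dfs_del %>%
  select(order_number, product_type, item_volume, individual_pieces,
         item_weight, collection_item, delivery_address_postcode,
         delivery_time) %>%
  mutate(collection_item   = as.integer(collection_item),
         postcode_prefix   = substr(delivery_address_postcode, 1, 3),
         individual_pieces = as.factor(individual_pieces),
         product_type      = as.factor(product_type))
```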

Turning to the third data set, ApoloOrders holds data about the deliveries, such as the order id, the time of arrival and departure, the location of the delivery destination (given both as a postcode and as the longitude and latitude of the location), the date and time slot in which the delivery took place, and the time spent with the client. A key piece of information is found in this data set: the timeAtDoor covariate. Similarly, it was decided to join the data sets using the order id as primary key, keeping only the selected features in the final data set. Therefore, the aforementioned features were merged with DFS DET to compose the final data set. A short description of the resulting data set is given below.

Table 3: Description of the DFS DET Data Set

Covariate Name            | Data Type   | Description
order number              | Integer     | Order id used as primary key
product type              | Integer     | Used as a factor
item volume               | Integer     | Volume of the item
Individual pieces         | Factor      | Number of pieces of a particular order
item weight               | Integer     | Weight of the item
Collection Item           | Logical     | Binary variable indicating whether there is an item on loan to be retrieved
delivery address postcode | String      | Postcode of the delivery address
delivery time             | Time stamp  | The date for which the delivery is scheduled
lay                       | Real        | Latitude of the delivery location
lng                       | Real        | Longitude of the delivery location
slotype                   | Categorical | Slot in which the delivery took place
timeAtDoor                | Real        | Time the delivery expert spent with the client

Table 4: A lot of key information was found in AppoloOrderDetails, enriching the data and helping to incorporate spatial-temporal information into the data set.

By enriching the data, it becomes feasible to add more advanced modelling techniques, such as spatial models and Gaussian processes, to the analysis. Finally, HistMasterDrops describes the itinerary of each delivery vehicle, including the number and location of each stop along the route and information pertaining to the proportion of time spent actively during the journey. Although it was not considered necessary to merge this data set with the previous ones, an effort was made to find a sensible way of joining the two. In the absence of common columns that could serve as a primary key, it was decided to use the postcode and date columns to join the merged data set with HistMasterDrops.

5.2 Feature Engineering

Having obtained the final data set, it was decided that extra features should be engineered, firstly to further reduce the dimensions of the data and secondly to facilitate their inclusion in the models. Thus, the attribute TimDiff was created: the time interval between arrivalTime and departureTime. It is believed that this conversion helps incorporate time interval information into the model, something that would be infeasible using the original columns. Secondly, it was deemed necessary to convert the timeAtDoor column from seconds to minutes, to facilitate the assessment of model performance. Finally, it was decided to create a column with the first three or four digits of the postcode, in an attempt to cluster together observations lying adjacent in space for the spatial models.
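A sketch of this feature engineering is shown below, assuming the timestamp columns arrivalTime and departureTime are POSIXct and that the column names match those in Tables 1 and 3; the names are assumptions, not the project's exact code.

```r
library(dplyr)

final_data <- final_data %>%
  mutate(
    # Time interval between arrival and departure, in minutes.
    TimDiff = as.numeric(difftime(departureTime, arrivalTime,
                                  units = "mins")),
    # Convert the response from seconds to minutes so error metrics
    # read directly on the minute scale.
    timeAtDoor = timeAtDoor / 60,
    # First three characters of the postcode, used to group spatially
    # adjacent observations for the spatial models.
    postcode_prefix = substr(delivery_address_postcode, 1, 3)
  )
```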

6 Exploratory Data Analysis

The next stage of the analysis involved an exploratory analysis of the data. Although there is a plethora of covariates that intuitively influence the duration of a delivery, the analysis attempts to retrieve the key relationships in the data. In doing so, a number of graphical tools are used to convey the information. Once the strength of the relationships has been established, the statistically important features will be used as covariates in the machine learning algorithms. Additionally, spatial-temporal correlations will be taken into account through the inclusion of models whose parameters capture dependence amongst observations in the data set. To begin with, one of the features that most characterises an item is its product code. Subsequently, the analysis revolved around how timeAtDoor differs with respect to the product category. If the product category matters, then the mean delivery time of each product category should differ. Thus, to present tangible evidence, the mean estimated delivery time is plotted with respect to product type.
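Such a plot could be produced as follows; a ggplot2 sketch with assumed data frame and column names.

```r
library(dplyr)
library(ggplot2)

final_data %>%
  group_by(product_type) %>%
  summarise(mean_time = mean(timeAtDoor, na.rm = TRUE)) %>%
  ggplot(aes(x = product_type, y = mean_time)) +
  geom_point() +
  labs(x = "Product type", y = "Mean time at door (minutes)",
       title = "Mean Time per Product Type")
```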

Scatter plot of Mean Time per Product Type

Figure 1: The graph depicts a clear pattern in the early product codes; as the weight of products increases in the later product types, there is a break from the pattern.

As expected, the graph shows that similar products need approximately the same amount of time to be delivered. More importantly, major deviations appear between clusters of product categories. Indeed, it is natural for categories 2 and 3 to take a significantly smaller amount of time to be delivered compared to products such as sofas or mirrors. However, displaying the mean of each category is not enough to paint the larger picture. A more thorough treatment of the data is needed to let it fully speak for itself. Thus, to obtain a better picture of how the delivery time is distributed within each product category, a bar plot is produced.

Bar plot of Estimated Delivery Time Per Product Type

Figure 2: The weights of items found in each product category increase as the product code increases. Additionally, the presence of outliers diminishes in the later product types.

As can be observed, outliers are prevalent, especially in the earlier categories. It is of interest to discover the reason behind the differences in delivery time within product categories. At this point, the analysis turns to other covariates to explain the great divergence between delivery times. Initially, it is thought that, irrespective of the category to which items belong, the weight and the number of individual pieces play a significant part in the inflation of delivery time. This holds because heavier items intuitively require more time to be transported from the delivery vehicle's stopping location to the delivery address. Secondly, items comprised of a large number of individual pieces need to be assembled at the delivery location, adding time to the overall delivery. However, this turns out to be a more challenging task than it appeared, since delivery specialists often do not monitor the time, especially for items in certain categories. This results in incomplete information about the actual duration of the delivery.

Bar plot of Item Weights Per Product Type

Figure 3: The weights of items found in each product category increase as the product code increases. Additionally, the presence of outliers diminishes in the later product types.

The resulting plot seems to carry imperfect information for certain product categories related to heavier items. This disquieting habit amongst delivery experts has its impact on the models, whose performance will be hampered by the limitations of the collected data. Turning to the individual pieces, a bar plot shows the distribution of delivery time with respect to the number of individual pieces.

Bar plot of Delivery Time per Number of Individual Pieces

Figure 4: The graph shows that as the number of individual pieces increases, so does the estimated delivery time. Moreover, outliers corresponding to larger items are prevalent in the data.

In the first few categories the median is largely unchanged; after that there is an upward trend in its behaviour. Conversely, as the product type moves to larger item categories, delivery times become homogenised, causing the extinction of outliers. Turning to other factors that might influence the time spent with the client, the delivery time slot is considered. Again, it seems that the number of individual items influences the time spent with the client, since a stable increase of time is observed as the number of pieces increases, with a steeper increase for products comprised of 6 items. Another observation that needs to be stressed is the existence of outlying observations in the first categories, most likely connected with excessive weight.

Bar plot of Estimated Delivery Time With Respect to Item Collection

Figure 5: The plot shows no particular difference in delivery time across the different time slots. However, there are many outliers in the morning slot.

This may result from traffic and stopping restrictions on vehicles during rush hours, forcing the delivery vehicle to stop further away or to search for a suitable parking space that does not fall under parking restrictions. On the whole, however, large deviations in time are not observed in the bar plot, confirming initial expectations.

Bar plot of Mean of Estimated Time At Door per Number of Orders

Figure 6: The graph shows the mean of delivery time against the number of orders, sorted by date. As can be observed, as the number of orders increases, the fewer the chances are of having available data. On the contrary, it seems that most single orders concern larger items.

7 Statistical Modelling

To forecast the TAD (time at door), it was decided to employ Random Forests and deep learning neural network methods, due to their robustness and their capacity to model large-scale problems with complex relationships. Secondly, because the results from the exploratory data analysis corroborate that postcodes carry enough predictive power, multilevel modelling along with spatial-temporal models were considered as modelling options, using postcodes to cluster observations and investigate the spatial pattern they follow. Lastly, Gaussian processes were employed due to the existence of latitude and longitude for each order. Below, the main utilised algorithms are described.

7.1 Multilevel Modelling

Multilevel models implement linear or generalised linear regression models on clustered data [4]. This is achieved by allowing the intercept and coefficients to vary for each cluster in the data. In particular, multilevel models are comprised of levels of hierarchies. At each level there is a regression model which gives rise to the next hierarchy, with the last layer being the level where the observations lie. To achieve this, the model's coefficients contain fixed and random effects, the latter defined at a higher level than the model formula. A two-level model with a random intercept can be written as

$$y_{ij} = \beta_{0j} + \beta_1 x_{1ij} + \dots + \beta_n x_{nij} + \varepsilon_{ij}$$
$$\beta_{0j} = \gamma_{00} + u_{0j}$$

where $j$ indexes the clusters. The global (fixed) coefficients have similar properties to their counterparts in simple linear regression and are obtained by solving the same equations, whereas the random effects $u_{0j}$ are drawn from a Normal distribution. The random part discerns the different clusters in the data, giving a more flexible and expressive model. As for the assumptions [5] governing multilevel models, they are very similar to those of simple linear models.

Linearity: The assumption of linearity states that there is a rectilinear (straight-line, as opposed to non-linear or U-shaped) relationship between the variables.

Normality: The assumption of normality states that the error terms at every level of the model are normally distributed.

Homoscedasticity: The assumption of homoscedasticity assumes equality of the population variances.

Independence of observations: Independence is an assumption of general linear models, which states that cases are random samples from the population and that scores on the dependent variable are independent of each other. However, multilevel models usually deal with cases where the assumption of independence is violated. Thus, multilevel models alter the hypothesis of simple linear regression by assuming independence of the residuals within and between the different levels.

7.2 Neural Networks

Neural networks have vast capacity for modelling complex and large data, such as stock prices [6], [7]; most noticeably, neural nets excel at high-dimensional data [8]. Artificial neural nets are processing units which resemble the neuronal structure of the human cerebral cortex at a smaller scale. Neural networks are organised in layers, which in turn are comprised of interconnected nodes [9]. These layers can be categorised into input, hidden and output layers, where the input and output layers are concerned with receiving and presenting information, while the hidden layers are burdened with transforming the information from the input layer in order to identify patterns in the data. The artificial neural net learns to classify instances by being exposed to patterns linked with the categories found in the data set in question. In doing so, a neural network extends the notion of linear models by making the basis functions $\phi(x)$ and the weights $w$ depend on adjustable parameters [10]. This adjustment gives rise to the basic neural network. Then a learning rule, most frequently the delta method, is employed to adjust the connection weights $w$. This is achieved by implementing the gradient descent algorithm within the solution vector space, seeking a global minimum in order to minimise the error. In particular, to update the weights, the gradient descent algorithm is used:

$$w^{(k+1)} = w^{(k)} - \eta \, \nabla E(w^{(k)}) \qquad (1)$$

where $\eta$ is the learning rate, an indication of how small a step is taken in the direction of decrease of the error function

$$E(w) = \sum_{n=1}^{N} E_n(w) = \sum_{n=1}^{N} (y_n - t_n)^2 \qquad (2)$$

and $w$ is the weight attached to the connection.

This formula adjusts the weights towards the direction of greatest decrease of the error, with a small step to safeguard against the prospect of local minima. However, the gradient descent algorithm is not feasible to apply without the aid of back-propagation. For a single layer of weights the error is

$$E(w) = \sum_{n=1}^{N} \left( y_n - f(w^T x_n) \right)^2 \qquad (3)$$

where, in this particular context, the activation function is the logistic sigmoid

$$f(x) = \frac{1}{1 + e^{-w^T x}} \qquad (4)$$

Taking partial derivatives, the derivative of the activation satisfies

$$\frac{\partial f}{\partial u} = f(u)\,(1 - f(u)) = f(u)\,f(-u)$$

Due to the form of the activation function, each unit computes

$$a = f\!\left( \sum_i w_i x_i \right) \qquad (5)$$

and, using the chain rule,

$$\frac{\partial E_n}{\partial w_i} = \frac{\partial E_n}{\partial u} \frac{\partial u}{\partial w_i} = (y - t)\, y (1 - y)\, x_i \qquad (6)$$

Turning to deep learning networks [11], they work in the same fashion as neural networks; the difference lies in the number of hidden layers involved. In this particular context, the developed net is a feed-forward deep learning network with hidden layers of 10 nodes each (its full architecture is given in the Results).

7.3 Gaussian Processes

Gaussian processes can be seen as a generalisation of linear and polynomial models. In particular, instead of making assumptions about what kind of curve could fit the data, a less parametric approach is taken. This approach enables the data to be seen as incarnations of points coming from a multivariate Normal [12]. In this context, as in many others, the mean function is assumed to be 0, with points correlating to each other according to

$$k(x, x') = \sigma_f^2 \exp\left( -\frac{(x - x')^2}{2\ell^2} \right) \qquad (7)$$

where $\sigma_f^2$, the maximum allowable covariance, should be high for functions which cover a broad range on the y axis. Each observation $y$ can be thought of as related to an underlying function $f(x)$ through a Gaussian noise model:

$$y = f(x) + \varepsilon, \qquad \varepsilon \sim \mathcal{N}(0, \sigma_n^2) \qquad (8)$$

For simplicity of exposition, the noise is folded into $k(x, x')$ by writing

$$k(x, x') = \sigma_f^2 \exp\left( -\frac{(x - x')^2}{2\ell^2} \right) + \sigma_n^2 \, \delta(x, x') \qquad (9)$$

where $\delta(x, x')$ is the Kronecker delta function. Thus, given $n$ observations $y$, our objective is to predict $y_*$, not the actual $f_*$; their expected values are identical, but their variances differ owing to the observational noise process. Calculating the covariance matrices [13]:

$$K = \begin{pmatrix} k(x_1, x_1) & \cdots & k(x_1, x_n) \\ k(x_2, x_1) & \cdots & k(x_2, x_n) \\ \vdots & & \vdots \\ k(x_n, x_1) & \cdots & k(x_n, x_n) \end{pmatrix}, \qquad K_* = \big( k(x_*, x_1), \ldots, k(x_*, x_n) \big), \qquad K_{**} = k(x_*, x_*) \qquad (10)$$

Then, using the assumption that the data come from a multivariate Normal,

$$\begin{pmatrix} \mathbf{y} \\ y_* \end{pmatrix} \sim \mathcal{N}\left( \mathbf{0}, \begin{pmatrix} K & K_*^T \\ K_* & K_{**} \end{pmatrix} \right) \qquad (11)$$

where $T$ indicates matrix transposition. We are of course interested in the conditional probability $p(y_* \mid \mathbf{y})$: given the data, how likely is a certain prediction for $y_*$?

$$y_* \mid \mathbf{y} \sim \mathcal{N}\big( K_* K^{-1} \mathbf{y},\; K_{**} - K_* K^{-1} K_*^T \big)$$

with mean estimate and variance

$$\bar{y}_* = K_* K^{-1} \mathbf{y}, \qquad \operatorname{var}(y_*) = K_{**} - K_* K^{-1} K_*^T \quad [13]$$
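The posterior mean and variance above translate almost directly into R. Below is a self-contained toy sketch for a one-dimensional input with fixed, arbitrarily chosen hyperparameters; a practical implementation would use a Cholesky factorisation rather than a direct matrix inverse.

```r
# Squared-exponential kernel, equation (7): signal variance sf2,
# length-scale l.
se_kernel <- function(a, b, sf2 = 1, l = 1) {
  sf2 * exp(-outer(a, b, function(x, xs) (x - xs)^2) / (2 * l^2))
}

gp_predict <- function(x, y, x_star, sf2 = 1, l = 1, sn2 = 0.1) {
  K      <- se_kernel(x, x, sf2, l) + sn2 * diag(length(x))  # eq. (9)
  K_star <- se_kernel(x_star, x, sf2, l)                     # K_*
  K_ss   <- se_kernel(x_star, x_star, sf2, l)                # K_**
  K_inv  <- solve(K)
  mean_star <- K_star %*% K_inv %*% y                 # K_* K^{-1} y
  var_star  <- K_ss - K_star %*% K_inv %*% t(K_star)  # K_** - K_* K^{-1} K_*^T
  list(mean = as.vector(mean_star), var = diag(var_star))
}

# Toy usage: noisy observations of a sine function.
x   <- seq(0, 2 * pi, length.out = 20)
y   <- sin(x) + rnorm(20, sd = 0.2)
fit <- gp_predict(x, y, x_star = seq(0, 2 * pi, length.out = 100))
```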

7.4 Decision Trees

Decision trees belong to the category of supervised algorithms employed to model both categorical and continuous data. In particular, a decision tree is a flowchart-like tree in which an instance passes through successive tests at each internal node [14]. Branches emerge denoting the outcome of each test, leading either to further testing or to a leaf node which holds the class label of the instance. Given a training instance, it goes through a pattern of questions by which it is decided to which category it should be assigned. However, a common problem is how to organise the sequence of questions. Therefore, a number of splitting criteria have been developed to ensure the homogeneity of the emerging subclasses. Suppose a list of attributes of a data set $A = (S_1, S_2, \ldots, S_n)$. The split is performed on the most informative attribute according to one of the following criteria.

7.4.1 Chi-Square

The chi-square criterion finds the statistical significance of the differences between sub-nodes and the parent node. It is measured as the sum of squared standardised differences between the observed and expected frequencies of the target variable. The criterion works for categorical target variables; the higher its value, the higher the statistical significance of the differences between the sub-node and the parent node.

$$\chi^2 = \sum \frac{(\text{Actual} - \text{Expected})^2}{\text{Expected}} \qquad (12)$$

7.5 Information Gain

Although the concept originated in physics, entropy is a measure of the degree of disorganisation in a system. It allows pure and less pure nodes to be discerned, so that the split is taken according to which feature requires less information to be described. For a binary node, entropy can be calculated using the formula

$$\text{Entropy} = -p \log_2 p - q \log_2 q \qquad (13)$$

where $p$ and $q$ are the probabilities of success and failure respectively in that node. Entropy is used with categorical target variables; the split chosen is the one with the lowest entropy compared to the parent node and the other candidate splits. The lower the entropy, the better the split, and the information gain of a split is the reduction in entropy it achieves relative to the parent node.

7.5.1 Other Splitting Criteria

The Gini index says that if we select two items from a population at random, then they must be of the same class, and the probability of this is 1 if the population is pure. It works with categorical target variables (Success or Failure) and performs only binary splits; the higher the value of Gini, the higher the homogeneity. CART (Classification and Regression Trees) uses the Gini method to create binary splits. The gain ratio criterion is a normalising version of information gain, which otherwise tends to favour tests with many outcomes; it represents the potential information of a split by dividing the training instances into n partitions, one per outcome. After the criterion is selected, the data set is split on all the distinct values of the chosen attribute $S_x$. Subsequently, another splitting attribute is selected and evaluated based on the frequencies of the distinct values. This procedure continues until no attribute is left.

7.5.2 Ensemble Methods

Despite the nice properties of decision trees, the algorithm generally suffers from high variance, even in its more complex incarnations. Thus, the aid of ensemble methods is enlisted to reduce the variance of predictions. Bagging builds several weak learners on different sub-samples of the same data. A random forest is an implementation of the bagging paradigm which grows a number of weak learners whose final predictions are combined using a mean/median or majority voting approach. Turning to boosting: similarly to other ensemble methods, it employs weak learners to create stronger rules. To create such rules, a weak learner must first be defined; this is achieved by applying loose learning rules under different data distributions. Gradient boosted trees encapsulate the boosting paradigm. Typically, a gradient boosted tree model grows a number of decision trees whose predictions are summed; each subsequent decision tree attempts to minimise the observed error by fitting the residuals between the target function and the current ensemble.

Part IV

Results

7.6 Predicting Estimated Delivery Time

At this point, having identified the important relationships in the data set, the analysis turned to predicting delivery time employing a number of machine learning algorithms. In this chapter, models are developed with two objectives: first, to minimise error as portrayed through the mean square error, mean absolute error and root mean square error; and second, to predict 70% of orders' delivery times within a three-minute error window. Initially, the analysis turned to the most ubiquitous algorithm for this task: regression. Indeed, regression is routinely utilised for forecasting and prediction tasks, leading to the belief that this course of action would be a good first step, since it is simple and could potentially be used as a benchmark for more advanced methods. Therefore, a simple linear regression model is developed; its fitting and evaluation are outlined below.
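The report does not reproduce the exact fitting code; the sketch below is one way the baseline and the error metrics used throughout this chapter could be computed, with the formula and column names assumed from Tables 1 and 3. The train/test split is reused by the later sketches.

```r
# Hold out 20% of the data for evaluation.
set.seed(1)
idx   <- sample(seq_len(nrow(final_data)), size = 0.8 * nrow(final_data))
train <- final_data[idx, ]
test  <- final_data[-idx, ]

# Baseline: ordinary least squares on the engineered covariates.
fit_lm <- lm(timeAtDoor ~ item_weight + product_type +
               individual_pieces + collection_item,
             data = train)

pred <- predict(fit_lm, newdata = test)

# Error metrics used throughout this chapter.
rmse   <- sqrt(mean((test$timeAtDoor - pred)^2))
mae    <- mean(abs(test$timeAtDoor - pred))
within <- mean(abs(test$timeAtDoor - pred) <= 3)  # share within 3 minutes
```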

Line Plot of Predicted against Actual Time at Door

Figure 8: The plot shows that the predicted values follow the actual line, indicating that in most cases the two values are close. However, another feature that captures the attention is the occasional large estimated delivery time.

Turning to the performance of the model, the aforementioned linear regression model succeeded in predicting 68.76% of observations within a three-minute time window, with an RMSE of 7.03. However, it is becoming more popular to exploit the spatial structure of the observations to enhance the performance of a model. Indeed, spatial information such as address postcodes can capture hidden relationships in the data [15]. This information might add another piece to the puzzle, since it allows observations to be clustered according to spatial information (postcodes). This is of importance because observations lying proximate on the map might have similar durations and be affected differently by the covariates.
Table 5: Output of the Multilevel Model - Random Effects

Name                 | Variance | Std. Dev.
new PCod (Intercept) | 26328    | 162.3
Residual             | 161291   | 401.6

Table 6: Output of the Multilevel Model - Fixed Effects

Covariates        | Estimate  | Std. Error | t value
(Intercept)       | 1463.09   | 21.60996   | 67.70
item weight       | -0.26362  | 0.04454    | -5.92
product type      | 4.62219   | 0.15899    | 29.07
individual pieces | -36.98786 | 2.79370    | -13.24
collection item   | -94.08842 | 9.84325    | -9.56

Thus, a multilevel model is employed to develop a different regression model for each cluster of observations by letting the intercept vary across postcode clusters. Below is a description of the model and its hierarchical structure:

$$y_{ij} = \beta_{0j} + \beta_1 x_{\text{item weight}} + \beta_2 x_{\text{product type}} + \beta_3 x_{\text{individual pieces}} + \beta_4 x_{\text{collection item}} + \varepsilon_{ij}$$
$$\beta_{0j} = \gamma_{00} + u_{0j}$$
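Assuming the lme4 package and the truncated-postcode grouping factor reported as new PCod in Table 5 (written new_PCod below), the model could be fitted as follows; a sketch, not the project's exact code.

```r
library(lme4)

# Random intercept per postcode cluster; fixed effects as in Table 6.
fit_mlm <- lmer(timeAtDoor ~ item_weight + product_type +
                  individual_pieces + collection_item +
                  (1 | new_PCod),
                data = train)

summary(fit_mlm)  # fixed and random effect estimates

# Postcodes unseen in training fall back on the population intercept.
pred_mlm <- predict(fit_mlm, newdata = test, allow.new.levels = TRUE)
```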

Having identified the model parameters, focus shifts to the performance of the model. The model predicted 72.51% of the observations within a three-minute time interval, with root mean square and mean absolute errors of 6.13 and 3.72 respectively. Below, a plot of both predicted and actual values is depicted for graphical inspection.

Line Plot of Predicted against Actual Time at Door

Figure 9: The plot shows that the predicted values follow the actual line, indicating that in most cases the two values are close. However, another feature that captures the attention is the occasional large estimated delivery time.

From the plot, it is apparent that there is an alarming pattern in the predicted durations of delivery. Turning to another popular method, Support Vector Machines are employed to predict the delivery time. In this particular case, initial expectations were verified: the method does not perform as well as the others, due to the complexity of the data, without a smart choice of kernel. Support Vector Machines managed to predict 69% of observations in the desired interval, with an inflated mean absolute error of 7.20 and a root mean square error of 4.24.
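For reference, an SVM regression of this kind could be fitted with the e1071 package; the radial kernel shown is the package default, not necessarily the kernel actually used in the project.

```r
library(e1071)

# eps-regression with a radial basis kernel (the e1071 default),
# reusing the train/test split from the regression sketch.
fit_svm <- svm(timeAtDoor ~ item_weight + product_type +
                 individual_pieces + collection_item,
               data = train, kernel = "radial")

pred_svm <- predict(fit_svm, newdata = test)
```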

Line Plot of Predicted against Actual Time at Door

Figure 10: The plot shows that even though in most cases delivery time is predicted close to the actual value, there is a systematic pattern of inflating the estimated delivery time.

7.6.1 Decision Trees

Leaving the subpar performance of Support Vector Machines behind, it is time to attempt to predict delivery time utilising more advanced methods: decision tree algorithms. However, due to some limitations, it is considered more beneficial to employ ensemble methods built on decision trees, such as Random Forests and Gradient Boosted Trees. Turning first to Random Forests, the algorithm succeeds in predicting 70% of observations within a 3-minute time interval, and the performance is reasonable in terms of mean square error, mean absolute error and root mean square error. Below, a plot illustrating the performance of the Random Forest can be found.

Figure 11: The plot shows that the predicted values follow the actual line, indicating that in most cases the two values are close. However, another feature that captures the attention is the occasional large estimated delivery time.

Furthermore, the rf package used to develop the random forest supports graphical assessment of the importance of the variables in the model. Below, a plot depicting the key variables in the model is shown.
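A sketch of this fit with the randomForest package follows (assuming that is the implementation behind the "rf" package mentioned above, and reusing the assumed column names and train/test split).

```r
library(randomForest)

# Bagged ensemble of 500 trees (the package default), with variable
# importance recorded for plotting.
fit_rf <- randomForest(timeAtDoor ~ item_weight + product_type +
                         individual_pieces + collection_item,
                       data = train, ntree = 500, importance = TRUE)

pred_rf <- predict(fit_rf, newdata = test)

# Graphical assessment of variable importance, as in Figure 12.
varImpPlot(fit_rf)
```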

Importance Plot of Variables in the Random Forest

Figure 12: The plots show that collection item and product type play the most significant part in estimating the duration of delivery, followed by item weight and the number of individual pieces of which an order is comprised.

From the plot, it is concluded that the variable importances are somewhat counter-intuitive compared to what was expected based on the exploratory data analysis. As for the Gradient Boosted Trees, the performance is similar to the Random Forest, with a slight increase to 73% in the percentage of correctly predicted observations within a 3-minute error. In particular, the mean absolute error and root mean square error are estimated at 3.73 and 6.22 respectively.
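Gradient boosted trees could be fitted, for example, with the gbm package; the hyperparameters below are illustrative, as the report does not state the ones used.

```r
library(gbm)

# Boosted regression trees: each tree fits the residuals of the
# current ensemble, and the predictions are summed.
fit_gbm <- gbm(timeAtDoor ~ item_weight + product_type +
                 individual_pieces + collection_item,
               data = train, distribution = "gaussian",
               n.trees = 1000, interaction.depth = 3, shrinkage = 0.05)

pred_gbm <- predict(fit_gbm, newdata = test, n.trees = 1000)
```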

7.6.2 Deep Neural Networks

Finally, no effort would be complete without trying to develop a deep learning network, given its capacity for dealing with complex and large data such as stock prices [6], [7]; most noticeably, neural nets excel at high-dimensional data [8]. In particular, a feed-forward deep learning network was developed with 3 hidden layers comprised of 10 nodes each, trained for 500 epochs, which succeeded in correctly predicting 77% of observations within a three-minute error. More specifically, the mean absolute error is 2.75 and the root mean square error decreased to 3.012 compared to the decision trees.
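The report does not name the framework used; as one concrete possibility, the architecture described (three hidden layers of 10 units, 500 epochs) can be expressed with the h2o package, as sketched below with the assumed column names.

```r
library(h2o)
h2o.init()

train_h2o  <- as.h2o(train)
predictors <- c("item_weight", "product_type",
                "individual_pieces", "collection_item")

# Feed-forward network: three hidden layers of 10 units, 500 epochs.
fit_dl <- h2o.deeplearning(x = predictors, y = "timeAtDoor",
                           training_frame = train_h2o,
                           hidden = c(10, 10, 10), epochs = 500)

pred_dl <- h2o.predict(fit_dl, as.h2o(test))
```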
Line Plot of Predicted against Actual Time at Door

Figure 13: The plot shows that the predicted values follow the actual line, indicating that in most cases the two values are close. However, another feature that captures the attention is the occasional large estimated delivery time.

It is apparent that the rewarding results certainly justified the time and resources needed to develop a deep learning approach for this problem.

7.6.3 Gaussian Processes

Although the industry is moving towards deep learning for the reasons discussed above, it is not the only option available. Gaussian processes have found particular application in modelling complex systems [16], including those with spatial-temporal information [17]. Thus, a Gaussian process model is employed to predict delivery time using both item characteristics and spatial information. In contrast to the other methods, it was only feasible to implement it on Amazon Web Services, since its computational complexity outweighed the local computer's resources. As for its performance, the algorithm put those previously discussed into the shade.
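One way to fit such a model in R is kernlab's gausspr with a radial basis kernel over both the item covariates and the coordinates (lay, lng from Table 3); this is a sketch under that assumption, not the project's actual code. Note that the kernel matrix inversion scales as O(n^3), which is consistent with the need for cloud resources on a data set of this size.

```r
library(kernlab)

# GP regression with a radial basis (squared-exponential) kernel over
# item characteristics and the delivery coordinates.
fit_gp <- gausspr(timeAtDoor ~ item_weight + product_type +
                    individual_pieces + collection_item + lay + lng,
                  data = train, kernel = "rbfdot")

pred_gp <- predict(fit_gp, newdata = test)
```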

Scatter Plot of Predicted against Actual Time at Door

Figure 14: The scatter plot shows the predicted against the actual time at door for the Gaussian process model.

7.7 Recommendation

No report would be complete without providing a list of recommendations that have emerged from scrutiny of the project. It is hoped that this piece of work will provide a useful starting point for future workers.

- The data sets DFS DET DEL, DFS DET HIST and AppoloOrdersDetails provide useful insight into the problem. DFS DET DEL and DFS DET HIST should be combined using an inner join, and the resulting data set should be merged with AppoloOrdersDetails using a left join.

- The usefulness of the HistMasterDrops data set has been limited by the lack of a primary key. It is proposed to join it using a combination of existing columns, such as the postcode and date columns. It is recommended not to pay particular attention to this data set when no external data is considered.

- The exploratory data analysis revealed that individual pieces, item weight and product type play the most important roles in predicting time at door.

- Deep learning, Gaussian processes and Random Forests were amongst the machine learning algorithms that bore the most fruitful results. Future workers should pay more attention to optimising the performance of those models, or start by considering those techniques first.

Part V

Discussion
In an attempt to rethink the routing optimisation problem as part of this project, a plethora of machine learning algorithms were employed to predict time at door. In particular, the analysis showed that although a number of factors influence the time spent with the client, the type of product, whether there is an item for collection, and the number of individual pieces comprising an item play the most significant part in estimating time at door. Secondly, all the machine learning algorithms utilised performed comparably well, with the percentage of estimated times lying within the three-minute error window reaching almost 74%. However, Gaussian processes and deep learning performed significantly better when the process was repeated on cloud computing platforms. This is the reason behind the industry's shift towards computationally intensive methods such as deep learning and, in particular, Gaussian processes. On the whole, machine learning algorithms successfully tackled the problem of predicting the time that a delivery expert needs to stay with the client, with the majority of observations predicted within a three-minute interval of the correct time. Most importantly, this project stands on the same side as the few which endeavour to bring operational research and data science closer together under the umbrella of business intelligence. However, even though the developed models bore tangible results, in the future a primary key could be constructed to facilitate merging, or a combination of existing columns could be used. In addition, it was not feasible to enrich the data with external resources. In summary, the present study enlisted the aid of machine learning algorithms to predict time at door. In doing so, the analysis bore tangible results, qualifying state-of-the-art algorithms such as Gaussian processes and deep learning for tackling this task.

References

[1] V. Chvatal, W. Cook, G. B. Dantzig, D. R. Fulkerson, and S. M. Johnson, "Solution of a large-scale traveling-salesman problem," 50 Years of Integer Programming 1958-2008, pp. 7-28, 2010.

[2] N. Christofides, "Worst-case analysis of a new heuristic for the travelling salesman problem," tech. rep., Carnegie-Mellon Univ., Pittsburgh, PA, Management Sciences Research Group, 1976.

[3] D. L. Applegate, R. E. Bixby, V. Chvatal, and W. J. Cook, The Traveling Salesman Problem: A Computational Study. Princeton University Press, 2011.

[4] M. Kuhn and K. Johnson, Applied Predictive Modeling, vol. 810. Springer, 2013.

[5] J. J. Faraway, Extending the Linear Model with R: Generalized Linear, Mixed Effects and Nonparametric Regression Models, vol. 124. CRC Press, 2016.

[6] B. Mandelbrot, "Forecasts of future prices, unbiased markets, and martingale models," The Journal of Business, vol. 39, no. 1, pp. 242-255, 1966.

[7] B. Alipanahi, A. Delong, M. T. Weirauch, and B. J. Frey, "Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning," Nature Biotechnology, vol. 33, no. 8, pp. 831-838, 2015.

[8] I. Arel, D. C. Rose, and R. Coop, "DeSTIN: A scalable deep learning architecture with application to high-dimensional robust pattern recognition," in AAAI Fall Symposium: Biologically Inspired Cognitive Architectures, 2009.

[9] P. Flach, Machine Learning: The Art and Science of Algorithms that Make Sense of Data. Cambridge University Press, 2012.

[10] C. M. Bishop, Pattern Recognition and Machine Learning. Springer, 2006.

[11] Y. LeCun, Y. Bengio, and G. Hinton, "Deep learning," Nature, vol. 521, no. 7553, pp. 436-444, 2015.

[12] D. Koller and N. Friedman, Probabilistic Graphical Models: Principles and Techniques. MIT Press, 2009.

[13] M. Ebden et al., "Gaussian processes for regression: A quick introduction," The Website of the Robotics Research Group, Department of Engineering Science, University of Oxford, 2008.

[14] L. Rokach and O. Maimon, Data Mining with Decision Trees: Theory and Applications. World Scientific, 2014.

[15] K. E. Pickett and M. Pearl, "Multilevel analyses of neighbourhood socioeconomic context and health outcomes: a critical review," Journal of Epidemiology & Community Health, vol. 55, no. 2, pp. 111-122, 2001.

[16] N. Chen, Z. Qian, I. T. Nabney, and X. Meng, "Wind power forecasts using Gaussian processes and numerical weather prediction," IEEE Transactions on Power Systems, vol. 29, no. 2, pp. 656-665, 2014.

[17] Y. Xie, K. Zhao, Y. Sun, and D. Chen, "Gaussian processes for short-term traffic volume forecasting," Transportation Research Record: Journal of the Transportation Research Board, no. 2165, pp. 69-78, 2010.
