
FACULTY OF COMPUTER SCIENCE AND MATHEMATICS

STA555: FUNDAMENTALS OF DATA MINING

E-COMMERCE SHIPPING DATA


USING DIRECTED DATA MINING TECHNIQUES

GROUP: D2CS2414A
MEMBERS:
1. INTAN SYAFIQA SOFIA BINTI MOHD ZAIDI (2020989229)
2. LUQMAN HAKIM BIN ZAINUDDIN (2020985105)
3. NURHAZIYAH ATHIRAH BINTI HUSMADY (2020988389)
4. NUR EMIESYA BINTI JAMALUDDIN (2021178615)
5. NUR EZATI BINTI ALIAS (2021113769)

LECTURER:
SIR MOHD NOOR AZAM NAFI

DATE:
12TH JULY 2021
CONTENT

TITLE PAGE

CHAPTER 1 OVERVIEW OF THE PROJECT

1.1. INTRODUCTION 3
1.2. PROBLEM STATEMENT 4
1.3. RESEARCH QUESTIONS 4
1.4. RESEARCH OBJECTIVES 4
1.5. SIGNIFICANCE OF PROJECT 5
1.6. SCOPE AND LIMITATION OF PROJECT 5

CHAPTER 2 LITERATURE REVIEW 6

CHAPTER 3 RESEARCH METHODOLOGY 9

CHAPTER 4 DATA ANALYSIS AND FINDINGS 27

CHAPTER 5 CONCLUSION 48

REFERENCES 49

CHAPTER 1 OVERVIEW OF THE PROJECT

1.1. INTRODUCTION

E-commerce shipping, simply put, refers to the shipping services employed by companies that sell products over the internet to make shipping their products more manageable and affordable. However, in an era when online shopping is becoming increasingly popular and companies are racing to the bottom to provide the best online shopping experience at the lowest cost, e-commerce shipping has come to mean a few different things.

E-commerce shipping has become more competitive than ever, with large retailers like Amazon and Walmart making shopping faster and cheaper for their customers. While shipping via carriers like UPS and FedEx may work fine for companies that are just starting out with a few packages a day, as a business grows and attracts more customers it needs faster, cheaper and more efficient shipping resources in order to keep up and meet customer expectations. That is where e-commerce shipping services come in.

Apart from that, electronic commerce or e-commerce is a business model that lets firms and individuals buy and sell things over the internet. E-commerce, which can be conducted over computers, tablets or smartphones, may be thought of as a digital version of mail-order catalogue shopping. E-commerce has helped businesses establish a wider market presence by providing cheaper and more efficient distribution channels for their products or services. For example, the mass retailer Target has supplemented its brick-and-mortar presence with an online store that lets customers purchase everything from clothes to coffeemakers to toothpaste to action figures.

1.2. PROBLEM STATEMENT

E-commerce has grown at an incredible rate since its inception, and so has the competition among online sellers. This is one of the challenges faced by e-commerce companies today. Although the online approach has made shopping a lot easier for consumers, it has also brought unique challenges for e-commerce companies. The rise of digitalization has transformed the way companies operate. Customers no longer need to take a trip to brick-and-mortar stores to make their purchases. E-commerce companies still deal in goods and services, but now this takes place across multiple touchpoints within an online environment. Not all businesses are making money consistently; there are challenges standing in their way, big and small alike. Developing an e-commerce business, especially preparing an online shop to cater to customers' needs, is hard. Great care has to be taken over everything, from website maintenance through to customer service.

1.3. RESEARCH QUESTIONS

The research questions for this project are constructed as follows:

i. Which is the best model within each of the decision tree, neural network and logistic regression analyses?
ii. Which of the three best models is the most suitable for predicting e-commerce shipping?
iii. Will a product shipment be delivered on time or not?

1.4. RESEARCH OBJECTIVES

The objectives of this study are:

i. To identify the best predictive model within each of the logistic regression, neural network and decision tree techniques used to model e-commerce shipping.
ii. To determine the best overall model to predict the e-commerce shipping data.
iii. To meet e-commerce customer demand.

1.5 SIGNIFICANCE OF PROJECT

This research is conducted to study e-commerce shipping using decision tree, logistic regression and neural network predictive models. Thus, this study will contribute to research on this dataset and to understanding e-commerce shipping through these three predictive models. E-commerce is about more than just free shipping and fast delivery, though those can be integral parts of the plan. Besides that, the information from this research can serve as a future reference for other researchers working on this subject.

1.6 SCOPE AND LIMITATION OF THE PROJECT

Every study has limitations. Study limitations can exist due to constraints on research design or methodology, and these factors may impact the findings of the study. This study was carried out purely for academic purposes, within a limited time frame. The data could be incomplete or have missing values. A further limitation concerns the nature of the measures used: the measures included in this research were all based upon the literature review related to e-commerce shipping.

In this study, a sample of 10,999 observations of 12 variables was recorded and used for analysis. The population is large enough to suit our subject, data mining. This large data set was analysed using the SAS Enterprise Miner software.

CHAPTER 2 LITERATURE REVIEW

E-commerce, also known as electronic commerce, is the trading of products or services using computer networks such as the Internet. Mobile commerce, electronic funds transfer, supply chain management, Internet marketing, online transaction processing, electronic data interchange (EDI), inventory management systems, and automated data gathering systems are all examples of electronic commerce. Online shopping websites for retail sales direct to consumers, providing or participating in online marketplaces that process third-party business-to-consumer or consumer-to-consumer sales, business-to-business buying and selling, and gathering and using demographic data through web contacts and social media are just a few examples of what an e-commerce business might do (Kutz, 2016).

In today's economy, e-commerce is a potentially rising industry. Traditional boundaries will be replaced by a new technology, as well as a mechanism and media
for acquiring products and services, in the near future. The electronic payment system
opens the door to new international and national trade relationships. Besides,
consumers would benefit from e-commerce perks such as interactive communications,
quick delivery, and more personalization that would only be available through online
buying. New transaction security methods are in the works, and electronic commerce
will present both opportunities and threats for businesses and consumers to consider.
E-commerce, it appears, is still in its infancy. Malaysia has a small number of success
stories. Businesses must have faith in order to invest, and the general public must have
faith in order to buy online (Paynter & Lim, 2001).

Purchasing habits in Europe have evolved dramatically over the previous decade, with a large number of consumers now shopping online. E-commerce in physical products creates a considerable need for specialist delivery services, making last-mile logistics more problematic. Home delivery services in particular, which are frequently chosen by online customers, contribute to the atomization of parcel flows, causing specific challenges in metropolitan areas. Alternative delivery methods, on the other hand, are rapidly expanding, particularly in major cities (Morganti et al., 2014).

2.1 MODEL 1: DECISION TREE

A mixed research method was used, in which eye-tracking technology was combined with a comparative statistical study using a decision tree model, to obtain the most objective information about the behaviour of different generations in virtual spaces, while taking into account the specifics of this research subject. The goal of the study was, first, to see how the baby boomer, X, Y and Z generations reacted to the most relevant communication elements throughout the electronic purchase phase of a browsing activity; second, to determine which communication elements are statistically significant during the electronic purchase phase while performing a browsing task; and third, to evaluate the non-expressed factual reactions of different generations to the IMCT elements during the electronic purchase phase while performing a browsing task (Sabaitytė et al., 2019).

Decision tree models (DTMs) were used because the goal of this study was to
identify the preferred parts of internet marketing during the buying phase. Decision tree
models were chosen not only for their ability to classify items by group, but also to
explain consumer choices regarding IMCT preferences in this situation. There are a
number of known DTM methods (CRT, QUEST, CHAID (Exhaustive CHAID)) (Kass, 1980), from which the DTM CHAID (Chi-squared Automatic Interaction Detection)
was chosen. The CHAID approach was used because of the dependent variable's
characteristics (generation – a categorical variable). The model consists of 9 nodes, 5
of which are endpoints. The decision tree is three nodes deep below the root node,
allowing the preference for IMCT elements to be defined hierarchically. The percentile
graphs for the dependence of the Gain and Index variables were used to analyse how
useful and appropriate the compiled DTM was for each generation. In the vast majority
of cases, the DTM was shown to be accurate and instructive in anticipating customer
behaviour (Sabaitytė et al., 2019).

2.2 MODEL 2: NEURAL NETWORK

In another study, a framework for user behavioural profiling is offered, and customer behavioural patterns are applied in the e-Commerce environment to identify
customers. Policies such as user restriction or blocking can be implemented based on
activity control. In this study, two methodologies were used: neural network
classification and a measure of behavioural pattern similarity. According to the results
of a multi-layer perceptron with a back propagation learning method, there is less error
and up to 15.12% better accuracy on average. The findings suggest that as the number
of consumers grows, the neural network approach's accuracy in recognising client
pattern behaviour improves. The accuracy of the pattern similarity method, on the other
hand, falls (Sohrabi et al., 2014).

2.3 MODEL 3: LOGISTIC REGRESSION

Another case study analyses the relationship between logistics strategies and logistics problems in the e-commerce of physical goods, in order to present a general normative model and draw some key managerial implications that help B2C merchants design their logistics strategies. A combination of a literature review and a multiple case-study technique was employed to complete this work. In particular, 28 case studies of significant B2C e-commerce merchants in Italy - with diverse business models - operating in the major online industries that sell physical items were carried out to examine the relationship between the characteristics of the logistics problem and the logistics approach used. First, using case studies to collect both qualitative and quantitative data made it possible to construct a thorough picture of both the logistics problems and the logistics strategies. Second, several exploratory case studies were required to establish a theory in an area where the literature was lacking (Ghezzi et al., 2012).

CHAPTER 3 RESEARCH METHODOLOGY

3.1 DATA EXPLORATION

3.1.1 Variable Scenario

This dataset contains 10 input variables, which are Warehouse block, Shipment mode, Customer care calls, Customer rating, Cost of the product, Prior purchases, Product importance, Gender, Discount offered and Weight (grams), together with a target variable, Reached on time.

Target variable (Reached.on.Time_Y.N): Y = yes, N = no

Variable descriptions:

Warehouse block         A, B, C, D or F
Shipment mode           Flight, Road or Ship
Customer care calls     2, 3, 4, 5, 6 or 7
Customer rating         1, 2, 3, 4 or 5
Cost of the product     96 - 310
Prior purchases         2, 3, 4, 5, 6, 7, 8 or 10
Product importance      Low, Medium or High
Gender                  M (Male) or F (Female)
Discount offered        1 - 65
Weight (grams)          1001 - 7846
Reached on time         Y = yes or N = no
Table 3.1 Variable Description

Variable Name             Role      Measurement Level
Warehouse_block           Input     Nominal
Mode_of_Shipment          Input     Nominal
Customer_care_calls       Input     Nominal
Customer_rating           Input     Ordinal
Cost_of_the_Product       Input     Interval
Prior_purchases           Input     Interval
Product_importance        Input     Ordinal
Gender                    Input     Nominal
Discount_offered          Input     Interval
Weight_in_gms             Input     Interval
Reached.on.Time_Y.N       Target    Binary
Table 3.2 Variable Roles and Measurement Level

3.1.2 StatExplore

The StatExplore tool is a multipurpose tool used to examine variable distributions and
statistics in our data set. The tool generates summarization statistics.

Figure 3.1 Predictive Analysis for StatExplore

Figure 3.2 Variable Worth Diagram for All Variables

Figure 3.3 Variable Worth for Discount_Offered

Figure 3.4 Variable Worth for Weight_in_gms

Figure 3.5 Variable Worth for Prior_Purchase

Figure 3.6 Variable Worth for Customer_Care_Calls

Figure 3.7 Variable Worth for Product_Importance

Figure 3.8 Variable Worth for Customer_Rating

Figure 3.9 Variable Worth for Warehouse_block

Figure 3.10 Variable Worth Diagram for Mode_of_Shipment

Figure 3.11 Variable Worth for Gender

3.1.3 Replacement
The Replacement tool enables us to reassign and consolidate levels of categorical
inputs.

Figure 3.12 Predictive Analysis for Replacement

Figure 3.13 Results for Replacement Process

3.1.4 Data Partition

We can use the Data Partition tool to divide data sets into training, validation and test data sets. The training data set is used for preliminary model fitting. The validation data set is used to monitor and tune the model during estimation, as well as to assess the model. The test data set can be used as an additional holdout data set for model assessment.

Figure 3.14 Predictive Analysis for Data Partition

Figure 3.15 Output for Data Partition
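For readers who want to reproduce the partitioning outside SAS Enterprise Miner, a minimal sketch is shown below. It assumes the shipping data sits in a CSV file (the file name is hypothetical) and uses pandas and scikit-learn to produce a stratified split in roughly the 60/40 train/validation proportion seen in the later output (6598 and 4401 records); the project itself performed this step with the Data Partition node.

    # Sketch of a stratified train/validation split, analogous to the Data Partition node.
    import pandas as pd
    from sklearn.model_selection import train_test_split

    df = pd.read_csv("ecommerce_shipping.csv")        # hypothetical file; 10,999 rows expected

    # Stratifying on the target keeps the on-time/late proportion equal in both partitions.
    train, valid = train_test_split(
        df,
        train_size=0.60,                              # roughly 6598 of 10,999 records
        stratify=df["Reached.on.Time_Y.N"],
        random_state=42,
    )
    print(len(train), len(valid))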

3.1.5 Filtering

The Filter tool creates and applies filters to our training data set and, optionally, to the validation and test data sets. We can use filters to exclude certain observations, such as extreme outliers and errant data, that we do not want to include in our mining analysis.

Figure 3.16 Descriptive Analysis for Filter

Figure 3.17 Output for Filtering Process

3.1.6 Sample

The Sample tool enables us to take simple random samples, nth-observation samples, stratified random samples, first-n samples and cluster samples of data sets. For any type of sampling, we can specify either the number of observations or a percentage of the population to select for the sample. If we are working with rare events, the Sample tool can be configured for oversampling or stratified sampling.

Figure 3.18 Descriptive Analysis for Sample

Figure 3.19 Output for Sample
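As an illustration of stratified sampling, the sketch below draws a 10% random sample within each level of the target variable using pandas; the file name, the 10% fraction and the choice of stratification column are assumptions made for the example rather than settings taken from the project.

    # Sketch of a stratified random sample: 10% of the records within each target level.
    import pandas as pd

    df = pd.read_csv("ecommerce_shipping.csv")        # hypothetical file name

    sample = df.groupby("Reached.on.Time_Y.N").sample(frac=0.10, random_state=1)

    # The class proportions in the sample match those of the full data set.
    print(sample["Reached.on.Time_Y.N"].value_counts(normalize=True))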

3.2 Data Mining Technique

Data mining is a business process that involves sorting through large amounts
of data to find significant patterns and rules. The purpose of data mining is to discover
patterns that can be used by a business or any organization to aid in the day-to-day
operations of the firm. In data mining, there are two sorts of models: profiling models
and predictive models. When the target and inputs are from the same time frame, a
profiling model is used, and a predictive model is used when the target is from a later
time frame than the inputs. The main tool for directed data mining is the predictive
model. The historical data provides examples of all the target values and directed data
mining focuses on one or more variables that are targets. In other words, directed data
mining looks for patterns that explain the goal values rather than just any pattern in the
data. There are several tools for data mining, but we concentrated on only three
predictive modelling techniques in our project: decision trees, logistic regression, and
neural networks. In this section, we'll go through all of these strategies in detail. There
are eleven steps in the direct data mining process. First, we convert the business
challenge into a data mining problem in our project. Then we choose info that is
relevant. Then we learn about the data. We develop a model set after admitting the data.
After that, we repair data issues and transform data to bring information to the surface.
After that, we create and evaluate models. The next stage is to put the models into action
and evaluate the results. Finally, if there is an issue, we must start over.

3.2.1 Decision Tree

Because they can be applied to a wide range of problems and create models that explain how they work, decision trees are one of the most powerful directed data mining approaches. A decision tree is a hierarchical set of rules that describes how to divide a large set of records into successively smaller groups. A root node, branches and leaf nodes make up the structure: each internal node represents an attribute test, each branch represents a test result, and each leaf node represents a class label. The root node is the highest node in the tree. The data is divided into smaller groups through the decision tree, with each new set of nodes having greater purity than its ancestor in terms of the target variable. The best split is one that increases purity in the children the most, creates nodes of roughly equal size, or at the very least does not create nodes with very few records.

Five splitting criteria and tree-growing algorithms were used in conducting this study: Gini (population diversity), entropy reduction or information gain, logworth, Classification and Regression Trees (CART) and Chi-Squared Automatic Interaction Detection (CHAID). Logworth, entropy and Gini are used for categorical targets. CART withholds a sample of the data for validating and pruning the tree, whereas CHAID uses the entire training data, relying on statistical tests to build the tree.

Gini measures how often a randomly selected element from the set would be labelled incorrectly if it were labelled according to the distribution of labels in the subset. In the convention used here, the Gini score of a node is the sum of the squared class proportions: a score of 0.5 means two classes are equally represented, and a fully pure node scores 1. Because purer nodes have higher scores, the split that maximizes the Gini score is chosen. Entropy provides another measure of purity. The entropy of a node is the sum, over all target values in the node, of the proportion of records with that value multiplied by the base-two logarithm of that proportion; because the logarithm of a probability is always negative, the sum is multiplied by -1. The entropy of a split is the weighted average of the entropies of its children, and the decision tree chooses the split that reduces entropy the most. For the logworth criterion, chi-square statistics are computed between the binary target and all candidate splits; the logworth of a split is the negative base-ten logarithm of the chi-square p-value. The best split of each input variable is found first, and then the split with the highest logworth across all input variables is chosen.
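To make these criteria concrete, the sketch below evaluates one hypothetical split of a binary target using the conventions described above: Gini as the sum of squared class proportions, entropy in bits, and logworth as the negative base-ten logarithm of the chi-square p-value. The counts are invented for illustration and NumPy/SciPy are assumed to be available; SAS Enterprise Miner computes these measures internally.

    # Purity measures for one candidate split of a binary target (hypothetical counts).
    import numpy as np
    from scipy.stats import chi2_contingency

    def gini_score(counts):
        # Sum of squared class proportions: 1 for a pure node, 0.5 for two equal classes.
        p = np.asarray(counts, dtype=float) / np.sum(counts)
        return float(np.sum(p ** 2))

    def entropy(counts):
        # -sum(p * log2(p)) over the classes present in the node.
        p = np.asarray(counts, dtype=float) / np.sum(counts)
        p = p[p > 0]
        return float(-np.sum(p * np.log2(p)))

    def logworth(table):
        # table: target value (rows) by child node (columns); logworth = -log10(p-value).
        _, p_value, _, _ = chi2_contingency(table)
        return float(-np.log10(p_value))

    left, right = [40, 10], [15, 35]                  # hypothetical child-node class counts
    print(gini_score(left), entropy(left))
    print(logworth(np.array([left, right]).T))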

CHAID and CART are the two oldest types of decision trees. They are also the most commonly used types of decision trees because they are easy to understand. The CART method grows binary trees, splitting repeatedly as long as new splits that increase purity can be found. It then chooses among the candidate subtrees by applying them to the validation set, with the tree with the lowest misclassification rate selected as the final model. CHAID uses the Chi-Square test as a statistical criterion to stop tree growth. The Chi-Square test compares observed results to theoretical values; branches whose distributions do not differ significantly from what would be expected under independence are not split further, and the tree stops growing at those branches. The Chi-Square test can also be used to check whether the distribution of validation set results differs from the distribution of training set results.

3.2.2 Logistic Regression

When the dependent variable is binary, logistic regression is the appropriate regression strategy to use. It is a predictive analysis. Logistic regression is used to describe data and to explain the relationship between one dependent binary variable and one or more nominal, ordinal, interval or ratio-level independent variables. The binary target outcome takes only two values, 1 or 0. The binary value is then expressed as probabilities, P(Y=1) = p and P(Y=0) = 1 - p, and p is expressed as odds = p / (1 - p). This is used to find the probability of an event occurring. The odds ratio is a measure of association between an exposure and an outcome: it represents the odds that an outcome will occur given a particular exposure, compared with the odds of the outcome occurring in the absence of that exposure. Logistic regression begins with the logistic function, which takes on values between zero and one.

The logistic function is: logit = ln(p / (1 - p)) = β0 + β1X

The method of maximum likelihood is used to find the best-fitting coefficients for logistic regression.

Thus, the odds become: odds = p / (1 - p) = e^(β0 + β1X)

Solving for the probability gives: p = 1 / (1 + e^-(β0 + β1X))

In our project, we connected eight logistic regression models to the impute node. Logistic Regression Main (LR MAIN) includes the main effects of all the variables used; Logistic Regression Inter (LR INTER) includes all two-factor interactions of the class variables; and Logistic Regression Poly (LR POLY) includes polynomial terms, up to the specified degree, for the interval variables. Logistic Regression Main Inter (LR MAIN INTER), Logistic Regression Main Poly (LR MAIN POLY) and Logistic Regression Inter Poly (LR INTER POLY) combine these pairs of term sets, while Logistic Regression Main Inter Poly (LR MAIN INTER POLY) includes main effects, two-factor interactions and polynomial terms together. The Poly Degree property specifies the polynomial degree when polynomial terms are included in the model. Lastly, we used Logistic Regression Main Stepwise (LR MAIN STEPWISE), which applies stepwise selection to the main effects.

3.2.3 Neural Network

The structure and function of the human brain inspired the creation of neural networks. Artificial neural network models are learning algorithms that can analyse a given classification challenge. Neural networks are used in data mining to turn raw data into valuable information. They are capable of handling complex models and, in comparison to statistical methods, are simple to comprehend. They adapt themselves, self-learning as new information is processed. Because neural networks can generalize and learn from data input, they can be used to model the neural connections in human brains. Artificial neural networks have a structure that is quite similar to biological neural networks in that they are made up of artificial neurons and the connections between them. The function by which an artificial neuron receives input and produces output is known as the node's activation function. The combination and transfer functions are the two halves of the activation function.

The combination function merges the inputs into a single value, which is then passed to the transfer function to produce the output. The combination function usually employs a set of weights, one for each input; the best values for the weights are assigned during training of the network. The weighted sum is calculated by multiplying each input by its weight and then adding the products together. The transfer function then follows: it is a mathematical description of the relationship between the combined input and the output, and it determines how closely an artificial neuron's behaviour resembles that of a biological neuron. The common transfer functions used are the step, linear, logistic and hyperbolic tangent functions.

The input layer, hidden layer and output layer make up the structure of an artificial neural network. The input layer normalizes all inputs so that their ranges are comparable. The neural network's output layer is determined by the target: if the target is continuous, a linear combination is employed; if the target is binary, a logistic function is used. Non-linear activation functions are found in the hidden layer. Each unit of the hidden layer is linked to every unit of the input layer, and its value is determined by multiplying each input by its weight, adding the products together, and then applying the transfer function.
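A minimal sketch of this computation for a single record is given below, assuming one hidden layer with three hidden units, a hyperbolic tangent transfer function and a logistic output unit for a binary target; all inputs and weights are hypothetical and are not taken from the trained SAS Enterprise Miner models.

    # Forward pass of a tiny multilayer perceptron for one record (hypothetical weights).
    import numpy as np

    def hidden_unit(inputs, weights, bias):
        combination = np.dot(inputs, weights) + bias    # combination function: weighted sum
        return np.tanh(combination)                     # transfer function: hyperbolic tangent

    def output_unit(hidden, weights, bias):
        z = np.dot(hidden, weights) + bias
        return 1.0 / (1.0 + np.exp(-z))                 # logistic output for a binary target

    x = np.array([0.2, -1.0, 0.5])                      # one record's normalized inputs

    hidden_weights = [
        (np.array([0.1, -0.3, 0.7]), 0.05),             # weights and bias of hidden unit 1
        (np.array([-0.6, 0.2, 0.4]), -0.10),            # hidden unit 2
        (np.array([0.3, 0.3, -0.2]), 0.20),             # hidden unit 3
    ]
    h = np.array([hidden_unit(x, w, b) for w, b in hidden_weights])

    print(output_unit(h, np.array([0.5, -0.4, 0.9]), 0.1))   # predicted probability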

In our project, we have connected six neural network nodes to the impute node. The nodes are Neural Network (NN) 3, which means there are three hidden layers; NN5 indicates five hidden layers; NN8 means eight hidden layers; NN11 has 11 hidden layers; NN14 has 14 hidden layers; and NN20 has 20 hidden layers.

3.2.4 Model Comparison

The model comparison tool compares the performance of one or more distinct
prediction models using a validation or test dataset. It produces a report, a table of basic
error measurements, and a table of each model's prediction outcomes. When estimating
new records, a model that is overly flexible can lead to overfitting. We had to examine
three models in our project: a decision tree, logistic regression, and a neural network.
We had to figure out which models were overfit, underfit, and best.

To find an overfit model, we look for the one with the biggest gap between the training and validation/test results across all performance criteria. We calculated the gap by subtracting the training value from the validation value. For an underfit model, we look among the remaining models for one whose validation error value is smaller than its training error value.

The next step is to choose the best model. We filtered out the overfit models and then compared the performance measures of the remaining models. The smallest misclassification rate, smallest average squared error and largest ROC index in the validation column were used to select the best model.
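The selection logic can be sketched in a few lines: flag the model with the largest train-to-validation gap as overfit, then rank the remaining models on their validation statistics. The metric values below are invented placeholders, not the project's actual fit statistics.

    # Sketch of the model-selection rule applied to hypothetical fit statistics.
    models = {
        "DT": {"train_misc": 0.31, "valid_misc": 0.32, "valid_ase": 0.20, "valid_roc": 0.74},
        "LR": {"train_misc": 0.34, "valid_misc": 0.34, "valid_ase": 0.22, "valid_roc": 0.72},
        "NN": {"train_misc": 0.25, "valid_misc": 0.33, "valid_ase": 0.21, "valid_roc": 0.73},
    }

    # Gap = validation error minus training error; the largest gap suggests overfitting.
    gaps = {name: m["valid_misc"] - m["train_misc"] for name, m in models.items()}
    overfit = max(gaps, key=gaps.get)

    # Among the rest, prefer low misclassification, low ASE and high ROC index (validation).
    candidates = {name: m for name, m in models.items() if name != overfit}
    best = min(candidates, key=lambda n: (candidates[n]["valid_misc"],
                                          candidates[n]["valid_ase"],
                                          -candidates[n]["valid_roc"]))
    print("overfit:", overfit, "| best:", best)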

3.2.5 Data Mining Tools

These tools support the essential process of data mining. Starting with a statistically representative sample of the data, they make it simple to apply exploratory statistical and visualization techniques, select and transform the most relevant predictive factors, model the variables to predict outcomes, and confirm a model's accuracy. To perform our data mining and build a model, we used the tools provided in SAS Enterprise Miner. The tools provided are sample, explore, modify, model and assess, which make up the word SEMMA. These are the functions of each tool:

1. Sample - We sampled our data by creating one or more data tables. The sample extracted a portion of a large data set big enough to contain the significant information, yet small enough to be manipulated quickly. We also created partitioned data sets with the Data Partition node:
• Training -- used for model fitting.
• Validation -- used for assessment and to prevent over fitting.
• Test -- used to obtain an honest assessment of how well a model
generalizes.

2. Explore - We explored our data by searching for unanticipated trends and anomalies to gain understanding and ideas. Exploration helped refine the discovery process.

3. Modify - We modified our data by creating, selecting and transforming the variables to focus the model selection process. We also looked for outliers and reduced the number of variables to narrow them down to the most significant ones.

4. Model - We modelled our data by allowing the software to search automatically for a combination of data that reliably predicts a desired outcome.

5. Assess - We assessed our data by evaluating the usefulness and reliability of the findings from the data mining process and estimating how well the models perform.

CHAPTER 4: DATA ANALYSIS AND FINDINGS

4.1. Decision Tree

4.1.1. Flow in SAS Enterprise Miner

Figure 4.1

4.1.2. Fit Statistics

Figure 4.2

In Figure 4.2, DT CHAID has the largest gap compared to the other models; therefore, it can be concluded that this model is an overfit model. With the lowest validation misclassification rate and average squared error and the largest validation ROC index, DT CART can be concluded to be the best model. As for an underfit model, no model has a smaller error value in the validation set than in the training set.

4.1.3. Confusion Matrix for DT CART

Figure 4.3

TRAIN (rows = actual, columns = predicted)
                  Positive       Negative       Total
Positive          TP = 1953      FN = 1984      3937
Negative          FP = 65        TN = 2596      2661
Total             2018           4580           6598

VALIDATE (rows = actual, columns = predicted)
                  Positive       Negative       Total
Positive          TP = 1277      FN = 1349      2626
Negative          FP = 52        TN = 1723      1775
Total             1329           3072           4401

i. True Positive Rate (TPR)

Train:
TPR = TP / (TP + FN) = 1953 / (1953 + 1984) = 0.4961

The model's ability to predict positive outcomes correctly for train is 0.4961.

Validate:
TPR = TP / (TP + FN) = 1277 / (1277 + 1349) = 0.4863

The model's ability to predict positive outcomes correctly for validate is 0.4863.

Conclusion: The train model has better prediction for positive outcomes.

ii. True Negative Rate (TNR)

Train:
TNR = TN / (TN + FP) = 2596 / (2596 + 65) = 0.9756

The model's ability to predict negative outcomes correctly for train is 0.9756.

Validate:
TNR = TN / (TN + FP) = 1723 / (1723 + 52) = 0.9707

The model's ability to predict negative outcomes correctly for validate is 0.9707.

Conclusion: The train model has better prediction for negative outcomes.

iii. Classification Accuracy

Train:
Accuracy = (TP + TN) / TC = (1953 + 2596) / 6598 = 0.6895

The percentage of correct predictions for the model is 68.95%.

Validate:
Accuracy = (TP + TN) / TC = (1277 + 1723) / 4401 = 0.6817

The percentage of correct predictions for the model is 68.17%.

Conclusion: The train model is better at predicting correct outcomes than the validate model.

iv. Classification Error

Train:
Error = (FP + FN) / TC = (65 + 1984) / 6598 = 0.3105

The percentage of wrong predictions for the model is 31.05%.

Validate:
Error = (FP + FN) / TC = (52 + 1349) / 4401 = 0.3183

The percentage of wrong predictions for the model is 31.83%.

Conclusion: The train model is better than the validate model as it has a slightly lower classification error rate.
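As a quick cross-check, the four training-set measures above can be recomputed directly from the confusion matrix counts reported in Figure 4.3; this is a plain Python sketch, not part of the SAS Enterprise Miner output.

    # Recompute the DT CART training metrics from the Figure 4.3 counts.
    TP, FN, FP, TN = 1953, 1984, 65, 2596

    tpr = TP / (TP + FN)                          # sensitivity (true positive rate)
    tnr = TN / (TN + FP)                          # specificity (true negative rate)
    accuracy = (TP + TN) / (TP + TN + FP + FN)    # proportion of correct predictions
    error = (FP + FN) / (TP + TN + FP + FN)       # proportion of wrong predictions

    print(round(tpr, 4), round(tnr, 4), round(accuracy, 4), round(error, 4))
    # Prints 0.4961 0.9756 0.6895 0.3105, matching the values reported above.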

4.1.4. ROC Chart for DT CART

Figure 4.4
In Figure 4.4, the best point for the training set gives 0.4961 sensitivity (49.61%) and 0.0244 false positive fraction (97.56% specificity), while the best point for the validation set gives 0.4863 sensitivity (48.63%) and 0.0293 false positive fraction (97.07% specificity).

4.1.5. Lift Chart for DT CART

Figure 4.5
From Figure 4.5, we can conclude that in the top 20% of the cases, the training and validation lifts are approximately 1.6759. This means that cases in this top 20% are about 1.6759 times more likely to have the primary outcome than a randomly selected 20% of cases.
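For intuition, cumulative lift at the top 20% can be computed as in the sketch below: sort the cases by model score, take the best-scored 20%, and divide their primary-outcome rate by the overall rate. The target values and scores here are randomly generated stand-ins, not the DT CART scores, so the printed lift will not equal 1.6759.

    # Sketch of computing cumulative lift at the top 20% of cases.
    import numpy as np

    rng = np.random.default_rng(0)
    y = rng.integers(0, 2, size=1000)             # hypothetical binary target (1 = primary outcome)
    scores = y * 0.3 + rng.random(1000)           # hypothetical model scores, loosely related to y

    order = np.argsort(-scores)                   # best-scored cases first
    top = y[order][: int(0.20 * len(y))]          # top 20% of cases

    lift_20 = top.mean() / y.mean()               # response rate in the top 20% vs overall rate
    print(round(lift_20, 4))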

4.1.6. Decision Tree for DT CART

Figure 4.6
Figure 4.6 shows the decision tree for DT CART from SAS Enterprise Miner. From the decision tree, we can conclude that:
i. Most important input: Replacement Discount_Offered
ii. Number of rules: 13
iii. Number of rules predicting target = 1: 8
iv. Number of rules predicting target = 0: 5
v. Size of training data set: 6598
vi. Size of validation data set: 4401

4.2. Logistic Regression

4.2.1. Flow in SAS Enterprise Miner

Figure 4.7
4.2.2. Fit Statistics

Figure 4.8
In Figure 4.8, LR MAIN INTER POLY, LR INTER and LR MAIN INTER have the largest gaps compared to the other models; therefore, it can be concluded that these models are overfit models. With the lowest validation average squared error and the largest validation ROC index, LR MAIN POLY can be concluded to be the best model. As for an underfit model, LR POLY has a smaller value in the validation set than in the training set.

4.2.3. Confusion Matrix for LR MAIN POLY

Figure 4.9

TRAIN (rows = actual, columns = predicted)
                  Positive       Negative       Total
Positive          TP = 2429      FN = 1508      3937
Negative          FP = 748       TN = 1913      2661
Total             3177           3421           6598

VALIDATE (rows = actual, columns = predicted)
                  Positive       Negative       Total
Positive          TP = 1582      FN = 1044      2626
Negative          FP = 462       TN = 1313      1775
Total             2044           2357           4401

i. True Positive Rate (TPR)

Train:
TPR = TP / (TP + FN) = 2429 / (2429 + 1508) = 0.6170

The model's ability to predict positive outcomes correctly for train is 0.6170.

Validate:
TPR = TP / (TP + FN) = 1582 / (1582 + 1044) = 0.6024

The model's ability to predict positive outcomes correctly for validate is 0.6024.

Conclusion: The train model has better prediction for positive outcomes.

ii. True Negative Rate (TNR)

Train:
TNR = TN / (TN + FP) = 1913 / (1913 + 748) = 0.7190

The model's ability to predict negative outcomes correctly for train is 0.7190.

Validate:
TNR = TN / (TN + FP) = 1313 / (1313 + 462) = 0.7397

The model's ability to predict negative outcomes correctly for validate is 0.7397.

Conclusion: The validate model has better prediction for negative outcomes.

iii. Classification Accuracy

Train:
Accuracy = (TP + TN) / TC = (2429 + 1913) / 6598 = 0.6581

The percentage of correct predictions for the model is 65.81%.

Validate:
Accuracy = (TP + TN) / TC = (1582 + 1313) / 4401 = 0.6578

The percentage of correct predictions for the model is 65.78%.

Conclusion: The train model is better at predicting correct outcomes than the validate model.

iv. Classification Error

Train:
Error = (FP + FN) / TC = (748 + 1508) / 6598 = 0.3419

The percentage of wrong predictions for the model is 34.19%.

Validate:
Error = (FP + FN) / TC = (462 + 1044) / 4401 = 0.3422

The percentage of wrong predictions for the model is 34.22%.

Conclusion: The train model is better than the validate model as it has a slightly lower classification error rate.

4.2.4. ROC Chart for LR MAIN POLY

Figure 4.10
In Figure 4.10, the best point for the training set gives 0.6170 sensitivity (61.70%) and 0.281 false positive fraction (71.90% specificity), while the best point for the validation set gives 0.6024 sensitivity (60.24%) and 0.2603 false positive fraction (73.97% specificity).

4.2.5. Likelihood Estimates

Figure 4.11
From the output in Figure 4.11, we can conclude that this model is significant, as the p-value is less than α = 0.05.

Figure 4.12
From Figure 4.12, we can conclude that the fitted logistic equation is:
log(p / (1 - p)) = 6.8352 - 0.0619 Customer_rating(1) - 0.0379 Customer_rating(2)
    + 0.0955 Customer_rating(3) + 0.0260 Customer_rating(4) - 0.0555 Gender(F)
    + 0.0214 Mode_of_Shipment(Flight) - 0.0198 Mode_of_Shipment(Road)
    + 0.1468 Product_importance(high) - 0.0763 Product_importance(low)
    - 0.8686 REP_Customer_care_calls - 0.0916 REP_Discount_offered
    - 0.9119 REP_Prior_purchases - 0.00125 REP_Weight_in_gms
    - 0.0281 Warehouse_block(A) + 0.0482 Warehouse_block(B)
    - 0.0101 Warehouse_block(C) + 0.0340 Warehouse_block(D)
    + 0.0285 (REP_Customer_care_calls)^2
    - 0.0129 REP_Customer_care_calls * REP_Discount_offered
    + 0.0416 REP_Customer_care_calls * REP_Prior_purchases
    + 0.000128 REP_Customer_care_calls * REP_Weight_in_gms
    + 0.0185 (REP_Discount_offered)^2
    + 0.000531 REP_Discount_offered * REP_Prior_purchases
    - 0.00001 REP_Discount_offered * REP_Weight_in_gms
    + 0.0213 (REP_Prior_purchases)^2
    + 0.000112 REP_Prior_purchases * REP_Weight_in_gms
    - 0.0000000157 (REP_Weight_in_gms)^2
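To show how such an equation is used for scoring, the sketch below plugs a few hypothetical input values into a truncated version of the equation (the class-variable effects, most interactions and the squared terms are omitted for brevity) and converts the resulting logit into a probability with the inverse-logit transform. It is an illustration of the mechanics only, not a scoring run of the actual model.

    # Scoring one hypothetical record with a truncated form of the fitted equation.
    import math

    logit = (6.8352
             - 0.8686 * 4          # REP_Customer_care_calls = 4 (hypothetical)
             - 0.0916 * 10         # REP_Discount_offered = 10 (hypothetical)
             - 0.9119 * 3          # REP_Prior_purchases = 3 (hypothetical)
             - 0.00125 * 3000)     # REP_Weight_in_gms = 3000 (hypothetical)

    p = 1.0 / (1.0 + math.exp(-logit))    # probability of the modelled target event
    print(round(p, 4))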

4.2.6. Lift Chart for LR MAIN POLY

Figure 4.13
From Figure 4.13, we can conclude that in the top 20% of the cases, the training and validation lifts are approximately 1.6759. This means that cases in this top 20% are about 1.6759 times more likely to have the primary outcome than a randomly selected 20% of cases.

4.3. Neural Network
4.3.1. Flow in SAS Enterprise Miner

Figure 4.14

4.3.2. Fit Statistics for Neural Network Data Mining Techniques

Figure 4.15

In Figure 4.15, NN20 has the largest gap compared to the other models; therefore, it can be concluded that this model is an overfit model. With the lowest validation misclassification rate and average squared error and the largest validation ROC index, NN3 can be concluded to be the best model. As for an underfit model, no neural network model has a smaller value in the validation set than in the training set.

4.3.3. Confusion Matrix For NN3

Figure 4.16

TRAIN (rows = actual, columns = predicted)
                  Positive       Negative       Total
Positive          TP = 2041      FN = 1896      3937
Negative          FP = 165       TN = 2494      2659
Total             2206           4390           6596

VALIDATE (rows = actual, columns = predicted)
                  Positive       Negative       Total
Positive          TP = 1320      FN = 1306      2626
Negative          FP = 114       TN = 1661      1775
Total             1434           2867           4401

i. True Positive Rate (TPR)

Train:
TPR = TP / (TP + FN) = 2041 / (2041 + 1896) = 0.5184

The model's ability to predict positive outcomes correctly for train is 0.5184.

Validate:
TPR = TP / (TP + FN) = 1320 / (1320 + 1306) = 0.5027

The model's ability to predict positive outcomes correctly for validate is 0.5027.

Conclusion: The train model is better at predicting positive outcomes than the validate model.

ii. True Negative Rate (TNR)

Train:
TNR = TN / (TN + FP) = 2494 / (2494 + 165) = 0.9379

The model's ability to predict negative outcomes correctly for train is 0.9379.

Validate:
TNR = TN / (TN + FP) = 1661 / (1661 + 114) = 0.9358

The model's ability to predict negative outcomes correctly for validate is 0.9358.

Conclusion: The train model is better at predicting negative outcomes than the validate model.

iii. Classification Accuracy

Train:
Accuracy = (TP + TN) / TC = (2041 + 2494) / 6596 = 0.6875

The percentage of correct predictions for the model is 68.75%.

Validate:
Accuracy = (TP + TN) / TC = (1320 + 1661) / 4401 = 0.6773

The percentage of correct predictions for the model is 67.73%.

Conclusion: The train model is better at predicting correct outcomes than the validate model.

iv. Classification Error

Train:
Error = (FP + FN) / TC = (165 + 1896) / 6596 = 0.3125

The percentage of wrong predictions for the model is 31.25%.

Validate:
Error = (FP + FN) / TC = (114 + 1306) / 4401 = 0.3227

The percentage of wrong predictions for the model is 32.27%.

Conclusion: The train model is better than the validate model as it has a lower classification error rate.

4.3.4. ROC Chart For Neural Network

Figure 4.17

In Figure 4.17, the best point for the training set gives 0.5184 sensitivity (51.84%) and 0.0621 false positive fraction (93.79% specificity), while the best point for the validation set gives 0.5027 sensitivity (50.27%) and 0.0642 false positive fraction (93.58% specificity).

4.3.5. Lift Chart For Neural Network

Figure 4.18
The lift chart shows that in the top 20% of the cases, both the training and validation lifts are 1.675933. This means that cases in this top 20% are about 1.675933 times more likely to have the primary outcome than a randomly selected 20% of cases, for both the training and validation sets.

4.3.6. Misclassification Rate

Figure 4.19

The iteration plot shows the misclassification rate versus the optimization iteration. The misclassification error converges near iteration 1, at 0.323431. The vertical blue reference line indicates the iteration with the optimal misclassification rate for the data.

4.4. MODEL COMPARISON

4.4.1. Predictive Modelling of 3 Best Models

Figure 4.20

4.4.2 Fit Statistics for Data Mining Techniques

Figure 4.21

In Figure 4.21, NN3 has the largest gap compared to the other models; therefore, it can be concluded that this model is an overfit model. As for the underfit model, no model has a smaller value in the validation set than in the training set. Thus, we can conclude that the best model is DT CART, which has the lowest validation misclassification rate, the lowest validation average squared error and the largest validation ROC index.

4.4.3 Confusion Matrix for Best Model

Figure 4.22

As calculated previously:
i. DT CART :
TPR (VALID) = 0.4863
TNR (VALID) = 0.9707
ACCURACY (VALID) = 0.6817
ERROR (VALID) = 0.3183

ii. LR MAIN POLY :


TPR (VALID) = 0.3580
TNR (VALID) = 0.7397
ACCURACY (VALID) = 0.5572
ERROR (VALID) = 0.4428

iii. NN3 :
TPR (VALID) = 0.5027
TNR (VALID) = 0.9358
ACCURACY (VALID) = 0.6773
ERROR (VALID) = 0.3227

NN3 is better at predicting the positive target, since its sensitivity value is the largest of the three models. DT CART is better at predicting the negative target, since its specificity value is larger than those of LR MAIN POLY and NN3. DT CART also has the highest accuracy rate and the lowest error rate compared to the other models. Thus, we can conclude that DT CART is the best of the three models (the others being LR MAIN POLY and NN3) for predicting the target outcome for this data set.

4.4.4 ROC Chart of 3 Best Models

Figure 4.23

The ROC chart shows little difference between the three models. This is consistent with the values of the ROC index, which equals the area under the ROC curve. DT CART's ROC index turns out to be the highest, and its curve lies farthest from the diagonal baseline, which indicates that the model is good.

4.4.5 Cumulative Lift Chart of 3 Best Models

Figure 4.24

High values of cumulative lift suggest that the model is doing a good job of separating primary and secondary cases. The lift chart shows that in the top 20% of the cases, the DT CART model lift is the highest of the three models, equal to 1.675933. This means that cases in this top 20% are about 1.675933 times more likely to have the primary outcome than a randomly selected 20% of cases.

4.4.6 Gain Chart of 3 Best Models

Figure 4.25

From the gain chart, DT CART is the best model compared with NN3 and LR MAIN POLY: when contacting the top 20% of the cases, the cumulative response of DT CART is higher than that of NN3 and LR MAIN POLY.

CHAPTER 5 : CONCLUSION

The e-commerce sector has seen unprecedented expansion in the last decade, and a crucial part of this industry is the delivery process. There have been multiple studies assessing how issues in the delivery process affect consumer behaviour. All of them agree that delivery is the most crucial part of e-commerce shipping and must be made as efficient as possible.

In this study, we have found that the most suitable model for predicting the e-commerce shipping data is DT CART. Among the models analysed with the SAS Enterprise Miner software, the decision tree model turns out to be the best-fitting model compared with the neural network and logistic regression models. This is because its validation misclassification rate is the smallest, its validation ASE is the smallest and its validation ROC index is the largest among all the models. This suggests that the decision tree is better at handling categorical values.

REFERENCES
Ghezzi, A., Mangiaracina, R., & Perego, A. (2012). Shaping the e-commerce logistics strategy: A decision framework. SAGE Journals. https://journals.sagepub.com/action/cookieAbsent

Kutz, M. (2016). Introduction to e-commerce: Combining business and information technology. Bookboon.com. https://irp-cdn.multiscreensite.com/1c74f035/files/uploaded/introduction-to-e-commerce.pdf

Morganti, E., Seidel, S., Blanquart, C., Dablanc, L., & Lenz, B. (2014). The impact of e-commerce on final deliveries: Alternative parcel delivery services in France and Germany. ScienceDirect. https://www.sciencedirect.com/science/article/pii/S235214651400297X?via%3Dihub

Paynter, J., & Lim, J. (2001). Drivers and impediments to e-commerce in Malaysia. 6. https://www.researchgate.net/publication/228747258_Drivers_and_Impediments_to_E-commerce_in_Malaysia

Sabaitytė, J., Davidavičienė, V., Straková, J., & Raudeliuniene, J. (2019). Decision tree modelling of e-consumers' preferences for internet marketing communication tools during browsing. E a M: Ekonomie a Management, 22, 206-221. doi:10.15240/tul/001/2019-1-014. https://www.researchgate.net/publication/331736204_Decision_tree_modelling_of_e-consumers'_preferences_for_internet_marketing_communication_tools_during_browsing

Sohrabi Safa, N., Norjihan, A., & Ismail, M. A. (2014). An artificial neural network classification approach for improving accuracy of customer identification in e-commerce. Malaysian Journal of Computer Science, 27, 171-185. https://www.researchgate.net/publication/272162740_An_artificial_neural_network_classification_approach_for_improving_accuracy_of_customer_identification_in_e-commerce
