
2020 IEEE 5th International Conference on Cloud Computing and Big Data Analytics

Online Purchase Behavior Prediction and Analysis Using Ensemble Learning

Xiaotong Dou
College of Statistics and Mathematics
Zhejiang Gongshang University
Hangzhou, Zhejiang, China
e-mail: douxiaotong0803@163.com

Abstract—With the improvement of online transaction systems and online shopping platforms, more and more customers choose to purchase online. However, because customers and merchants cannot communicate face-to-face, merchants know very little about their customers' needs and cannot grasp their intentions in a timely manner. The online system records consumer operations and collects consumer behavior data, making it possible to predict consumers' buying preferences. This article takes real, imbalanced shopping data from an e-commerce platform as the research object and uses the cat-boost model to analyze and predict whether consumers will purchase a certain product. Accuracy, precision and other criteria are reported to evaluate the performance of the prediction. A good result is obtained: the accuracy reaches 88.51% in predicting purchase behavior on this data set.

Keywords-Online Purchase; Cat-boost; Classification

I. INTRODUCTION

Online purchase brings great convenience to consumers and merchants, while effectively reducing the circulation and transaction costs of goods and providing a wider market for both parties to the transaction. From the perspective of the merchant, analyzing consumer purchase intentions and studying the factors that influence purchasing behavior makes it possible to provide consumers with targeted services and advertising recommendations, to identify and locate potential customers, to increase market share and to promote transaction completion. It is of great significance for the sustainable development of e-commerce platforms.

Based on records of user clicks, browsing, and related product purchases, data-driven user behavior preference analysis systems have been constructed and applied to major online shopping platforms. R.J. Kuo proposed using an ART2 neural network and k-means to cluster user behavior patterns, simulate customer browsing paths, and analyze and predict potential customers' shopping preferences [1]. Qiu Jiangtao established a two-stage model: first, the set of candidate products is generated through association rules, and then a support vector machine and a hierarchical Bayesian discrete-choice model are used to determine customer preferences [2]. Cui and Curry compared the support vector machine with the logit model and analyzed the prediction accuracy of the models for sales targets in the marketing business [3]. Schmittlein and Peterson attribute consumer purchase behavior to two independent quantities, the purchase rate and the churn rate, assumed to obey gamma distributions, and on this basis predict customer shopping behavior over time [4]. W. Chen et al. used the classification prediction results of linear and non-linear models to build an ensemble system to predict user behavior [5].

With the development of data mining technology, more and more machine learning models are applied to the prediction of online shopping behavior. For example, the Xg-boost [6] hybrid model, deep neural networks, and the naive Bayes model have achieved good results in consumer behavior prediction. These models can effectively and quickly classify large-scale data sets. However, in actual customer data there are often only a small number of customers with actual consumption behavior, and such imbalance in the data set easily causes over-fitting. We hope to identify and screen effective purchase behavior among a large group of potential customers in order to reduce marketing costs. For such imbalanced data, using cat-boost to identify actual purchasers is a better choice.

The boosting model is an ensemble learning method that reduces bias in supervised learning and effectively improves classification accuracy. The supervised learning bias is reduced through the construction and fusion of prediction functions: the model reduces the loss during each iteration, and the fusion of the base classifiers improves the accuracy. Cat-boost [7] is an open-source gradient boosting algorithm (Dorogush et al., 2018). It allows users to quickly process the categorical features of large data sets and can be used to solve regression, classification and ranking problems. Especially in the prediction of imbalanced data sets, it outperforms earlier models such as Light-GBM and Xg-boost [8]-[11].

The superiority of cat-boost lies in its use of a symmetric tree method, which makes up for the weakness of previous boosting algorithms in processing categorical features, improves the robustness of prediction results, especially for classification and regression on imbalanced data, and reduces the error caused by over-fitting. This article uses a total of 12331 potential-user browsing behavior records, collected for an advertised product on an e-commerce platform, to classify and predict users' purchase operations.

II. CAT-BOOST MODEL

A. Model Expression

Gradient boosting is an effective machine learning method that obtains excellent results in practical situations such as behavior prediction and classification problems with a large number of features, and it also performs well on problems involving heterogeneous features, noisy data and complex explanatory variables.

Cat-boost is a gradient boosting technique proposed by Yandex. It is built on the gradient boosting decision tree and is combined with a logistic regression model. In each iteration, the gradient boosting method modifies the base learners so that the loss function calculated on the same data set continuously decreases, making full use of the data. The method obtains an unbiased estimate of the gradient, which effectively improves the generalization ability of the model, so that it can accurately identify a small group of samples when processing imbalanced data.

The key steps of the model are as follows. The purchase behavior studied in this article is a binary classification problem, so the loss function of the model is

L(y, f(x)) = log(1 + exp(-y f(x)))   (1)

The corresponding negative gradient, whose value lies in the range (-1, 1), is

r_ti = y_i / (1 + exp(y_i f(x_i)))   (2)

An approximate estimate for each tree leaf node is

c_tj = Σ_{x_i ∈ R_tj} r_ti / Σ_{x_i ∈ R_tj} r_ti (1 - r_ti)   (3)

For categorical variables, cat-boost uses the ordered boosting method. Suppose the original order of the categorical records is t = (t_1, t_2, ..., t_n). The whole sequence is traversed in a random order, and the encoded value of the p-th categorical record, t_{p,k}, can be presented as

t_{p,k} = ( Σ_{j=1}^{p} [x_{j,k} = x_{p,k}] · Y_j + a · P ) / ( Σ_{j=1}^{n} [x_{j,k} = x_{p,k}] + a )   (4)

where [·] equals 1 when the condition holds and 0 otherwise. In the above, the prior value P and the parameter a > 0 are used to reduce noise in the data set. In this way, categorical variables that have no order or level relationship can be transformed appropriately.

B. Advantages

Cat-boost outperforms traditional ensemble models in classification prediction problems for the following reasons:

1. No need to pre-process non-numeric variables. During training, the model automatically processes categorical variables, which removes tedious data pre-processing steps and reduces the degree of information loss.

2. Combining variables to generate new variables. When the decision tree is split, a greedy algorithm is used to construct as many new feature combinations as possible. The increase of classification nodes enables the information in the original data set to be fully mined.

3. The symmetric tree alleviates over-fitting. Traditional gradient boosting models use methods such as one-hot encoding to convert categorical variables into numerical variables, which causes deviations between the transformed data distribution and the original data and cannot explain the meaning of the variables in the original data. To overcome this gradient bias, cat-boost proposes an ordered boosting method that reduces the over-fitting of the model, so that the processed data has practical significance and can be interpreted.

III. ONLINE PURCHASE BEHAVIOR PREDICTION

A. Variables Analysis

This data set mainly tracks consumers' responses to advertising recommendations and their browsing of product detail pages, from which consumer behavior preference data are obtained and judgments are made about potential customers' consumption behavior. The data contain a total of 12316 valid records and 17 variables, including one categorical variable. The specific variable information is shown in the following table:

TABLE I. SYMBOLS DESCRIPTION

Variable | Type | Description
Administrative | float64 | Administrative page number
Administrative Duration | float64 | Visit time of administrative page(s)
Informational | float64 | Informational page number
Informational Duration | float64 | Visit time of informational page(s)
Product Related | float64 | Product related page number
Product Related Duration | float64 | Visit time of product related page(s)
Bounce Rate | float64 | Visitors leave without triggering any other requests
Exit Rate | float64 | The percentage that were the last in the session
Page Value | float64 | Average visit value for a page before completing an e-commerce transaction
Special Day | int64 | e.g. Mother's Day, Valentine's Day
Month | float64 | Month (1-12)
Operating Systems | float64 | Purchase system
Browser | int64 | Online browser type
Region | int64 | Customer region
Traffic Type | int64 | Traffic situation on that day
Visitor Type | int64 | Return visitor, first-time visitor or other
Weekend | bool | Weekend day

Special Day refers to the period before a special holiday during which consumers are more likely to complete transactions. The value of this attribute is determined by considering the dynamics of e-commerce, such as the duration between the order date and the delivery date. For example, for Valentine's Day, this value is nonzero between February 2 and February 12, zero before and after this range unless it is close to another special day, and reaches its maximum value of 1 on February 8.

The explained variable is the user's purchase behavior; 0 or 1 is used to represent whether the user purchased the product or not. A cat-boost classifier model is established based on the known purchase behavior data to predict future users' behavior preferences.
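The paper reports that the experiment uses Python 3.6 and the Cat-boost package but does not include code. The following is a minimal sketch of how such a classifier could be set up with the catboost Python package; the file name, column names, target name and hyper-parameter values are illustrative assumptions, not details taken from the paper.

# Minimal training sketch (assumed file/column names and hyper-parameters).
import pandas as pd
from catboost import CatBoostClassifier
from sklearn.model_selection import train_test_split

# Hypothetical export of the browsing-behavior records described in TABLE I.
data = pd.read_csv("online_purchase_records.csv")
X = data.drop(columns=["Purchase"])   # assumed name of the 0/1 target column
y = data["Purchase"]

# Columns treated as categorical; cat-boost encodes them internally
# using its ordered target statistics, cf. equation (4).
cat_features = ["Month", "OperatingSystems", "Browser",
                "Region", "TrafficType", "VisitorType"]

# A stratified split keeps the purchase/non-purchase imbalance in both sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)

model = CatBoostClassifier(
    iterations=500, learning_rate=0.05, depth=6,
    loss_function="Logloss", eval_metric="AUC",
    random_seed=42, verbose=False)
model.fit(X_train, y_train, cat_features=cat_features,
          eval_set=(X_test, y_test))

# Predicted purchase probabilities, later thresholded to 0/1 decisions.
y_prob = model.predict_proba(X_test)[:, 1]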

B. Descriptive Statistics

In the actual online purchase situation, the audience of sales advertisements and product detail pages is wide, and actual buyers account for only a small part of it. The online purchase data set is therefore an imbalanced categorical data set, as shown in the figure below.

Figure 1. Actual purchase situation

From the perspective of the independent variables, consumers who browse the product multiple times or who are reached by advertising tend to be more inclined to purchase, and the number of purchases before holidays is significantly higher than on working days. At the same time, purchase behavior may be affected by other factors such as the operating system used for the purchase, the browser used by the consumer, and the region where the consumer is located.

Figure 2. Possible important influence variables

Cat-boost can sort and score the importance of the data features during model building and automatically select the more valuable features for the model. The important features are therefore selected and remodeled by model fitting, and through feature scoring a qualitative judgment is made on the importance of the features for the classification task.

Figure 3. Feature importance scores for the purchase data

As can be seen from the above figure, the page values of the web pages that potential customers browsed before actual consumption have a great impact on whether or not they ultimately purchase, and consumer behavior also shows strong seasonality, with the month having a large impact on purchases. Special festivals, detail-page browsing, operating systems and browser choices have less influence on the purchase decision for this product.

C. Evaluation Criteria

The evaluation criteria of the model are very important for measuring the final result. In different scenarios, in order to focus on the accuracy of different prediction categories, different indicators such as accuracy, precision and recall need to be used. Accuracy is the proportion of all samples that are correctly classified. Precision is the proportion of samples judged positive that are truly positive. Recall is the proportion of actual positive samples that are identified. Since the probability of an actual purchase is much smaller than the probability of no purchase, in this research we tend to focus on how accurately actual purchases are identified, that is, on precision and recall. At the same time, the accuracy of the model must be higher than the accuracy obtained when all samples are assigned to the majority class, so that the model performs better than this naive decision rule, which is called the Zero-R classifier. The Zero-R classifier provides a reference for classification, ensuring that the prediction accuracy is higher than simply assigning every sample of the explained variable to the category with the largest proportion.

TABLE II. CONFUSION MATRIX

 | Positive | Negative
True | TP | TN
False | FP | FN
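To make the relationship between these four counts and the criteria defined in equations (5)-(10) below concrete, here is a small illustrative sketch; the counts are placeholders, not the paper's results.

# Evaluation criteria computed from confusion-matrix counts (placeholder values).
TP, TN, FP, FN = 1600, 9000, 250, 300

accuracy  = (TP + TN) / (TP + TN + FP + FN)
precision = TP / (TP + FP)
recall    = TP / (TP + FN)
fpr       = FP / (FP + TN)          # false positive rate
fnr       = FN / (TP + FN)          # false negative rate
f1        = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.4f}, precision={precision:.4f}, recall={recall:.4f}, "
      f"FPR={fpr:.4f}, FNR={fnr:.4f}, F1={f1:.4f}")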

In binary classification problems, the threshold chosen for the decision directly affects the accuracy of the classification results obtained. Determining the decision threshold from the ROC-AUC curve and the FPR-FNR curve is relatively intuitive, and the threshold can be set according to more general rules when the model is deployed. The ROC curve is the receiver operating characteristic curve, which reflects how the classification sensitivity of the model changes at different thresholds. AUC denotes the area under the ROC curve: the horizontal axis is the false positive rate FPR, which represents the probability of assigning a truly negative sample to the positive class, the vertical axis is the true positive rate, and the AUC is the area enclosed by the ROC curve and the x-axis. The FPR-FNR curve shows the probability of misclassifying positive or negative samples at different thresholds, focusing on the misjudged results.

According to the requirements of the binary prediction of the online purchase data and the imbalanced characteristics of the samples, AUC-ROC and F1 were selected to judge the quality of the model. At the same time, the Zero-R accuracy is used as a baseline to check the validity of the model. The evaluation criteria involved above are calculated as follows:

accuracy = (TP + TN) / (TP + TN + FP + FN)   (5)

precision = TP / (TP + FP)   (6)

recall = TP / (TP + FN)   (7)

FPR = FP / (FP + TN)   (8)

FNR = FN / (TP + FN)   (9)

F1 = 2 * precision * recall / (precision + recall)   (10)

ZeroR = correctly classified instances = 68.65%   (11)

Among them, F1 is the weighted harmonic mean of precision and recall; its value range is F1 ∈ [0, 1], and the model is optimal when F1 equals 1. The experiment is implemented with Python 3.6 and the Cat-boost package.

The original test data set contains a total of 12,316 valid records, of which 10303 are non-purchase records (the majority class) and 1889 are actual purchase records (the minority class). The confusion matrix of the actual prediction result is as follows:

TABLE III. CONFUSION MATRIX FOR PURCHASE PREDICTION

 | Positive | Negative
True | 9098 | 1214
False | 217 | 1672

The AUC-ROC and FPR-FNR curves of the model on the test data set, after tuning the parameters, are shown below:

Figure 4. AUC-ROC curve of purchase prediction

Figure 5. FPR-FNR curve of purchase prediction

In order to compare the prediction results obtained with different threshold values, thresholds between 0 and 0.3 are examined at intervals of 0.1, and the corresponding accuracy and F1 are calculated. The results are shown in the following table:

TABLE IV. ACCURACY AND F1 AT DIFFERENT THRESHOLDS

threshold | 0.1 | 0.2 | 0.3
Accuracy | 0.4533 | 0.8851 | 0.8742
F1 | 0.6739 | 0.9060 | 0.8305

So we choose 0.2984 as the decision threshold, and the final prediction accuracy is 88.51%; at this threshold the prediction performance is optimal. At the same time, the precision for effective purchases is high, which helps merchants carry out targeted sales. The evaluation indicators under this threshold are as follows:

TABLE V. EVALUATION PARAMETERS FOR PREDICTION

Evaluation parameter | Value (%)
Accuracy | 88.51
Precision | 97.67
Recall | 84.48
FPR | 15.16
FNR | 15.52
F1 | 90.60
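The paper reports only the grid points in TABLE IV and the final operating threshold of 0.2984. A threshold scan of this kind could be carried out on the predicted probabilities roughly as sketched below; this is an illustration rather than the paper's procedure, it reuses the hypothetical y_test and y_prob from the earlier training sketch, and the grid and selection rule are assumptions.

# Threshold scan over predicted purchase probabilities (illustrative only).
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# y_test, y_prob: true labels and predicted probabilities from the earlier sketch.
results = []
for t in np.arange(0.05, 0.55, 0.05):
    y_pred = (y_prob >= t).astype(int)
    results.append((t, accuracy_score(y_test, y_pred), f1_score(y_test, y_pred)))

for t, acc, f1 in results:
    print(f"threshold={t:.2f}  accuracy={acc:.4f}  F1={f1:.4f}")

# One possible rule: keep the threshold that maximizes F1 on the test data.
best_t = max(results, key=lambda r: r[2])[0]
print("selected threshold:", best_t)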

IV. CONCLUSIONS

Based on the potential-user behavior preference data of a certain product, this paper uses the cat-boost classification algorithm, which is well suited to imbalanced data sets, to identify the actual purchasing users, and obtains an accuracy of 88.51% and a recall of 84.48%. The model effectively reduces the over-fitting problems common when fitting imbalanced data through symmetric trees, and adopts a more principled and interpretable encoding method for categorical variables, which reduces the information loss during model building and improves the robustness of the model. An effective customer identification model based on the cat-boost classification algorithm is of great significance for merchants in making targeted recommendations, designing personalized recommendation systems, and reducing marketing costs. This paper performs classification prediction from the perspective of imbalanced data; the prediction accuracy, especially the recognition of the minority class, still needs to be improved. In the future, in-depth research can be carried out on predicting purchases across multiple categories of products and on making real-time predictions and personalization based on users' browsing preferences.

REFERENCES

[1] Kuo R J, Liao J L, Tu C. Integration of ART2 neural network and genetic K-means algorithm for analyzing Web browsing paths in electronic commerce[J]. Decision Support Systems, 2005.
[2] Qiu J, Lin Z, Li Y. Predicting customer purchase behavior in the e-commerce context[J]. Electronic Commerce Research, 2015.
[3] Cui D, Curry D. Prediction in Marketing Using the Support Vector Machine[M]. INFORMS, 2005.
[4] Schmittlein D C, Peterson R A. Customer Base Analysis: An Industrial Purchase Process Application[J]. Marketing Science, 1994.
[5] Chen W, Li Z, Zhang M. Linear and Non-Linear Models for Purchase Prediction[C]// The 2015 International ACM Recommender Systems Challenge. ACM, 2015.
[6] Chen T, Tong H, Benesty M. xgboost: Extreme Gradient Boosting[J]. 2016.
[7] Dorogush A V, Ershov V, Gulin A. CatBoost: gradient boosting with categorical features support[J].
[8] Pornpimon Kachamas, Suphamongkol Akkaradamrongrat, Sukree Sinthupinyo, and Achara Chandrachai, "Application of Artificial Intelligent in the Prediction of Consumer Behavior from Facebook Posts Analysis," International Journal of Machine Learning and Computing, vol. 9, 2019.
[9] Alexiei Dingli, Vincent Marmara, and Nicole Sant Fournier, "Comparison of Deep Learning Algorithms to Predict Customer Churn within a Local Retail Industry," International Journal of Machine Learning and Computing, vol. 7, 2017.
[10] O. Ayad and M. Syed-Mouchaweh, "Multiple Classifiers Approach based on Dynamic Selection to Maximize Classification Performance," International Journal of Machine Learning and Computing, vol. 1, no. 2, pp. 154-162, 2011.
[11] Haitao Yu, "Classification Performance Comparison of Feature Vectors Based on Summation Scheme and Maximization Scheme," International Journal of Machine Learning and Computing, vol. 1, 2011.
