
2017 27th International Telecommunication Networks and Applications Conference (ITNAC)

User Behavior Analysis Based on User Interest by Web Log Mining
Xipei Luo1,2, Jing Wang1,2, Qiwei Shen1,2, Jingyu Wang1,2, Qi Qi1,2
1 State Key Laboratory of Networking and Switching Technology, Beijing University of Posts and Telecommunications, Beijing, 100876, P.R. China
2 EBUPT Information Technology Co., Ltd., Beijing, 100191, P.R. China

Abstract: With the rapid development of science and technology and the growing popularity of computer networks, the scale of network users is gradually expanding, and the behavior of network users is becoming more and more complicated. A large number of studies show that a user's actual interest is closely related to the user's browsing behavior on web pages. Analyzing browsing behavior yields user interest information, from which a user interest model can be built so that search results come closer to the user's expectations. This paper mainly introduces the method of web log mining, which discovers web page access patterns by mining web log records. By analyzing and exploring the rules in web log records, we can identify the potential customers of a website and improve the quality of information services to users. In the stage of user behavior analysis, this paper explores the differences in user browsing behavior across different types of access events, and calculates the user's interest from the analyzable events based on the M5 model tree.

Keywords: network user behavior; data mining; user interest; M5 model tree

I. INTRODUCTION

With the rapid development of the Internet, basic network services centered on information access and communication have gradually expanded into three major categories of network services: leisure and entertainment, electronic services, and e-commerce. The pages a user visits are closely related to the user's interests: users who care about finance often visit financial sites, and users who like sports often visit sports news sites or sporting goods websites. We can use a user's visit records to mine the user's interest in a topic.

The main purpose of studying users' visit records is to find, from the mining results, what users care about most. By analyzing the time, frequency, and other characteristics of user access to resources, a site can modify its structure and design to retain more customers and serve them better. User behavior analysis has become a new research hotspot. This paper studies data mining technology for user behavior analysis, builds a user interest model from user interest information, and finally derives the user's interest.

The rest of this paper is organized as follows. Section 2 describes related work. Section 3 introduces Web log mining technology. Section 4 introduces the construction of the user's browsing behavior model. Section 5 introduces the user behavior analysis method based on the M5 model. Section 6 presents the experimental analysis of the model. Section 7 concludes the paper.

II. RELATED WORK

Web mining integrates traditional data mining technology with the web. It is the extraction of interesting, potentially useful patterns and hidden information from web documents and web activities; that is, web data mining uses data mining technology to automatically discover and extract information from web documents and services. In general, according to the object being mined, web data mining is divided into three types: web content mining, web structure mining, and web log mining [9].

User interest modeling is the process of deriving a computable user interest model from information that reflects user preferences. Researchers have done a lot of work in this area and have obtained many valuable results. Mariam Daoud and other researchers proposed an ontology-based user interest modeling method that uses a semantic graph [17]. Hochd Jeon proposed a method for adaptive user interest modeling that uses an adaptive updating strategy [18]. Adomavicius and Tuzhilin used data mining methods to mine the access records of individual users; the mined association rules and the personal information from user registration together constitute the user model. Sofia Stamou and Alexandros Ntoulas proposed a method of user interest modeling that analyzes query terms and web page subject information. Researchers such as Paul conducted user interest modeling through the ODP classification system and data information.

The M5 model tree was proposed by Quinlan to solve the problem of continuous value prediction. It is a decision tree that uses linear regression functions at its leaf nodes. By using a standard method to convert the classification problem into a function optimization problem, the final model can be expressed as a piecewise linear function. The M5 algorithm can be divided into model tree construction and linear regression modeling.

The above theory and algorithms let us use web log mining technology to obtain user access behavior data and, on this basis, construct a user behavior model for user interest based on the M5 model tree.
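
To make the "piecewise linear function" view of a model tree concrete, here is a minimal sketch in Python. It is not the authors' implementation: the `Node` class, the `predict` function, and the numbers in `example_tree` are all hypothetical and chosen only for illustration. Internal nodes test one attribute against a split value, and each leaf holds the coefficients of its own linear model.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Node:
    """Hypothetical model-tree node (illustration only).

    A leaf sets only `coef` (intercept first, then one weight per attribute).
    An internal node sets `attr`, `split_value`, `left`, and `right`.
    """
    coef: Optional[Sequence[float]] = None
    attr: Optional[int] = None
    split_value: Optional[float] = None
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def predict(node: Node, x: Sequence[float]) -> float:
    """Walk down to a leaf, then evaluate that leaf's linear model on x."""
    while node.coef is None:
        # Values above the split go right, all others go left.
        node = node.right if x[node.attr] > node.split_value else node.left
    return node.coef[0] + sum(b * v for b, v in zip(node.coef[1:], x))


# Made-up tree with one split on attribute 0 at the value 4.0:
#   x[0] at most 4.0 uses  y = 1.0 + 0.5  * x[0]
#   x[0] above 4.0   uses  y = 4.0 + 0.25 * x[0]
example_tree = Node(
    attr=0,
    split_value=4.0,
    left=Node(coef=[1.0, 0.5]),
    right=Node(coef=[4.0, 0.25]),
)
```

Each region of the input space is served by a different linear function, so the tree as a whole is piecewise linear. The two pieces need not meet at the split point.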

978-1-5090-6796-1/17/$31.00 ©2017 IEEE



III. WEB LOG MINING

Web log mining refers to extracting patterns of interest from Web usage records, and there is currently much research on it. Every WWW server retains an access log that records user access and interaction information. By analyzing and investigating the rules in Web log records, potential Web users can be identified [1].

The overall process, following [6], is shown in Fig. 1.

Fig. 1 Web log mining process

(1) Data preprocessing phase
According to the purpose of mining, the data in the original Web log file are extracted, decomposed, and merged, and finally converted into a user session file. This stage is the most critical stage of Web access information mining. Data preprocessing includes preprocessing of user access information and preprocessing of content and structure.

(2) Session recognition phase
This phase was originally part of the data preprocessing phase. It is treated here as a separate stage because the user session file is divided into groups of user session sequences that are used directly by the mining algorithm, and its precision directly determines the quality of the mining results. It is the most important stage in the mining process.

(3) Pattern discovery phase
Pattern discovery applies a variety of methods and techniques to the Web log data to mine the potential laws and patterns in how users use the Web. It uses algorithms and methods not only from the field of data mining but also from other areas such as machine learning, statistics, and pattern recognition. The main techniques of pattern discovery are statistical analysis, association rules, clustering, classification, sequential patterns, and dependencies.

(4) Pattern analysis phase
Pattern analysis is the final step in Web usage mining. Its main purpose is to filter the rules and patterns generated by the pattern discovery phase, remove the useless ones, and present the remaining patterns visually [4]. Because Web usage mining in most cases digs out all the patterns and rules it can find, some of them are inevitably common sense, trivial, or of no interest to end users. Pattern analysis is therefore needed to make the mined rules and knowledge readable and ultimately understandable. Common pattern analysis methods are graphical and visualization techniques, database query mechanisms, mathematical statistics, and usability analysis [7].

IV. THE USER BEHAVIOR MODEL FOR BROWSING WEB

The basic behavioral event is the most basic data element for user behavior analysis. It can be expressed as a triple BBE = (Name, Timestamp, URL), where BBE represents the basic behavior event collected by the browser plug-in, Name is the name of the behavior event, Timestamp is the timestamp of the event, and URL indicates the page to which the behavior event relates; this last item is optional [8].

The analyzable behavior event set (ABES) is obtained by processing the basic behavior events into data that facilitates user behavior analysis. It is defined as follows:

$$ABES = \{A_1, A_2, A_3, \ldots, A_{n-1}, A_n\}$$

where A_1 to A_n are the user action events. The analyzable event set is the basis for user behavior analysis. These events include page dwell time, mouse clicks, page re-visits, and the number of slider movements [10]. The user's interest indicates the degree to which the user is interested in a page with certain characteristics. To distinguish the user's interest in different web pages, a value must be calculated for each page the user visits; this value is the degree of interest [12]. This paper assumes that if a user visits a web page, the user may be interested in it [13].

The user behavior model stores the user's degree of interest in a page, as predicted from the set of analyzable events of the user's behavior. It can be expressed as {UserID, ABES, Interest}, where UserID identifies the user, ABES is the set of analyzable events of the user's behavior (page dwell time, mouse clicks, page re-visit times, and the number of slider movements), and Interest indicates the user's interest in the page. In this paper, the M5 model tree is used to analyze users' browsing data [5].

V. USER BEHAVIOR ANALYSIS METHOD BASED ON M5 MODEL

In this section, we first define the behavioral events used to describe user behavior characteristics. We then give the process of analyzing user behavior for user interest, and the process of calculating the user's interest in a page based on M5.

A The definition of behavioral events
To facilitate the analysis and induction of user interest from behavior, this paper considers four important factors: the page dwell time, the number of mouse clicks, the number of page visits, and the number of slider movements. The user behavior model requires a numeric value to represent the user's interest in a web page. The analyzable events are defined as follows. The page dwell time is the time from when the user enters a page until the user closes it. The number of mouse clicks counts all operations the user performs with the left and right mouse buttons while browsing the page. Each time the user opens a page, one visit is recorded; when the user repeatedly opens the same page, the count increases, so the number of page visits is the total number of views of the page. The number of slider movements is the sum of all operational events associated with scrolling the page [11].
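
The step from basic behavior events to the four analyzable events can be sketched as follows. This is a minimal illustration under stated assumptions, not the plug-in used in the paper: the event names "enter", "close", "click", and "scroll" are hypothetical, as are the `BBE` class and the `build_abes` function. The output keys reuse the attribute names TimeOnPage, VistTimes, ClickNum, and ScrollNum that the paper gives for the M5 input attributes.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class BBE:
    """Basic behavior event: the (Name, Timestamp, URL) triple."""
    name: str                  # hypothetical names: "enter", "close", "click", "scroll"
    timestamp: float           # seconds; the unit is an assumption
    url: Optional[str] = None  # optional in the paper's definition


def build_abes(events):
    """Aggregate basic events into per-page analyzable events."""
    abes = defaultdict(lambda: {"TimeOnPage": 0.0, "VistTimes": 0,
                                "ClickNum": 0, "ScrollNum": 0})
    entered_at = {}  # url -> timestamp at which the currently open visit began
    for ev in sorted(events, key=lambda e: e.timestamp):
        if ev.url is None:
            continue  # this sketch skips events that carry no URL
        page = abes[ev.url]
        if ev.name == "enter":
            page["VistTimes"] += 1             # every opening counts as a visit
            entered_at[ev.url] = ev.timestamp
        elif ev.name == "close" and ev.url in entered_at:
            # Dwell time runs from entering the page to closing it.
            page["TimeOnPage"] += ev.timestamp - entered_at.pop(ev.url)
        elif ev.name == "click":
            page["ClickNum"] += 1
        elif ev.name == "scroll":
            page["ScrollNum"] += 1
    return dict(abes)
```

For example, a page that is entered at 0 s and closed at 10 s, then entered again at 20 s and closed at 25 s, accumulates a TimeOnPage of 15 s and a VistTimes of 2. Summing dwell time over repeated visits is a design choice of this sketch; the paper does not say how repeated visits are combined.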
An example of an analyzable event set Data is shown in Table 1:

TABLE 1 EXAMPLES OF USER BEHAVIOR EVENT SETS
[Table 1 contents are not recoverable from the source text.]

B User behavior analysis process
The process includes establishing a basic decision tree, establishing linear regression models for the leaf nodes, pruning the model tree, and smoothing the linear regression coefficients [14]. The specific process of calculating user interest based on the M5 model is as follows:

1) Build a basic decision tree: This phase is the core of the M5 model.

1. Calculate the variance Deviation of the class attribute values of the sample matrix Data according to (1), and begin splitting from the root node.

[Equation (1): definition of Deviation; the formula is not recoverable from the source text.]

where c_i is the value of the i-th sample on the class attribute, and N is the sample size.

2. Let data be the matrix of the samples arriving at the current node (data is an n * m matrix). If data has fewer than 4 rows, or the variance of the class attribute values is less than 0.05 * Deviation, the node is a leaf node and splitting ends; otherwise go to step 3.

3. For each input attribute j (j = 1 ... 4), where j denotes one of TimeOnPage, VistTimes, ClickNum, and ScrollNum, sort data in ascending order of attribute j to obtain newdata. The samples of newdata can be divided into two groups according to the value of attribute j, containing p and n-p samples respectively. Compute the split value SplitValue and the variance reduction SDR according to formulas (2) and (3).

SplitValue:

$$\mathrm{SplitValue} = \frac{\mathrm{newdata}(p, j) + \mathrm{newdata}(p+1, j)}{2} \tag{2}$$

where newdata(p, j) is the p-th value of the j-th attribute in newdata and newdata(p+1, j) is the (p+1)-th value of the j-th attribute in newdata.

Variance reduction SDR:

[Equation (3): definition of SDR; the formula is not recoverable from the source text.]

where c_i is the value of the i-th sample on the class attribute, N is the sample size, and P and N-P are the sizes of the two groups into which the samples are divided.

4. Over all groupings of all input attributes, find the maximum variance reduction. The corresponding attribute j and split value SplitValue become the split attribute and split value of the current node.

5. Split the node. The sample matrix data is divided by the split attribute value into a left matrix and a right matrix, which are sent to the left and right branches respectively.

6. Split the left node by the same method, returning to step 2; then treat the right node in the same way.

7. If the current node is a leaf node, it has no splitting attribute, and its linear regression model is the average of the class attribute over the samples arriving at this node. If the current node is not a leaf node, its splitting attribute is used to establish the attributes of the linear regression model of this node.

2) Linear regression modeling. The specific process is as follows [15]:

1. Express the data in the form of equation (4):

$$Y = X\beta + \varepsilon \tag{4}$$

where Y is the column vector of the users' manual interest scores for the pages, ε contains the unobserved random component, X is the matrix of observed values of the regressors, and β is the coefficient vector of X. Once β is determined, the linear regression model is complete.

2. Estimate β according to formula (5):

$$\hat{\beta} = (X^{T}X)^{-1}X^{T}Y \tag{5}$$

$$\mathrm{Interest} = \beta_0 + \beta_1\,\mathrm{TimeOnPage} + \beta_2\,\mathrm{VistTimes} + \beta_3\,\mathrm{ClickNum} + \beta_4\,\mathrm{ScrollNum} \tag{6}$$

We call formula (6) the empirical linear regression equation. Once the user's basic behavior events yield values for the four independent variables, substituting them into formula (6) gives the user's interest in the page.

3) Pruning of trees. This stage prunes the basic decision tree from bottom to top. The concrete steps are:

1. If the current node is a leaf node, no pruning is performed; if the current node is not a leaf node, its left and right branches are pruned, and the process moves to step 2.

2. Linear regression models are fitted to the samples of the current node using some or all of its linear regression attributes. All candidate linear models are traversed, and the one that minimizes the sample error at the current node is selected as the linear regression model of the current node. The error of this model is then compared with the error of the subtree rooted at the node. If the error of the node's linear regression model is smaller, the subtree of the current node is cut, leaving only the current node; otherwise the subtree is kept.

3. If the parent node of the current node is not empty, the parent node becomes the current node and is pruned by repeating step 2. If the parent node of the current node is empty, the pruning ends.

4. Number the leaf nodes of the tree in preparation for smoothing the coefficients.

4) Smoothing of coefficients. This stage smooths the linear regression models of all leaf nodes:

1. The leaf nodes of the tree are smoothed in order of their numbers; the leaf node being processed is taken as the current node.

2. If the parent node of the current node is not empty, the linear regression model of the parent node is used to smooth the linear model of the current leaf node. The attributes of the smoothed model comprise the attributes of the current model of the leaf node together with the attributes of the parent node's model. The parent node then becomes the current node and this step is repeated. If the parent node of the current node is empty, smoothing ends and the smoothed model of the current leaf node is obtained [16].

VI. EXPERIMENT AND RESULTS

A User Behavior Analysis Model Based on M5
Based on the above analysis model, we used all the collected user behavior data to construct a general prediction model based on the M5 model tree. The model obtained is shown in Fig. 2.

Fig. 2 General prediction model

B Experimental data
An experiment was carried out to verify the validity of the behavior-based method. First, the user's browsing behavior is captured while the user browses pages. The recorded information is the basic event set of user behavior events, including web page information and user behavior. Second, the user is asked to mark his or her interest in each page. The experiment compares the interest predicted by the model with the user's real interest, so the user's own marking of interest in each browsed page is needed.

This paper focuses on the analysis of two behavioral events: the page dwell time and the number of re-visits. A self-written browser plug-in was installed in the browsers of 10 PCs in the lab, and user behavior was collected for more than 3000 web pages. The score distribution of the result set is shown in Table 2.

TABLE 2 WEB PAGE'S SCORE RESULT SET

The basic behavioral event sets corresponding to the more than 3,000 pages were converted into analyzable behavior event sets. This paper focuses on two analyzable behavior events, the page dwell time and the number of slider movements.

Fig. 3 User-Dwell times distribution

Fig. 4 User-Re-visit times distribution

C Analysis of results
Once the page interest has been calculated, the model can be evaluated. The accuracy for a page is
$$\mathrm{PagePrc} = 1 - \frac{\left|\mathrm{AcInterest} - \mathrm{Interest}\right|}{\mathrm{AcInterest}} \tag{7}$$

where PagePrc is the accuracy for the page, AcInterest is the user's actual interest in the page, determined by the user's manual score, and Interest is the interest in the page inferred by the user interest analysis model. The accuracy of the user behavior model for a particular user is the average of the page accuracies over the pages the user accessed:

$$\mathrm{Accuracy} = \frac{1}{n}\sum_{i=1}^{n} \mathrm{PagePrc}_i \tag{8}$$

where Accuracy represents the accuracy of the model for the user and n is the number of pages visited by the user.

Users are asked to mark their interest after they have finished browsing. The interest values are given as five grades over 0 ~ 3. The page interest is then calculated by the interest estimation method and compared with the users' labels to obtain the accuracy of the model on each page. Finally, the accuracy of the model for each user is calculated from equations (7) and (8).

Based on all the behavior data collected from every user, the general forecasting model is constructed with the M5 model tree. The accuracy of the model is obtained according to formulas (7) and (8) and is shown in Figure 5.

Fig. 5 The accuracy results

As can be seen from Figure 5, the accuracy of the general forecasting model, averaged over all users, is 65.2%, so the user behavior analysis method based on the M5 model can be applied to the prediction of user interest to a certain extent.

VII. CONCLUSION

User behavior analysis mines user behavior patterns from a large amount of network information by means of data mining. It is a relatively new research field with a wide range of application prospects, and it has become a hot topic among scholars at home and abroad.

This paper studies the behavior analysis method for user interest and the Web log mining method, and puts forward a method for calculating the degree of a user's interest in web pages based on the M5 model. The process of constructing the user behavior model based on M5 is introduced in detail. Finally, we use the collected user behavior data to construct a general model for user interest prediction.

REFERENCES

[1] Wan Fei, Zhao Xi, Liang, et al. Research on Search Engine User Behavior Based on Mobile Internet Log. Chinese Journal of Information, 2014, 28(2): 144-150.
[2] Cen Rongwei, Liu Yiqun, Zhang Min, et al. Search Engine User Behavior Analysis Based on Log Mining. Chinese Journal of Information, 2010, 24(3): 49-54.
[3] Rong Guoting, Luo Yong, Sun Jianjun. Research on Library User's Behavior Based on Log Analysis. Library Journal, 2015(7): 59-63.
[4] N. Ghahreman and M. Sameti, "Comparison of M5 model tree and Artificial Neural Network for estimating Potential Evapotranspiration in semi-arid climates," Department of Irrigation and Reclamation Engineering, University of Tehran, Karaj, Iran, March 2014.
[5] P. Ditthakit and C. Chinnarasri, "Estimation of Pan Coefficient using M5 Model Tree," School of Engineering and Resources, Walailak University, Nakhon Si Thammarat 80160, Thailand, 2012.
[6] Yan Q, Wu L, Zheng L, et al. Social network based microblog user behavior analysis[J]. Physica A: Statistical mechanics and its applications, 2013, 392(7): 1712-1723.
[7] Zhenhua Wang, Lai Tu, Zhe Guo, et al. Analysis of user behaviors by mining large network data sets[J]. Future generations computer systems: FGCS, 2014, 37: 429-437.
[8] Yun Liu, Weiguo Yuan. User Posting Behavior Analysis and Modeling in Microblog[C]// 2014 Tenth International Conference on Intelligent Information Hiding and Multimedia Signal Processing (IIH-MSP 2014), 27-29 August 2014, Kitakyushu, Japan. 2014: 916-919.
[9] Yin B, Zhang Z, Wang X, et al. Research and Application of Data Mining Technology Used in the Analysis of Smart Home User Behavior[C]// Sixth International Conference on Measuring Technology and Mechatronics Automation. IEEE, 2014: 476-479.
[10] Hájek P, Stejskal J. Library user behavior analysis - Use in economics and management[J]. Wseas Transactions on Business & Economics, 2014, 11(1): 107-116.
[11] Jaewon Kim, Paul Thomas, Ramesh Sankaranarayana, et al. Eye-Tracking Analysis of User Behavior and Performance in Web Search on Large and Small Screens[J]. Journal of the Association for Information Science and Technology, 2015, 66(3): 526-544.
[12] Long Chen, Yong-Qing Wang. Forensic Analysis towards the user behavior of Sina microblog[C]// International conference on education technology, management and humanities science: ETMHS 2015, Xi an, China, 21-22 March 2015, Part 2 of 2. 2015: 1167-1171.
[13] Mi Zhang, Christopher C. Yang. Using Content and Network Analysis to Understand the Social Support Exchange Patterns and User Behaviors of an Online Smoking Cessation Intervention Program[J]. Journal of the Association for Information Science and Technology, 2015, 66(3): 564-575.
[14] Behnood Ali, Olek Jan, Glinicki Michal A., et al. Predicting modulus elasticity of recycled aggregate concrete using M5' model tree algorithm[J]. Construction and Building Materials, 2015, 94(Sep.30): 137-147.
[15] Nitha Ayinippully Nalarajan, C. Mohandas. Groundwater Level Prediction using M5 Model Trees[J]. Journal of The Institution of Engineers (India), Series A: Civil, architectural, environmental and agricultural engineering, 2015, 96(1): 57-62.
[16] JIA Ming-ming. New Method for Generating Model Tree[J]. Software Journal, 2008(04): 35-37.
Teevan J, Dumais S T, Horvitz E. (2010). Potential for personalization. ACM. Interact. 17, 1, 1-31.
[17] Jansen B J, Booth D L, Spink A. Determining the informational, navigational, and transactional intent of Web queries[J]. Information Processing & Management, 2008, 44(3): 1251-1266.
[18] Adomavicius G, Tuzhilin A. Using Data Mining Methods to Build Customer Profiles. IEEE Computer, Feb 2001: 74-82.
