Human Web Behaviour Mining
ABSTRACT
Extraction, synthesis, and analysis of human behavior in electronic environments are of central importance in web
behavior mining, user modeling, personalization, and recommender systems. We present a novel framework for
elucidating human dynamics in electronic environments. It efficiently captures the spatiotemporal dimensions of human
interactions and enables the extraction and analysis of elemental and complex browsing patterns. The framework was
applied to explore knowledge worker browsing behavior on a large corporate intranet. The analysis revealed a significant
tendency to form repetitive elemental and complex browsing patterns. Knowledge workers exhibited little
exploratory behavior and underutilized the available resources. All analyzed aspects of browsing behavior displayed
evident long tail characteristics.
KEYWORDS
Web behavior mining, behavioral abstractions, navigation space, long tails, knowledge workers.
1. INTRODUCTION
Exploration and analysis of human dynamics in electronic environments are rapidly gaining an eminent position
in web research and data mining. Effective mining and accurate modeling of human web navigation patterns
and their abstractions are of great significance in the engineering and development of recommender systems [1],
collaborative filtering engines [2], behavioral clustering, and personalization [3], [4].
The importance of human web behavior analysis has attracted a wide spectrum of research activity. The body of
reported work includes mining clickstream data of page transitions [5], improving web search ranking
using user behavior information [6], [7], studying web page traversal with eye-tracking devices [8],
and commercial aspects of user behavior analysis [9], [10]. A substantial portion of the research has
focused on deriving models of user navigation with predictive capabilities, for designing automated
predictors used in recommender systems, collaborative filtering engines, and pre-caching in web servers.
Statistical approaches have favored Markov models [11]. However, higher-order Markov models
become exceedingly complex and computationally expensive. Computationally less intensive cluster analysis
methods [12] and adaptive learning strategies [13] have scalability drawbacks. Mining only frequent patterns
reduces the computational complexity and improves the speed, but at the expense of substantial data
loss [14]. Further progress and more efficient approaches require a deeper quantitative and qualitative
understanding of human behavior in electronic environments.
In this paper we focus on knowledge worker user populations and elucidate their browsing behavior on
a large corporate intranet. We present a new analytic framework for exploration and modeling. The
framework allows observation of elemental and complex behavioral patterns and abstractions. We provide a
detailed analysis of the characteristics of human behavioral abstractions together with an analysis of the
essential navigation points in the web space.
ISBN: 978-972-8924-44-7 © 2007 IADIS
2. APPROACH FORMULATION
Clickstream sequences [15] of user page transitions are divided into sessions, and sessions are further divided
into subsequences. Division is done with respect to the user activity and inactivity. Consider the conventional
time-stamp clickstream sequence of the following form: {(p_i, t_i)}_i, where p_i denotes the visited page URL_i at
the time t_i. For the purpose of analysis this sequence is converted into the form {(p_i, d_i)}_i, where d_i denotes the
delay between the consecutive views p_i → p_{i+1}. User browsing activity {(p_i, d_i)}_i is divided into subelements
according to the periods of inactivity d_i satisfying certain criteria.
Definition 1 (Browsing Session, Subsequence, Train)
Let {(p_i, d_i)}_i be a sequence of pages p_i with delays d_i between consecutive transitions p_i → p_{i+1}.
A browsing session is a sequence B = {(p_i, d_i)}_i where each d_i ≤ T_B. The length of the browsing session is |B|.
A browsing session is often referred to simply as a session.
A subsequence of an individual browsing session B is a sequence S = {(p_i, d_i)}_i where each delay d_i ≤ T_S,
and {(p_i, d_i)}_i is a subset of B. The length of the subsequence is |S|.
A browsing session B = {(S_i, ds_i)}_i thus consists of a train of subsequences S_i separated by the inactivity
delays ds_i.
The sessions delineate tasks of various complexities that users undertake in the web environment. The
subsequences correspond to session subgoals; e.g. subsequence S_1 is a login, S_2 a document download, and
S_3 a search for an internal resource.
An important issue is determining the appropriate values of T_B and T_S that segment the user activity into
sessions and subsequences. Earlier research [16] indicated that student browsing sessions last on average
25.5 minutes. However, we adopt the average maximum attention span of 1 hour as the value of T_B. If the
user's browsing activity is followed by a period of inactivity greater than 1 hour, that activity is considered a single
session, and the following activity comprises the next session. The value of T_S is determined dynamically and
computed as the average delay in a browsing session: T_S = (1/N) ∑_i d_i, where N is the number of delays in the
session. If the delays between page views are short, it is useful to limit the value of T_S from below. This is
preferable in environments with frame-based and/or script-generated pages, where numerous log records are written in
rapid succession. Since our situation contained both cases, we bounded the value of T_S from below by 30 seconds:
T_S = max(30, (1/N) ∑_i d_i). Using these primitives we define navigation space and subspace as follows.
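Before turning to those definitions, the segmentation rule described above can be sketched in code. This is a minimal illustration and not the authors' implementation: the function names, the input format (a time-ordered list of (page, timestamp-in-seconds) pairs for one user), and the choice to average only the delays that fall inside a session are assumptions made for the example.

```python
from statistics import mean

T_B = 3600.0       # session threshold in seconds (1 hour, as in the text)
T_S_FLOOR = 30.0   # lower bound on the subsequence threshold (as in the text)


def to_delays(clicks):
    """Turn [(page, timestamp), ...] into [(page, delay_to_next_page), ...].

    The final page has no successor, so its delay is None.
    """
    pairs = [(page, t_next - t)
             for (page, t), (_, t_next) in zip(clicks, clicks[1:])]
    if clicks:
        pairs.append((clicks[-1][0], None))
    return pairs


def split_on_delay(pairs, threshold):
    """Cut a (page, delay) sequence after every delay above the threshold."""
    chunks, current = [], []
    for page, delay in pairs:
        current.append((page, delay))
        if delay is None or delay > threshold:
            chunks.append(current)
            current = []
    if current:
        chunks.append(current)
    return chunks


def subsequence_threshold(session):
    """T_S = max(30, mean delay), using only delays inside the session.

    Assumption: the delay that closes the session is not part of the average.
    """
    inner = [d for _, d in session[:-1]]
    return max(T_S_FLOOR, mean(inner)) if inner else T_S_FLOOR


def segment(clicks):
    """Return a list of sessions, each a list of subsequences of pages."""
    result = []
    for session in split_on_delay(to_delays(clicks), T_B):
        t_s = subsequence_threshold(session)
        result.append([[p for p, _ in s]
                       for s in split_on_delay(session, t_s)])
    return result


clicks = [("a", 0), ("b", 2), ("c", 4), ("d", 100), ("e", 102), ("f", 10000)]
print(segment(clicks))  # [[['a', 'b', 'c'], ['d', 'e']], [['f']]]
```

In the example the first session has inner delays 2, 2, 96 and 2 seconds, whose mean (25.5) falls below the 30 second floor, so the 96 second pause separates two subsequences.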
Definition 2 (Navigation Space and Subspace)
A navigation space is a triplet G = (P, B, S) where P is a set of points (e.g. URLs), B is a set of browsing
sessions, and S is a set of subsequences.
A navigation subspace of G is a space A = (D, H, K) where D is a subset of P, H is a subset of B, and K is a
subset of S.
A navigation space can be divided into subspaces based on the nature of detected or defined sequences.
For example, a human navigation space consists of human generated sequences, and a machine navigation
space may contain only the machine generated sequences. Different spaces may have distinctly different
characteristics.
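Definition 2 maps naturally onto a small data structure. The sketch below is one possible representation, assuming that points are strings, subsequences are tuples of points, and sessions are tuples of subsequences; the class and method names are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NavigationSpace:
    """G = (P, B, S) from Definition 2, stored as three frozen sets."""
    points: frozenset        # P: navigation points, e.g. URLs
    sessions: frozenset      # B: each session is a tuple of subsequences
    subsequences: frozenset  # S: each subsequence is a tuple of points

    def is_subspace_of(self, other):
        """True if every component is a subset of the matching one in other."""
        return (self.points <= other.points
                and self.sessions <= other.sessions
                and self.subsequences <= other.subsequences)


full = NavigationSpace(
    points=frozenset({"a", "b", "c"}),
    sessions=frozenset({(("a", "b"), ("c",))}),
    subsequences=frozenset({("a", "b"), ("c",)}),
)
human = NavigationSpace(
    points=frozenset({"a", "b"}),
    sessions=frozenset(),
    subsequences=frozenset({("a", "b")}),
)
print(human.is_subspace_of(full))  # True
```

A human or machine navigation subspace would then be obtained by selecting the sequences attributed to the respective source and wrapping them in the same structure.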
Another important aspect is to observe where the user actions are initiated and terminated. That is, to
identify the starting and the ending points of the subsequences, as well as the single user actions.
Definition 3 (Starter, Attractor, Singleton)
Let G = (P, B, S) be a navigation space, B = {(S_i, ds_i)}_i be a browsing session, and S = {(p_k, d_k)}_k be a
subsequence.
A starter is the first point of a subsequence or session with length greater than 1, that is |B| > 1
or |S| > 1.
An attractor is the last point of a subsequence or session with length greater than 1, that is |B| >
1 or |S| > 1.
A singleton is a point p of a session or subsequence where |B| = 1 or |S| = 1.
The starters, attractors, and singletons are important points in a navigation space. The starters refer to the
starting navigation points of user actions, whereas the attractors denote the users’ targets. The singletons
relate to the single user actions such as use of hotlists (e.g. history or bookmarks) [17]. Note that a single
point p can be starter, attractor, and/or singleton.
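Definition 3 can be illustrated with a short counting routine. The sketch below is a simplification that applies the definition at the subsequence level only, and it assumes that each subsequence is given as a list of page identifiers; the function name is hypothetical.

```python
from collections import Counter


def classify_points(subsequences):
    """Count starters, attractors and singletons over a list of subsequences.

    Each subsequence is a list of pages. Subsequences of length 1 contribute
    a singleton; longer ones contribute a starter (first page) and an
    attractor (last page).
    """
    starters, attractors, singletons = Counter(), Counter(), Counter()
    for pages in subsequences:
        if not pages:
            continue  # ignore empty input rather than fail
        if len(pages) == 1:
            singletons[pages[0]] += 1
        else:
            starters[pages[0]] += 1
            attractors[pages[-1]] += 1
    return starters, attractors, singletons


subsequences = [["home", "search", "doc"], ["home", "news"], ["doc"]]
starters, attractors, singletons = classify_points(subsequences)
print(starters["home"], attractors["doc"], singletons["doc"])  # 2 1 1
```

The page "doc" appears both as an attractor and as a singleton here, which matches the remark that a single point can play several roles.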
We can formulate behavioral abstractions simply as the pairs of starters and attractors. Then it is equally
important to observe the connecting elements of the transitions from one task (or sub-task) to the other.
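The pairing described above can be sketched as follows. The starter-attractor pair of a subsequence is taken directly from the text. The connecting element is an assumption inferred from the description: it is modeled here as the pair formed by the last point of one subsequence and the first point of the next subsequence in the same session. Function names and the input format (a session as a list of page lists) are hypothetical.

```python
def starter_attractor_pairs(session):
    """(starter, attractor) for every subsequence longer than one page."""
    return [(s[0], s[-1]) for s in session if len(s) > 1]


def connecting_pairs(session):
    """(last page of S_i, first page of S_{i+1}) for consecutive subsequences.

    Assumption: this pairing is inferred from the prose, not quoted from it.
    """
    nonempty = [s for s in session if s]
    return [(a[-1], b[0]) for a, b in zip(nonempty, nonempty[1:])]


session = [["home", "search", "doc"], ["portal", "news"], ["doc"]]
print(starter_attractor_pairs(session))  # [('home', 'doc'), ('portal', 'news')]
print(connecting_pairs(session))         # [('doc', 'portal'), ('news', 'doc')]
```

Counting these pairs over all sessions, for instance with collections.Counter, gives the frequency data needed for a long tail analysis of the abstractions.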
IADIS International Conference WWW/Internet 2007
smaller percentage of resources comprised of static HTML documents (1.18%), PDF documents (1.1%),
DOC documents (0.15%), and others (2.89%), such as downloadable software, spreadsheets, syndicated
resources, updates, etc. The observed IP address space (22077 unique IP addresses) included both static and
dynamic IPs. A larger portion of the IP addresses was dynamic owing to extensive DHCP use.
Table 2. Session data statistics after the initial preprocessing.
Figure 1. Histograms: a) average delay between subsequences in sessions, b) average subsequence duration. There are
noticeable spikes in chart a) around 1800 seconds (30 minutes) and 3600 seconds (1 hour). The detailed view is displayed
in subcharts.
We filtered two additional groups of machine generated subsequences: subsequences with delay
periodicity around 30 minutes and 1 hour, and login subsequences. In order to access available intranet
resources, users are required to log in to the system. The login process includes validation and results in
several log records with zero delays. These records are uniquely identifiable and can be easily filtered.
Periodically recurring subsequences, those with a periodicity close to 30 minutes and 1 hour,
required further identification. The pool of identified sessions containing the periodic subsequences revealed
that only relatively few unique subsequences (170) caused the peaks in Figure 1. Furthermore, the set
of identified subsequences contained only 120 unique URLs. The detected subsequences and URLs were marked as
machine generated and excluded from further analysis.
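One way to shortlist candidates for such periodic, machine-generated subsequences is sketched below. This is not the procedure used in the study, whose details are not given here: the tolerance around each period, the input format (a session as a list of (pages, delay-to-next-subsequence) pairs), and the function names are all assumptions.

```python
from collections import Counter

PERIODS = (1800.0, 3600.0)  # spike locations reported in the text (seconds)
TOLERANCE = 30.0            # hypothetical half-width around each spike


def is_periodic(delay, periods=PERIODS, tolerance=TOLERANCE):
    """True if the delay lies within the tolerance of any listed period."""
    return any(abs(delay - p) <= tolerance for p in periods)


def periodic_candidates(session):
    """Count subsequences that are followed by a near-periodic delay.

    session: list of (pages, delay_to_next_subsequence); the last delay
    may be None because nothing follows the final subsequence.
    """
    counts = Counter()
    for pages, delay in session:
        if delay is not None and is_periodic(delay):
            counts[tuple(pages)] += 1
    return counts


session = [(["/ping"], 1800.5), (["/ping"], 1799.0),
           (["/home", "/doc"], 12.0), (["/ping"], None)]
print(periodic_candidates(session))  # Counter({('/ping',): 2})
```

Subsequences that recur often in this count would still need manual inspection before being marked as machine generated.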
Table 3. Essential subsequence data statistics.
starters, attractors, and singletons (see Figure 2) indicate that occurrences are concentrated
in a relatively small number of elements. Approximately ten starters and singletons, and fifty attractors, were
very frequent. About one hundred starters and singletons, and one thousand attractors, were relatively
frequent. The quantile analysis (Figure 2) reveals that ten starters (approx. 0.0086% of unique valid starters),
ten singletons (approx. 0.017% of unique valid singletons), and fifty attractors (approx. 0.017% of
unique valid attractors) each accounted for about 20% of total occurrences. One hundred starters (approx. 0.086% of
unique valid starters) and one thousand attractors (approx. 0.35% of unique valid attractors) constituted about
45% and 48% of total occurrences, respectively. Analogously, one hundred twenty singletons (approx. 0.21%
of unique valid singletons) accounted for about 37% of total occurrences.
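The kind of quantile figure quoted above, the share of all occurrences covered by the k most frequent elements, can be computed from a frequency table as in the following sketch. The function names and the toy counts are illustrative assumptions and are not data from the study.

```python
from collections import Counter


def top_k_share(counts, k):
    """Fraction of all occurrences covered by the k most frequent items."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return sum(n for _, n in counts.most_common(k)) / total


def top_k_unique_fraction(counts, k):
    """Fraction of the distinct items that the top k represent."""
    return min(k, len(counts)) / len(counts) if counts else 0.0


# Toy frequency table: five distinct starters, 100 occurrences in total.
counts = Counter({"a": 50, "b": 30, "c": 10, "d": 5, "e": 5})
print(top_k_share(counts, 1))            # 0.5
print(top_k_share(counts, 2))            # 0.8
print(top_k_unique_fraction(counts, 1))  # 0.2
```

In a long tail distribution a very small unique fraction yields a large share, which is the pattern the percentages above describe.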
Figure 2. Histogram and quantile characteristics: a) starters, b) attractors, and c) singletons. Right y-axis contains a
quantile scale. X-axis is in a logarithmic scale.
               SE Elements    Connectors
Total            7 335 577     3 952 429
Valid            2 392 541     2 346 438
Filtered         4 943 936     1 605 991
Unique           1 540 093     1 142 700
Unique Valid     1 072 340       898 896
Frequent users were familiar with their targets and the navigational paths to reach them. The duration of
subsequences in sessions was short, peaking in the interval between two and five seconds (see the histogram in
Figure 1-b). During this short period users were able to navigate through four to five pages (on average)
in order to reach the target (see Table 3). This corresponds to approximately one second per page transition. Thus
the users had virtually no time to browse a page thoroughly. It is reasonable to assume that the knowledge
workers knew where the next navigational point was located on the given page and proceeded directly there.
A small number of SE elements and connectors were frequently repeated. The histogram and quantile charts in
Figure 3 depict the recurrence of SE elements and connectors. Approximately thirty SE elements and twenty
connectors were very frequent. These thirty SE elements (approx. 0.0028% of unique valid SE elements) and
twenty connectors (approx. 0.0022% of unique valid connectors) accounted for about 20% of total observations.
Knowledge workers formed elemental and complex browsing patterns. The significant repetition of SE
elements underlines the fact that knowledge workers often initiated their browsing actions from the same
navigation point and often targeted the same resource. This exposes the elemental pattern formation.
A relatively small number of elemental browsing patterns were frequently repeated. Re-occurrence of
connectors suggests that after completing a browsing sub-task, that is, after reaching the frequently desired target,
the knowledge workers proceeded to the dominant starting point of the next sub-task(s). Frequently repeating elemental
patterns, interlinked with frequent transitions to other often-executed elemental sub-tasks, highlight the formation of
more complex browsing patterns. Knowledge workers exhibited a spectrum of behavioral diversity despite the small
number of highly repetitive SE elements and connectors.
Figure 3. Histogram and quantile characteristics: a) SE elements, and b) connectors. Right y-axis contains a quantile scale.
X-axis is in a logarithmic scale.
7. CONCLUSIONS
A novel analytic framework for exploration and modeling of human browsing behavior in electronic
environments has been presented. The framework was applied to the analysis of knowledge worker browsing
behavior on a large corporate intranet. The users had diverse browsing styles. Numerous vital
behavioral and usability features have been revealed. The knowledge workers had a significant tendency to
form elemental and complex browsing patterns that were often reiterated. Their general browsing strategy
was to remember the starting point and recall the navigational path to the target.
The knowledge workers effectively utilized only a small amount of the available resources. A large number of
resources were accessed only occasionally. The knowledge workers exhibited little exploratory behavior.
ACKNOWLEDGEMENT
The authors would like to thank Tsukuba Advanced Computing Center (TACC) for providing raw web log
data.
REFERENCES
[1] G. Adomavicius and A. Tuzhilin. Toward the next generation of recommender systems: A survey of the state-of-the-art
and possible extensions. IEEE Transactions on Knowledge and Data Engineering, 17:734–749, 2005.
[2] R. Jin, L. Si, and C. Zhai. A study of mixture models for collaborative filtering. Information Retrieval, 9:357–382, 2006.
[3] M. Eirinaki and M. Vazirgiannis. Web mining for web personalization. ACM Transactions on Internet Technology,
3:1–27, 2003.
[4] R. Baraglia and F. Silvestri. Dynamic personalization of web sites without user intervention. Communications of the
ACM, 50:63–67, 2007.
[5] O. Nasraoui, C. Cardona, and C. Rojas. Using retrieval measures to assess similarity in mining dynamic web
clickstreams. In Proceedings of KDD, pp. 439–448, Chicago, Illinois, USA, 2005.
[6] E. Agichtein, E. Brill, and S. Dumais. Improving web search ranking by incorporating user behavior information. In
Proceedings of The 29th SIGIR, pp. 19–26, Seattle, Washington, USA, 2006.
[7] N. Kammenhuber, J. Luxenburger, A. Feldmann, and G. Weikum. Web search clickstreams. In Proceedings of The 6th
ACM SIGCOMM on Internet Measurement, pp. 245–250, Rio de Janeiro, Brazil, 2006.
[8] L.A. Granka, T. Joachims, and G. Gay. Eye-tracking analysis of user behavior in www search. In Proceedings of The
27th SIGIR, pp. 478–479, Sheffield, United Kingdom, 2004.
[9] Y-H. Park and P.S. Fader. Modeling browsing behavior at multiple websites. Marketing Science, 23:280–303, 2004.
[10] R.E. Bucklin and C. Sismeiro. A model of web site browsing behavior estimated on clickstream data. Journal of
Marketing Research, 40:249–267, 2003.
[11] M. Deshpande and G. Karypis. Selective markov models for predicting web page accesses. ACM Transactions on
Internet Technology, 4:163–184, 2004.
[12] H. Wu, M. Gordon, K. DeMaagd, and W. Fan. Mining web navigations for intelligence. Decision Support Systems,
41:574–591, 2006.
[13] I. Zukerman and D.W. Albrecht. Predictive statistical models for user modeling. User Modeling and User-Adapted
Interaction, 11:5–18, 2001.
[14] J. Jozefowska, A. Lawrynowicz, and T. Lukaszewski. Faster frequent pattern mining from the semantic web. Intelligent
Information Processing and Web Mining, Advances in Soft Computing, pp. 121–130, 2006.
[15] P. Géczy, S. Akaho, N. Izumi, and K. Hasida. Usability analysis framework based on behavioral segmentation. In G.
Psaila and R. Wagner, Eds., Electronic Commerce and Web Technologies, pp. 35–45, Springer-Verlag, Heidelberg,
2007.
[16] L. Catledge and J. Pitkow. Characterizing browsing strategies in the world wide web. Computer Networks and ISDN
Systems, 27:1065–1073, 1995.
[17] M.V. Thakor, W. Borsuk, and M. Kalamas. Hotlists and web browsing behavior–an empirical investigation. Journal of
Business Research, 57:776–786, 2004.