IADIS International Conference WWW/Internet 2007

HUMAN WEB BEHAVIOR MINING

Peter Géczy, Noriaki Izumi, Shotaro Akaho, Kôiti Hasida


National Institute of Advanced Industrial Science and Technology (AIST)
1-18-13 Sotokanda, Chiyoda, 101-0021 Tokyo
1-1-1 Umezono, 305-8568 Tsukuba
Japan

ABSTRACT
Extraction, synthesis, and analysis of human behavior in electronic environments are of central importance in web
behavior mining, user modeling, personalization, and recommender systems. We present a novel framework for
elucidating human dynamics in electronic environments. It efficiently captures the spatiotemporal dimensions of human
interactions and enables the extraction and analysis of elemental and complex browsing patterns. The framework was
applied to the exploration of knowledge worker browsing behavior on a large corporate intranet. The analysis revealed a
significant tendency to form repetitive elemental and complex browsing patterns. Knowledge workers exhibited little
exploratory behavior and underutilized the available resources. All analyzed aspects of browsing behavior displayed
evident long tail characteristics.

KEYWORDS
Web behavior mining, behavioral abstractions, navigation space, long tails, knowledge workers.

1. INTRODUCTION
Exploration and analysis of human dynamics in electronic environments are rapidly gaining an eminent position
in web research and data mining. Effective mining and accurate modeling of human web navigation patterns
and their abstractions are of great significance in the engineering and development of recommender systems [1],
collaborative filtering engines [2], behavioral clustering, and personalization [3], [4].
The importance of human web behavior analysis has attracted a wide spectrum of research activity. The body of
reported work includes mining clickstream data of page transitions [5], improving web search ranking
utilizing user behavior information [6], [7], web page traversing using eye-tracking studies and devices [8],
and commercial aspects of user behavior analysis [9], [10]. A substantial portion of the research has been
focused on deriving models of user navigation with predictive capabilities, for designing automated
predictors used in recommender systems, collaborative filtering engines, and pre-caching in web servers.
Statistical approaches have favored Markov models [11]. However, higher-order Markov models
become exceedingly complex and computationally expensive. Computationally less intensive cluster analysis
methods [12] and adaptive learning strategies [13] have scalability drawbacks. Mining only frequent patterns
reduces the computational complexity and improves the speed, but at the expense of substantial data
loss [14]. Further progress and more efficient approaches require a deeper quantitative and qualitative
understanding of human behavior in electronic environments.
In this paper we focus on knowledge worker user populations and elucidate their browsing behavior on
a large corporate intranet. We present a new analytic framework for exploration and modeling. The
framework allows observation of elemental and complex behavioral patterns and abstractions. We provide a
detailed analysis of the characteristics of human behavioral abstractions together with an analysis of the
essential navigation points in the web space.

ISBN: 978-972-8924-44-7 © 2007 IADIS

2. APPROACH FORMULATION
Clickstream sequences [15] of user page transitions are divided into sessions, and sessions are further divided
into subsequences. The division is made with respect to user activity and inactivity. Consider a conventional
time-stamped clickstream sequence of the form $\{(p_i, t_i)\}_i$, where $p_i$ denotes the page (URL) visited at
the time $t_i$. For the purpose of analysis this sequence is converted into the form $\{(p_i, d_i)\}_i$, where
$d_i = t_{i+1} - t_i$ denotes the delay between the consecutive views $p_i \to p_{i+1}$. User browsing activity
$\{(p_i, d_i)\}_i$ is divided into subelements according to the periods of inactivity $d_i$ satisfying certain criteria.
Definition 1 (Browsing Session, Subsequence, Train)
Let $\{(p_i, d_i)\}_i$ be a sequence of pages $p_i$ with delays $d_i$ between consecutive transitions $p_i \to p_{i+1}$.
Browsing session is a sequence $B = \{(p_i, d_i)\}_i$ where each $d_i \le T_B$. The length of the browsing session is $|B|$.
A browsing session is often referred to simply as a session.
Subsequence of an individual browsing session $B$ is a sequence $S = \{(p_i, d_{p_i})\}_i$ where each delay $d_{p_i} \le T_S$,
and $S \subseteq B$. The length of the subsequence is $|S|$.
A browsing session $B = \{(S_i, d_{s_i})\}_i$ thus consists of a train of subsequences $S_i$ separated by the inactivity
delays $d_{s_i}$.
The sessions delineate tasks of various complexities that users undertake in the web environment. The
subsequences correspond to session subgoals; e.g. subsequence $S_1$ is a login, $S_2$ a document download, and
$S_3$ a search for an internal resource.
An important issue is determining the appropriate values of $T_B$ and $T_S$ that segment the user activity into
sessions and subsequences. Earlier research [16] indicated that student browsing sessions last on average
25.5 minutes. However, we adopt the average maximum attention span of 1 hour as the value for $T_B$. If the
user's browsing activity was followed by a period of inactivity greater than 1 hour, it is considered a single
session, and the following activity comprises the next session. The value of $T_S$ is determined dynamically and
computed as the average delay in a browsing session: $T_S = \frac{1}{N} \sum_i d_i$, where $N$ is the number of
delays in the session. If the delays between page views are short, it is useful to limit the value of $T_S$ from
below. This is preferable in environments with frame-based and/or script-generated pages, where numerous log
records are produced in rapid succession. Since our situation contained both cases, we bounded the value of
$T_S$ from below by 30 seconds: $T_S = \max(30, \frac{1}{N} \sum_i d_i)$. Using these primitives we define
navigation space and subspace (Definition 2).
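The segmentation rule described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the in-memory list representation of a clickstream, and the choice to average only the delays inside a session are assumptions made for illustration. Timestamps are assumed to be in seconds.

```python
from statistics import mean

T_B = 3600.0        # session threshold in seconds (1 hour, as in the text)
T_S_FLOOR = 30.0    # lower bound on the subsequence threshold, in seconds


def to_delays(clicks):
    """Convert [(page, timestamp), ...] into [(page, delay_to_next), ...].

    The last page has no successor, so its delay is None.
    """
    out = []
    for (page, t), nxt in zip(clicks, clicks[1:] + [None]):
        out.append((page, None if nxt is None else nxt[1] - t))
    return out


def split_on(items, threshold):
    """Cut a [(page, delay), ...] list after every delay above the threshold."""
    parts, current = [], []
    for page, delay in items:
        current.append((page, delay))
        if delay is None or delay > threshold:
            parts.append(current)
            current = []
    if current:
        parts.append(current)
    return parts


def segment(clicks):
    """Return a list of sessions, each a list of subsequences."""
    sessions = []
    for session in split_on(to_delays(clicks), T_B):
        # Assumption: T_S averages only the delays inside the session.
        inner = [d for _, d in session[:-1]]
        t_s = max(T_S_FLOOR, mean(inner)) if inner else T_S_FLOOR
        # The last delay marks the session boundary, so it is masked out.
        masked = session[:-1] + [(session[-1][0], None)]
        sessions.append(split_on(masked, t_s))
    return sessions
```

For example, `segment([("a", 0), ("b", 5), ("c", 10), ("d", 200), ("e", 205), ("f", 5000)])` yields two sessions; the first is split after page `c`, because the 190-second delay exceeds the session's average delay of 51.25 seconds.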
Definition 2 (Navigation Space and Subspace)
Navigation space is a triplet G = (P, B, S) where P is a set of points (e.g. URLs), B is a set of browsing
sessions, and S is a set of subsequences.
Navigation subspace of G is a space A = (D, H, K) where D is a subset of P, H is a subset of B, and K is
a subset of S.
A navigation space can be divided into subspaces based on the nature of detected or defined sequences.
For example, a human navigation space consists of human generated sequences, and a machine navigation
space may contain only the machine generated sequences. Different spaces may have distinctly different
characteristics.
Another important aspect is to observe where the user actions are initiated and terminated. That is, to
identify the starting and the ending points of the subsequences, as well as the single user actions.
Definition 3 (Starter, Attractor, Singleton)
Let G = (P, B, S) be a navigation space, $B = \{(S_i, d_{s_i})\}_i$ be a browsing session, and
$S = \{(p_k, d_k)\}_k$ be a subsequence.
Starter is the first point of a subsequence or session with length greater than 1, that is $|B| > 1$
or $|S| > 1$.
Attractor is the last point of a subsequence or session with length greater than 1, that is $|B| > 1$
or $|S| > 1$.
Singleton is a point p of a session or subsequence where $|B| = 1$ or $|S| = 1$.
The starters, attractors, and singletons are important points in a navigation space. The starters refer to the
starting navigation points of user actions, whereas the attractors denote the users' targets. The singletons
relate to single user actions such as the use of hotlists (e.g. history or bookmarks) [17]. Note that a single
point p can be a starter, an attractor, and/or a singleton.
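As an illustration of Definition 3, the following Python sketch counts starters, attractors, and singletons over a collection of subsequences. The function name and the representation of a subsequence as a list of page identifiers are assumptions for the example, not part of the original framework.

```python
from collections import Counter


def classify_points(subsequences):
    """Count starters, attractors and singletons over a list of subsequences.

    Each subsequence is a list of page identifiers (delays are not needed here).
    """
    starters, attractors, singletons = Counter(), Counter(), Counter()
    for pages in subsequences:
        if not pages:
            continue
        if len(pages) == 1:
            # A subsequence of length 1 contributes a singleton only.
            singletons[pages[0]] += 1
        else:
            starters[pages[0]] += 1
            attractors[pages[-1]] += 1
    return starters, attractors, singletons
```

With the input `[["home", "search", "doc"], ["home", "news"], ["doc"]]`, the page `doc` is counted once as an attractor and once as a singleton, which shows how one point can play several roles.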
We can formulate behavioral abstractions simply as the pairs of starters and attractors. Then it is equally
important to observe the connecting elements of the transitions from one task (or sub-task) to the other.

Definition 4 (SE Elements, Connectors)
Let $B = \{(S_i, d_{s_i})\}_i$ be a browsing session with consecutive subsequences $S_i \to S_{i+1}$, where
$S_i = \{(p_{i,k}, d_{p_{i,k}})\}_{k=1}^{N}$ and $S_{i+1} = \{(p_{i+1,l}, d_{p_{i+1,l}})\}_{l=1}^{M}$.
SE element (start-end element) of a subsequence $S_i$ is a pair $SE_i = (p_{i,1}, p_{i,N})$.
Connector of subsequences $S_i$ and $S_{i+1}$ is a pair of points $C_i = (p_{i,N}, p_{i+1,1})$.
The SE elements outline the higher order abstractions of user subgoals. Knowing the starting point, users
can follow various navigational pathways to reach the target. Focusing on the starting and ending points of
user actions eliminates the variance of navigational choices. The connectors indicate the links between the
elemental browsing patterns. This enables us to observe formation of more complex behavioral patterns as
interconnected sequences of the elemental patterns.
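Definition 4 can be sketched in a few lines of Python. The function name and list-based representation are hypothetical. Two modelling choices are assumptions rather than statements from the text: subsequences of length 1 are skipped when forming SE elements, and connectors are formed between all consecutive subsequences, including those of length 1.

```python
def se_elements_and_connectors(session):
    """session: list of subsequences, each a non-empty list of page identifiers.

    Returns (SE elements, connectors) as lists of (page, page) pairs.
    """
    # Assumption: only subsequences longer than one page yield an SE element.
    se = [(s[0], s[-1]) for s in session if len(s) > 1]
    # A connector links the last page of S_i to the first page of S_{i+1}.
    connectors = [(a[-1], b[0]) for a, b in zip(session, session[1:])]
    return se, connectors
```

For the session `[["login", "portal"], ["portal", "search", "doc"], ["news"]]` this gives the SE elements `("login", "portal")` and `("portal", "doc")`, and the connectors `("portal", "portal")` and `("doc", "news")`.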

3. DATA AND INTRANET


The case study uses web log data of the National Institute of Advanced Industrial Science and Technology.
The collected data represents a one-year period of users' interactions on a large institutional intranet (basic
information about the data is in Table 1). The user population consisted mainly of knowledge workers:
researchers, assistants, technical staff, and administrative and managerial cadres.
The intranet portal consisted of six servers connected to a high-speed backbone network in a load-balanced
configuration. The web portal provided a wide spectrum of services ranging from administrative processes,
through research support, collaboration, and resource localization, to bulletin boards. Services were
decentralized over the numerous institutional branches in the country. The visible web space was over 1 GB,
whereas the deep space was substantially larger but hard to estimate due to the alternating back-end data.
Table 1. Essential web log data information together with the primary web log statistics.

Data Volume          | ~60 GB          | Log Records         | 315 005 952
Average Daily Volume | ~54 MB          | Clean Log Records   | 126 483 295
Number of Servers    | 6               | Unique IP Addresses | 22 077
Number of Log Files  | 6 814           | Unique URLs         | 3 015 848
Average File Size    | ~9 MB           | Scripts             | 2 855 549
Time Period          | 3/2005 - 4/2006 | HTML Documents      | 35 535
                     |                 | PDF Documents       | 33 305
                     |                 | DOC Documents       | 4 385
                     |                 | Others              | 87 077

There was substantial daily traffic, resulting in a large data volume. It should be pointed out that the
server-side data was subject to partial loss due to proxying and caching. This, however, did not
significantly affect the results of the analysis.

4. NAVIGATION SPACE EXTRACTION


Setup. The computational setup consisted of a Linux server with a MySQL database storage engine. The
database stored both preprocessed and analyzed data in a format and structure optimized for fast
data mining. Data mining and analytic routines were also optimized for high performance and implemented in
various programming languages.
Preprocessing. The initial phase of data preparation encompassed data fusion, cleaning, and
database construction. Separate web logs from the six load-balanced web servers were fused according to the
timestamp and originating IP information. The web log data was contaminated by the intranet's automatic
monitoring software and required cleaning. We also removed log records irrelevant to the target analysis,
such as logs from invalid requests, web graphics, style sheets, and client-side scripts. The data was then
structured and uploaded into the database. The database was further optimized and indexed.
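The fusion and cleaning steps can be sketched as follows. This is an illustrative Python fragment, not the pipeline used in the study: the record layout `(timestamp, ip, url, status)`, the extension list, the treatment of status codes of 400 and above as invalid requests, and the `monitor_ips` parameter are all assumptions.

```python
import heapq
from urllib.parse import urlsplit

# Hypothetical list of extensions treated as irrelevant to the analysis.
SKIPPED_EXTENSIONS = (".gif", ".jpg", ".jpeg", ".png", ".ico", ".css", ".js")


def fuse(*server_logs):
    """Merge per-server record streams, each already sorted by timestamp.

    Records are (timestamp, ip, url, status) tuples; this layout is assumed.
    """
    return heapq.merge(*server_logs, key=lambda r: (r[0], r[1]))


def is_relevant(record, monitor_ips=frozenset()):
    """Keep valid page requests that do not come from monitoring hosts."""
    _, ip, url, status = record
    # Assumption: status codes >= 400 mark invalid requests.
    if ip in monitor_ips or status >= 400:
        return False
    path = urlsplit(url).path.lower()
    return not path.endswith(SKIPPED_EXTENSIONS)
```

Because `heapq.merge` is lazy, the fused stream can be filtered record by record without loading all log files into memory, which matters for a data set of this size.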
The initial filtering process significantly reduced the number of log records, by approximately 59.85%
(from 315 005 952 to 126 483 295; see Table 1-right). The majority of resources on the intranet were accessed
via scripts (94.68%). A substantially smaller percentage of resources consisted of static HTML documents
(1.18%), PDF documents (1.1%), DOC documents (0.15%), and others (2.89%), such as downloadable software,
spreadsheets, syndicated resources, updates, etc. The observed IP address space (22 077 unique IP addresses)
included both static and dynamic IPs. The larger portion of IP addresses was dynamic due to the extensive use
of DHCP.
Table 2. Session data statistics after the initial preprocessing.

Sessions                        | 3 454 243
Unique Sessions                 | 2 704 067
Average Sessions per Day        | 9 464
Average Session Length          | 36 [transitions]
Average Session Duration        | 2 912.23 [s] (48:32)
Average dpi Delay per Session   | 81.5 [s] (1:22)
Average Sessions per IP Address | 156

Session Extraction. Although the logs were in combined log format, the referrer information was not
recorded by the servers, and the clickstream sequences had to be reconstructed based on the IP address and
timestamp information. This was done by temporal ordering of the log records from the same originating IP
address. Reconstructed clickstreams were segmented into sessions wherever the user inactivity period
exceeded 1 hour (Definition 1).
Knowledge worker browsing sessions on the organizational intranet (see Table 2) had a longer average
duration (approx. 48.5 minutes) than the reported student sessions (approx. 25.5 minutes) [16].
The average number of sessions per IP address (156) suggests that the wide use of dynamic IP addressing
substantially contributed to the sparsity of the user-session space. A single physical user produced
clickstreams from various IP addresses. One-to-one association of users with particular IP addresses is
meaningful only in the case of registered static IP addresses.
Human Navigation Space and Subsequence Extraction. Sessions were partitioned into subsequences
using the dynamically calculated threshold TS on the delays dpi between users' page transitions,
bounded from below by 30 seconds. The lower bound has been observed to be quite appropriate.
A verification of the subsequence extraction revealed additional data contamination by machine-generated
subsequences. The histogram of average delays between subsequences (Figure 1-a) shows disproportionately
large peaks around the 30 minute and 1 hour intervals. The detailed view (subcharts of Figure 1-a) exposed an
average delay variation of approximately ±3 seconds. The high correlation of the peak average subsequence
duration (Figure 1-b) with the average delay variation of ±3 seconds suggests that this precision is highly
unlikely to be due to human-generated traffic. It represents machine-generated traffic.

Figure 1. Histograms: a) average delay between subsequences in sessions, b) average subsequence duration. There are
noticeable spikes in chart a) around 1800 seconds (30 minutes) and 3600 seconds (1 hour). The detailed view is displayed
in subcharts.
We filtered two additional groups of machine-generated subsequences: subsequences with delay
periodicity around 30 minutes and 1 hour, and login subsequences. In order to access the available intranet
resources, users are required to log in to the system. The login process includes validation and results in
several log records with 0 delays. The records are uniquely identifiable and can be easily filtered.
Periodically recurring subsequences, those with a periodicity close to 30 minutes and 1 hour,
required further identification. The pool of identified sessions containing the periodic subsequences showed
that only relatively few unique subsequences (170) caused the peaks in Figure 1. Furthermore, the set
of identified subsequences contained only 120 unique URLs. The detected subsequences and URLs were marked as
machine generated and excluded from further analysis.
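A simple way to flag candidates for such periodic traffic is to test whether the delay preceding a subsequence falls within a narrow window around 30 minutes or 1 hour. The Python sketch below is illustrative only: the function names, the pair-based session representation, and the rule of flagging the subsequence that follows a periodic delay are assumptions. The ±3 second tolerance is taken from the variation reported for Figure 1-a.

```python
def near_period(delay, periods=(1800.0, 3600.0), tolerance=3.0):
    """True if delay (seconds) lies within +/- tolerance of one of the periods."""
    return any(abs(delay - p) <= tolerance for p in periods)


def flag_periodic(session):
    """session: list of (subsequence, delay_to_next_subsequence) pairs.

    Returns the set of subsequences (as tuples of pages) that start right
    after a delay close to one of the periods. The final delay may be None.
    """
    flagged = set()
    for (_, delay), (nxt, _) in zip(session, session[1:]):
        if delay is not None and near_period(delay):
            flagged.add(tuple(nxt))
    return flagged
```

In practice the flagged candidates would still need manual inspection, as a human user can occasionally return after almost exactly half an hour.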
Table 3. Essential subsequence data statistics.

Subsequences                     | 7 335 577
Valid Subsequences               | 3 156 310
Filtered Subsequences            | 4 179 267
Unique Subsequences              | 3 547 170
Unique Valid Subsequences        | 1 644 848
Average Subsequences per Session | 3
Average Subsequence Length       | 4.52 [transitions]
Average Subsequence Duration     | 30.68 [s]
Average dsi Delay                | 388.46 [s]

Elimination of machine-generated traffic during the subsequence extraction resulted in a further reduction
of the total number of subsequences (see Table 3), by 56.97% (from 7 335 577 to 3 156 310). Similarly, the
number of unique subsequences was reduced to 46.37% of its original value (from 3 547 170 to 1 644 848), a
reduction of 53.63%. Exclusion of the login subsequences decreased the number of subsequences in the initial
sessions by 1. Filtering of identified invalid URLs contributed to an additional reduction of session and
subsequence lengths. Since the machine-generated subsequences had rapid transitions with almost 0 delays and
durations, the total duration of subsequences was minimally affected. It has been observed that the average
subsequence duration (30.68 seconds) is approximately equal to the chosen lower bound of 30 seconds. This
experimentally supports the choice of the lower bound value.
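The reductions quoted for Table 3 can be checked directly from the tabulated counts. The following worked arithmetic uses only the values in Table 3; the symbol $r$ for the relative reduction is introduced here for illustration.

```latex
r_{\text{all}} = \frac{7\,335\,577 - 3\,156\,310}{7\,335\,577}
               = \frac{4\,179\,267}{7\,335\,577} \approx 0.5697
\qquad
r_{\text{unique}} = \frac{3\,547\,170 - 1\,644\,848}{3\,547\,170}
                  = \frac{1\,902\,322}{3\,547\,170} \approx 0.5363
```

The first value matches the 56.97% reduction in the total number of subsequences. For unique subsequences, the remaining share is $1 - 0.5363 = 0.4637$, so 46.37% is the fraction retained and 53.63% is the fraction removed.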

5. STARTER, ATTRACTOR, AND SINGLETON ANALYSIS


The extracted knowledge worker navigation space is substantially smaller than the complete navigation space.
The unique valid sets of starters (115 770), attractors (288 075), and singletons (57 894) are very small in
comparison to the set of unique URLs in the navigation space (see Table 1 and Table 4).
Table 4. Quantitative statistics for starters, attractors, and singletons.

             | Starters  | Attractors | Singletons
Total        | 7 335 577 | 7 335 577  | 1 326 954
Valid        | 2 392 541 | 2 392 541  | 763 769
Filtered     | 4 943 936 | 4 943 936  | 563 185
Unique       | 187 452   | 1 540 093  | 58 036
Unique Valid | 115 770   | 288 075    | 57 894

Knowledge workers utilized a small set of starting navigation points and targeted a relatively small
spectrum of resources during their browsing. The set of starters, i.e. the initial navigation points of
knowledge workers' (sub-)goals, was approximately 3.84% of the total navigation points. Although the set of
unique attractors, i.e. (sub-)goal targets, was approximately 2.5 times larger than the set of initial
navigation points, it is still a relatively minor portion of the complete URL set (approx. 9.55% of unique URLs).
Knowledge workers perceived few resources as valuable enough to be bookmarked. Single user actions, such as
the use of hotlists [17], followed by delays greater than 1 hour are represented by the singletons. The number
of singletons was minuscule. Unique singletons accounted for only 1.92% of navigation points. If only a small
number of starters and/or attractors was found to be useful, it is possible that they were
bookmarked and accessed directly in subsequent browsing.
Knowledge workers exhibited very little exploratory behavior and had focused browsing interests. A small
set of starters, attractors, and singletons was frequently used. The histograms and quantile characteristics of
starters, attractors, and singletons (see Figure 2) indicate that the higher frequency of occurrences is
concentrated in a relatively small number of elements. Approximately ten starters and singletons, and fifty
attractors were very frequent. About one hundred starters and singletons, and one thousand attractors were
relatively frequent. The quantile analysis (Figure 2) reveals that ten starters (approx. 0.0086% of unique
valid starters) and singletons (approx. 0.017% of unique valid singletons), and fifty frequent attractors
(approx. 0.017% of unique valid attractors) accounted for about 20% of total occurrences. One hundred starters
(approx. 0.086% of unique valid starters) and one thousand attractors (approx. 0.35% of unique valid
attractors) constituted about 45% and 48% of total occurrences, respectively. Analogously, one hundred twenty
singletons (approx. 0.21% of unique valid singletons) amounted to about 37% of total occurrences.
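The kind of quantile figure quoted above, the share of all occurrences covered by the k most frequent elements, can be computed as in the following Python sketch. The function name and the use of a `Counter` of occurrence counts are assumptions for illustration.

```python
from collections import Counter


def top_k_share(counts, k):
    """Fraction of all occurrences covered by the k most frequent items."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return sum(c for _, c in counts.most_common(k)) / total
```

Evaluating `top_k_share` for increasing k gives the cumulative curve of a long tail distribution: a steep initial rise followed by a long, slowly growing tail. The corresponding share of unique elements is simply `k / len(counts)`.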

Figure 2. Histogram and quantile characteristics: a) starters, b) attractors, and c) singletons. Right y-axis contains a
quantile scale. X-axis is in a logarithmic scale.

6. SE ELEMENT AND CONNECTOR ANALYSIS


There was a noticeable reduction of the SE elements and connectors in the knowledge worker navigation
space (see Table 5). The number of SE elements decreased by 67.4% (from 7 335 577 to 2 392 541) and
the number of connectors by 40.63% (from 3 952 429 to 2 346 438). A similar reduction is evident in the number
of unique SE elements (30.37%: from 1 540 093 to 1 072 340) and connectors (21.34%: from 1 142 700 to 898 896).
Table 5. Statistics for SE Elements and Connectors.

             | SE Elements | Connectors
Total        | 7 335 577   | 3 952 429
Valid        | 2 392 541   | 2 346 438
Filtered     | 4 943 936   | 1 605 991
Unique       | 1 540 093   | 1 142 700
Unique Valid | 1 072 340   | 898 896

Frequent users were familiar with their targets and the navigational paths to reach them. The duration of
subsequences in sessions was short, peaking in the interval between two and five seconds (see histogram in
Figure 1-b). During this short time period users were able to navigate through four to five pages (on average)
in order to reach the target (see Table 3). This corresponds to approximately one second per page transition.
Thus the users had virtually no time to browse the page thoroughly. It is reasonable to assume that the
knowledge workers knew where the next navigational point was located on the given page and proceeded directly
there.
A small number of SE elements and connectors was frequently repeated. The histogram and quantile charts in
Figure 3 depict the recurrence of SE elements and connectors. Approximately thirty SE elements and twenty
connectors were very frequent. These thirty SE elements (approx. 0.0028% of unique valid SE elements) and
twenty connectors (approx. 0.0022% of unique valid connectors) accounted for about 20% of total observations.
Knowledge workers formed elemental and complex browsing patterns. The significant repetition of the SE
elements underlines the fact that knowledge workers often initiated their browsing actions from the same
navigation point and often targeted the same resource. This exposes the formation of elemental patterns.
A relatively small number of elemental browsing patterns was frequently repeated. The recurrence of
connectors suggests that after completing a browsing sub-task, by reaching the frequently desired target,
they proceeded to the dominant starting point of the next sub-task(s). Frequently repeating elemental patterns
interlinked with frequent transitions to other often executed elemental sub-tasks highlight the formation of
more complex browsing patterns. Knowledge workers exhibited a spectrum of behavioral diversity despite the
small number of highly repetitive SE elements and connectors.

Figure 3. Histogram and quantile characteristics: a) SE elements, and b) connectors. Right y-axis contains a quantile scale.
X-axis is in a logarithmic scale.

7. CONCLUSIONS
A novel analytic framework for the exploration and modeling of human browsing behavior in electronic
environments has been presented. The framework was applied to the analysis of the browsing behavior of
knowledge workers on a large corporate intranet. The users had diverse browsing styles. Numerous vital
behavioral and usability features have been revealed. The knowledge workers had a significant tendency to
form elemental and complex browsing patterns that were often repeated. The general browsing strategy of
the knowledge workers was to remember the starting point and recall the navigational path to the target.
The knowledge workers effectively utilized only a small portion of the available resources. A large number of
resources were accessed only occasionally. The knowledge workers exhibited little exploratory behavior.

ACKNOWLEDGEMENT
The authors would like to thank Tsukuba Advanced Computing Center (TACC) for providing raw web log
data.

REFERENCES
[1] G. Adomavicius and A. Tuzhilin. Toward the next generation of recommender systems: A survey of the state-of-the-art
and possible extensions. IEEE Transactions on Knowledge and Data Engineering, 17:734–749, 2005.
[2] R. Jin, L. Si, and C. Zhai. A study of mixture models for collaborative filtering. Information Retrieval, 9:357–382, 2006.
[3] M. Eirinaki and M. Vazirgiannis. Web mining for web personalization. ACM Transactions on Internet Technology,
3:1–27, 2003.
[4] R. Baraglia and F. Silvestri. Dynamic personalization of web sites without user intervention. Communications of the
ACM, 50:63–67, 2007.
[5] O. Nasraoui, C. Cardona, and C. Rojas. Using retrieval measures to assess similarity in mining dynamic web clickstreams.
In Proceedings of KDD, pp. 439–448, Chicago, Illinois, USA, 2005.
[6] E. Agichtein, E. Brill, and S. Dumais. Improving web search ranking by incorporating user behavior information. In
Proceedings of The 29th SIGIR, pp. 19–26, Seattle, Washington, USA, 2006.
[7] N. Kammenhuber, J. Luxenburger, A. Feldmann, and G. Weikum. Web search clickstreams. In Proceedings of The 6th
ACM SIGCOMM on Internet Measurement, pp. 245–250, Rio de Janeiro, Brazil, 2006.
[8] L.A. Granka, T. Joachims, and G. Gay. Eye-tracking analysis of user behavior in www search. In Proceedings of The
27th SIGIR, pp. 478–479, Sheffield, United Kingdom, 2004.
[9] Y-H. Park and P.S. Fader. Modeling browsing behavior at multiple websites. Marketing Science, 23:280–303, 2004.
[10] R.E. Bucklin and C. Sismeiro. A model of web site browsing behavior estimated on clickstream data. Journal of
Marketing Research, 40:249–267, 2003.
[11] M. Deshpande and G. Karypis. Selective markov models for predicting web page accesses. ACM Transactions on
Internet Technology, 4:163–184, 2004.
[12] H. Wu, M. Gordon, K. DeMaagd, and W. Fan. Mining web navigations for intelligence. Decision Support Systems,
41:574–591, 2006.
[13] I. Zukerman and D.W. Albrecht. Predictive statistical models for user modeling. User Modeling and User-Adapted
Interaction, 11:5–18, 2001.
[14] J. Jozefowska, A. Lawrynowicz, and T. Lukaszewski. Faster frequent pattern mining from the semantic web. Intelligent
Information Processing and Web Mining, Advances in Soft Computing, pp. 121–130, 2006.
[15] P. Géczy, S. Akaho, N. Izumi, and K. Hasida. Usability analysis framework based on behavioral segmentation. In G.
Psaila and R. Wagner, Eds., Electronic Commerce and Web Technologies, pp. 35–45, Springer-Verlag, Heidelberg,
2007.
[16] L. Catledge and J. Pitkow. Characterizing browsing strategies in the world wide web. Computer Networks and ISDN
Systems, 27:1065–1073, 1995.
[17] M.V. Thakor, W. Borsuk, and M. Kalamas. Hotlists and web browsing behavior: an empirical investigation. Journal of
Business Research, 57:776–786, 2004.
