
Testing the Suitability of Markov Chains as Web Usage Models

Zhao Li and Jeff Tian


Southern Methodist University, Dallas, Texas, USA
E-mail: {lizhao, tian}@engr.smu.edu
Abstract
Markov chains have been used to model web usage and have served as the basis for statistical testing, performance evaluation, and reliability analysis. However, most such applications of Markov chains were carried out without answering the question: Can web usage be accurately modeled by Markov chains? In this paper, we propose a set of tests, which are easy to perform based on information about web link usage frequencies gathered from web server logs, to answer this question. We applied this approach to our university's web site, and our results indicate that Markov chains can provide fairly accurate models of web usage.
Keywords: Web usage and quality assurance, Markov chain, memoryless property, statistical testing.

1. Introduction
With the prevalence of the world wide web and people's reliance on it in society today, ensuring satisfactory performance and reliability of web servers and web sites is becoming increasingly important. Because of the user focus and the large size of the web, exhaustive testing techniques suitable for individual web pages or small web sites, such as various link checkers, and traditional coverage-based testing techniques need to be modified or used selectively to remain practical or feasible. A good candidate for effective, large-scale web testing and quality assurance is statistical testing and related reliability analysis [5, 10]. Because of the close resemblance between web applications and the state transition mechanism in Markov chains [3], statistical testing based on actual usage patterns and frequencies captured in Markov chains is a natural choice for web testing and quality assurance [2].
To test the suitability of Markov chains to model web usage, we gathered information about web link usage frequencies from web logs. In particular, we propose to use a small set of tests to check the conformance of these actual usage frequencies to the so-called memoryless property that all Markov chains satisfy. We applied this approach to the official web site for the School of Engineering and Applied Science at Southern Methodist University. The results provided us with empirical evidence that Markov chains can provide fairly accurate models of web usage.

2. Markov Chains as Web Usage Models

Markov chains form an important subclass of stochastic processes that can be used to simplify the analysis of many complex systems [3]. Recently, they have been used in statistical testing and quality assurance for web applications [2]. Related failure observations can be fed to various models for reliability assessment or to identify problems for focused reliability improvement [5, 8].

2.1. Markov chains for computer based systems

With the increased demand for and reliance on computers and information technology for service, functionality, and automation in today's society, many software and computer based systems, such as the Internet, embedded computer control systems, and various software systems, are getting larger, more complex, and increasingly ubiquitous. There is an urgent need for appropriate models and techniques that can be used to characterize such systems and to assure their satisfactory performance and reliability.
A logical approach to deal with such complex systems is to decompose them into functional entities consisting of units or subsystems and then connect them according to the system architecture. However, because of the large size, increased complexity, and the complex interconnections and interactions among numerous units, deterministic or exhaustive methods that attempt to cover system components or functions become increasingly impractical or infeasible. Effective testing and quality assurance for such systems requires that we 1) produce simplified models that still reflect the essential characteristics of the systems being studied, and/or 2) provide selective coverage based on actual usage scenarios and frequencies. Models based on Markov chains fulfill both these requirements.

Proceedings of the 27th Annual International Computer Software and Applications Conference (COMPSAC03)
0730-3157/03 $ 17.00 2003 IEEE

In Markov chains for such a system, each node or state represents a functional or structural unit (unit of work or a system component). The states are interconnected to reflect the functional or structural interconnections in the system, and the state transition probabilities tell us how likely such transitions are to take place. An operational sequence consisting of visits to multiple states can be constructed by following the state transitions. The likelihood of a particular sequence can also be easily calculated as the product of its individual state transition probabilities. Therefore, Markov chains can be used to ensure performance and reliability based on usage scenarios and frequencies by target customers.
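
The product rule for sequence likelihood can be illustrated with a minimal sketch. The state names and probabilities below are hypothetical values chosen for illustration only; they are not taken from the models in this paper, and the helper name `sequence_probability` is likewise an assumption.

```python
from math import prod

# Hypothetical transition probabilities for a tiny usage model.
# State names and numbers are illustrative assumptions, not values from the paper.
TRANSITIONS = {
    "home": {"courses": 0.5, "programs": 0.3, "exit": 0.2},
    "courses": {"home": 0.4, "exit": 0.6},
    "programs": {"home": 0.7, "exit": 0.3},
}


def sequence_probability(states, transitions):
    """Return the product of the transition probabilities along `states`.

    A transition that is absent from the model contributes probability 0.
    """
    steps = zip(states, states[1:])
    return prod(transitions.get(src, {}).get(dst, 0.0) for src, dst in steps)


path = ["home", "courses", "home", "programs", "exit"]
p = sequence_probability(path, TRANSITIONS)
print(round(p, 4))  # 0.5 * 0.4 * 0.3 * 0.3
```

A sequence that uses a link missing from the model evaluates to zero, which is one simple way to flag navigation paths the model cannot generate.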

2.2. Markov chains for web applications

For web applications, it is easy to abstract the set of states and state transitions. The behavior of web users can be considered as "hits and goes": A web user visits a web page, browsing or obtaining necessary information, and then clicks a link and goes to another page. Therefore, individual pages or collections of related pages can be treated as individual states in the corresponding Markov chain. Usage or browsing patterns as well as the likelihood for these patterns to be followed are captured in the state transition probabilities of the Markov chain [6].
Recently, we developed unified Markov models (UMMs) for web applications as well as for general software systems [2]. Our UMMs capture information about execution flow, information flow, workload creation, handling, and termination, and associated probabilistic usage information. This information is represented in our UMMs as a set of hierarchical Markov chains that can be used to support statistical testing, performance evaluation, and reliability analysis.
As an example, web navigation and usage for the web pages of the Dept. of Computer Science and Engineering at Southern Methodist University (SMU/CSE) were modeled by a UMM in Figure 1. This top level Markov chain represents the high level operational units (states), and associated connections (transitions) and usage probabilities (numerical quantities written beside the links). Various sub-operations may be associated with an individual state, and could be modeled by more detailed models (not shown) by expanding the state. This hierarchical structure and the associated flexibility that can be tailored to multi-purpose applications set our approach apart from earlier approaches to statistical testing using Markov chains such as in [10].

Figure 1. Markov chain for SMU/CSE pages (states include CSE home, Courses, Programs, Other info, and Exit; transition probabilities label the links)


2.3. Memoryless or Markovian property

Key to the above usages of Markov chains is the simplification based on the so-called memoryless property, or the Markovian property, which states that the state transitions from a given state depend only on the current state, but not the history or how we reached that particular state [3]. When formally stated, the probability for someone X (a web user, in this case) to be at the future state $s_{n+1} = j$ at time period $n+1$ is uniquely determined by the current state $s_n = i$ at time period $n$ and the state transition probability $p_{ij}$, i.e.,
\[
P\{X_{n+1} = s_{n+1} \mid X_n = s_n, X_{n-1} = s_{n-1}, \ldots, X_0 = s_0\} = P\{X_{n+1} = s_{n+1} \mid X_n = s_n\} = p_{ij}
\]
Most users of Markov models either implicitly or explicitly assume the memoryless property without much justification, except sometimes pointing out that this simplification is necessary to keep the system analysis tractable. In the analysis of computer systems and networks and in related performance evaluation using Markov chains, some practical evidence of the memoryless property as well as cross validation of analysis results by simulation or actual measurement has been provided in various studies [9]. However, no such evidence or cross validation has been provided for Markov models used in statistical web testing and quality assurance. We attempt to provide such evidence in this paper.

3. Validating Markovian Property

If we could obtain history-dependent state transition probabilities, we could then compare them to the history-independent ones in Markov chains. A close conformance between the two would validate the use of Markov chains to model web usage.

3.1. General approach and steps

In order to compare the history-dependent (non-Markovian) and history-independent (memoryless or Markovian) state transition probabilities, we can use the following general procedure:


129.120.10.122 - - [16/Aug/1999:07:44:20 -0500] "GET / HTTP/1.1" 200 10787 "http://www.smu.edu/academics.html" "Mozilla/4.0 (compatible; MSIE 5.0; Windows NT; DigExt)"
...
129.120.10.122 - - [16/Aug/1999:07:44:43 -0500] "GET /ee/ HTTP/1.1" 200 5868 "http://www.seas.smu.edu/" "Mozilla/4.0 (compatible; MSIE 5.0; Windows NT; DigExt)"

Table 1. Sample entries in an access log


1) Obtain the history-independent state transition probabilities as we would normally do for Markov chains for web applications.
2) Obtain the history-dependent state transition probabilities as conditional probabilities based on previous visits or navigation sequences.
3) Compare the different sets of state transition probabilities obtained above.
Based on this procedure, we can conclude that web usage can be accurately modeled by Markov chains if the above probability sets are similar. Otherwise, the memoryless property does not hold, and models other than Markov chains should be considered. The ways to obtain these sets of state transition probabilities are described next.

3.2. History-independent transition probabilities


In our previous research on statistical web testing and
quality assurance, we have developed an approach to construct Markov chains based on information extracted from
various web logs routinely kept at web servers [2]. We can
use this procedure to construct our normal Markov chains
as the basis to test the validity of the memoryless property
in web usage. The basic technique and procedure are summarized below.
Every user access to a web page, or a hit, is logged as a separate entry in the web server's access log. Sample entries from an access log for www.seas.smu.edu, the official web site for the School of Engineering and Applied Science, Southern Methodist University (SMU/SEAS), which uses the Apache Web Server [1], are given in Table 1. Information recorded at web servers typically includes the following: the requesting computer, the date and time of the request, the file that the client requested (the referred page), the HTTP status code, the size of the requested file, the referring URL (the referrer), and the client name.
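
To make the field list concrete, here is a minimal sketch of how one entry in the layout of Table 1 could be split into those fields. The regular expression and the name `parse_entry` are illustrative assumptions; they are not the tooling used in this study, and real logs may need a more tolerant pattern.

```python
import re
from datetime import datetime

# Pattern for the log layout shown in Table 1:
# host, identity, user, [timestamp], "request", status, size, "referrer", "agent".
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)


def parse_entry(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    entry = match.groupdict()
    # Convert the timestamp (with its UTC offset) and the status code.
    entry["time"] = datetime.strptime(entry["time"], "%d/%b/%Y:%H:%M:%S %z")
    entry["status"] = int(entry["status"])
    return entry


# Second sample entry from Table 1.
sample = (
    '129.120.10.122 - - [16/Aug/1999:07:44:43 -0500] "GET /ee/ HTTP/1.1" 200 5868 '
    '"http://www.seas.smu.edu/" "Mozilla/4.0 (compatible; MSIE 5.0; Windows NT; DigExt)"'
)
entry = parse_entry(sample)
print(entry["host"], entry["path"], entry["referrer"])
```

The referrer and the requested path together give the referring-referred page pair that the rest of the approach relies on.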
First, the components that make up the system and their connections are identified from web sources. Various related individual web pages may be grouped into a single state in the Markov chain. The associated hit information, particularly the referring-referred page pairs, after proper processing, gives us the history-independent state transition probabilities.
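
As a rough sketch of that last step, the referring-referred pairs can be counted per source state and normalized. The function name and the sample pairs are hypothetical, and the mapping from URLs to states is assumed to have been done already.

```python
from collections import Counter, defaultdict


def transition_probabilities(pairs):
    """Estimate p_ij from (referring state, referred state) pairs."""
    counts = defaultdict(Counter)
    for src, dst in pairs:
        counts[src][dst] += 1
    # Normalize each row of counts by the total number of transitions out of src.
    return {
        src: {dst: n / sum(dsts.values()) for dst, n in dsts.items()}
        for src, dsts in counts.items()
    }


# Hypothetical hit pairs, already mapped from URLs to state names.
pairs = [
    ("home", "ee"),
    ("home", "cse"),
    ("home", "ee"),
    ("home", "me"),
    ("ee", "home"),
]
probs = transition_probabilities(pairs)
print(probs["home"])
```

Each row of the result sums to one, so it can be read directly as the outgoing transition distribution of a state.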

Our approach to Markov chain construction based on web logs is similar to the use of data mining techniques on web logs for web site evaluation in [7]. However, the navigation patterns and frequencies in [7] (data mining results in tree structures) form a loose collection of results not connected to the web site structure. The focus therein is on achieving business goals or design intentions. In contrast, the primary goal of our web usage measurement is to construct integrated models, our Markov chains for statistical testing and quality assurance.

3.3. History-dependent transition probabilities


In this research, we used information about referring pages and referred pages to trace the hit history in order to validate the Markovian property for web applications. However, it would be extremely difficult to trace the complete history. In addition, such complete traces would result in very few data points for statistical determination of transition probabilities. Therefore, we examined the more limited dependency between 1) the transition from the current state to the next state, and 2) the previous state. This can be interpreted as an approximate validation of the Markovian property, i.e., a check of whether the state transitions are independent of the immediate past history instead of the complete history. Formally stated, we want to check if the following is true:
\[
P\{X_{n+1} = s_{n+1} \mid X_n = s_n, X_{n-1} = s_{n-1}\} = P\{X_{n+1} = s_{n+1} \mid X_n = s_n\} = p_{ij}
\]
To establish state transition probabilities that are dependent on such immediate history, we need to establish the connection from one referring page, say A, to another page B, and from B to a third page C. If both the A → B and B → C referral pairs are requested by the same user, as identified by the unique IP address in our access logs, within a small time window, we can interpret them as belonging to a single reference sequence within the same session. Otherwise, they would be treated as unrelated referrals.
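
The pairing rule above can be sketched as follows. This is one possible reading of the rule, not the study's actual implementation: each B → C hit is matched to the most recent A → B hit from the same IP address, the two-hour default for `window` mirrors the session cutoff adopted in this study, and the function name, the tuple layout, and the extra sample visitors are hypothetical.

```python
from collections import Counter, defaultdict
from datetime import datetime, timedelta

SESSION_WINDOW = timedelta(hours=2)


def conditional_probabilities(hits, window=SESSION_WINDOW):
    """Estimate P(next = C | current = B, previous = A).

    `hits` holds (ip, time, referrer, page) tuples. A hit A -> B followed by
    a hit B -> C from the same IP within `window` counts as one sequence.
    """
    last_seen = {}  # (ip, page B) -> (previous page A, time of the A -> B hit)
    counts = defaultdict(Counter)  # (A, B) -> counts of each C
    for ip, time, referrer, page in sorted(hits, key=lambda h: h[1]):
        earlier = last_seen.get((ip, referrer))
        if earlier is not None and time - earlier[1] <= window:
            counts[(earlier[0], referrer)][page] += 1
        last_seen[(ip, page)] = (referrer, time)
    # Normalize the C counts separately for every (A, B) pair.
    return {
        key: {c: n / sum(cs.values()) for c, n in cs.items()}
        for key, cs in counts.items()
    }


t0 = datetime(1999, 8, 16, 7, 44, 20)
academics = "www.smu.edu/academics.html"
hits = [
    # The two sample entries of Table 1, with URLs reduced to short names.
    ("129.120.10.122", t0, academics, "/"),
    ("129.120.10.122", t0 + timedelta(seconds=23), "/", "/ee/"),
    # Hypothetical extra visitors.
    ("10.0.0.1", t0, academics, "/"),
    ("10.0.0.1", t0 + timedelta(minutes=1), "/", "/cse/"),
    ("10.0.0.2", t0, academics, "/"),
    ("10.0.0.2", t0 + timedelta(hours=3), "/", "/ee/"),  # outside the window
]
cond = conditional_probabilities(hits)
print(cond[(academics, "/")])
```

Identifying users by IP address alone is the simplification stated in the text; hits that arrive through a shared proxy would be merged under this rule.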
For example, the first and the last sample entries in Table 1 originated from the same user within a short time period. These two entries form a reference sequence from the academics page
www.smu.edu/academics.html
to the SMU/SEAS homepage
www.seas.smu.edu/
and then to the EE Dept page
www.seas.smu.edu/ee/
In what follows, we use 2 hours as the session cutoff, similar to that defined in [4]. Because most of our web pages are static ones, they do not require a shorter session timeout such as that used for E-commerce.
Once such reference sequences, from A to B, and then from B to the various Cs, are established following the procedure above, we can count the number of such 2-legged sequences. We can then normalize them among all the outlinks (Cs) from B that originated from A to obtain the one-step history-dependent state transition probabilities.

Rank  Page                         #hits
1     /index.html                  19418
2     /justin/inline.html           5140
3     /justin/inline m.rail.html    4733
4     /ce/index.html                4013
5     /co/cams/index.html           3472
6     /ce/seas/index.html           2890
7     /co/cams/index.html           2571
8     /ce/smu/index.html            2422
9     /cse/index.html               2257
10    /netech/index.html            2230

Table 2. Top-hit web pages


4. A Case Study
The above approach for validating the Markovian property for web applications was applied in a case study.
We used web logs covering 26 days of data from the
School of Engineering and Applied Science at Southern Methodist University (SMU/SEAS) web site at
www.seas.smu.edu.

4.1. Candidate page selection


Since we intended to compare the different sets of state transition probabilities as outlined in the previous section, we would like to have reliable estimates for these probabilities. As stated before, these probabilities are estimated from relative reference counts for individual links. Therefore, the larger the reference counts, the more confidence, or statistical validity, we can attach to the estimated state transition probabilities. Thus, the first step in choosing our study samples is to examine the reference counts and to select the top hit pages to serve as reference sources (outlinks) or destinations (inlinks).
Table 2 gives the top hit pages, or the most popular pages, in the SMU/SEAS web site. From Table 2, we can see that the No. 1 top-hit page, the SMU/SEAS homepage
www.seas.smu.edu/index.html
is associated with many more hits than any other page. Therefore, we selected this page and its related inlinks and outlinks as the objects for our validation study.

Rank  Page                               #hits  pij
1     /cse/index.html                     1258  0.165
2     /ce/seas/index.html                 1145  0.150
3     /ee/index.html                      1140  0.150
4     /disted/index.html                   547  0.072
5     /gradadmissions/index.html           433  0.057
6     /gradinfo.html                       314  0.041
7     /ecommerce/index.html                305  0.040
8     /me/index.html                       281  0.037
9     /infodata.html                       239  0.031
10    /ugradinfo.html                      221  0.029
11    /textonly.html                       188  0.025
12    /hear.htm                            170  0.022
13    /students.html                       159  0.021
14    /disted/ship/index.html              150  0.020
15    /building.html                       137  0.018
16    /env/index.html                      132  0.017
17    /co/index.html                       118  0.016
18    /seasnews.html                       116  0.015
19    /contactseas.html                    115  0.015
20    /recruit/index.html                  109  0.014
      15 outlinks with ref counts < 100    334  0.044
      total over all 35 outlinks          7611  1

Table 3. Top outlinks from /index.html

4.2. History-independent transition probabilities


Among the 35 outlinks present in the SMU/SEAS homepage, the most popular ones are shown in Table 3, which gives the total reference counts by all web users for each link followed after visiting the SMU/SEAS homepage. When normalized by the total outlink reference count for all pages from this page, 7611 in this case, we get the state transition probabilities pij from the SMU/SEAS homepage i to page j for the corresponding outlink, as in the corresponding Markov chain. This set of probabilities {pij} gives us the history-independent transition probabilities from the SMU/SEAS homepage, against which we will compare other history-dependent state transition probabilities.
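
As a worked instance of this normalization using the counts in Table 3 (the symbol $n_{ij}$ for the reference count of the outlink from page i to page j is notation introduced here for illustration):
\[
p_{ij} = \frac{n_{ij}}{\sum_k n_{ik}}, \qquad p_{i,\mathrm{/cse/index.html}} = \frac{1258}{7611} \approx 0.165, \qquad p_{i,\mathrm{/ee/index.html}} = \frac{1140}{7611} \approx 0.150
\]
The denominator is the sum of the 20 listed counts (7277) plus the 334 references to the 15 less-used outlinks, which gives 7611.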
Notice that the total inlink reference count from a specific inlink often does not agree with the total outlink reference count from the corresponding inlink-current link pair. For example, the total inlink count of 19418 for the SMU/SEAS homepage is different from its total outlink count of 7611. There may be several reasons for this disagreement, including:
1) A user may terminate his/her web browsing activity after hitting the current page, use the bookmarks, or directly type in a URL, resulting in no additional outlinks originating from the current page being recorded in the access log or being counted.
2) A user may follow one outlink, then use the back navigation button on the browser, and then follow another outlink. In this case, the action of following the back button is not recorded in the web log, because the local cache is typically used without requesting the loading of a web page from the server side. This situation would result in more outlink references than corresponding inlink ones.

4.3. History-dependent transition probabilities

As stated earlier, we can approximately check the validity of the Markovian property by obtaining the set of state transition probabilities conditioned on the previous page visited. The most popular inlinks to this homepage, or the previous pages visited before visiting the homepage, are shown in Table 4. To ensure statistical validity of this approach, we need to restrict our study to those previous pages or inlinks with high reference counts. Because the number of references for each inlink after the engineering.html page is too small to be considered, we focus on the top 4 inlinks to the SMU/SEAS homepage and gathered the corresponding outlink distribution information. The outlink reference counts originating from these 4 inlinks are 1214, 107, 75, and 97, respectively.
Absent from the collection of top inlinks in Table 4 is the case of an empty referring page. This may represent situations where the SMU/SEAS homepage is used as the startup page, as a bookmarked page, or is directly typed in by a user. The total reference count to the SMU/SEAS homepage for such no-referrer (default inlink) cases is 16060. The total corresponding outlink reference count from the SMU/SEAS homepage originating from this default inlink is 4812. Consequently, these empty inlink cases are also considered collectively as a single inlink in this study.
Therefore, we gathered in this paper outlink usage frequencies corresponding to the top four inlinks in Table 4 and the empty inlink. The detailed results can be found in data tables posted online at:
www.engr.smu.edu/tian/data/markovt.pdf

Rank  Page                                                #hits
1     www.smu.edu/academics.html                           1912
2     www.smu.edu/graduate/                                 146
3     www.smu.edu/sitemap.html                              105
4     www.smu.edu/admissions/academics/engineering.html     100
5     search.yahoo.com/bin/search                            49
6     search.msn.com/results.asp                             44
7     www.goto.com/d/search/p/netscape/                      43
8     netfind2.aol.com/results.adp                           38
9     voled.doded.mil/dantes/dl/extdeg.htm                   28
10    ink.yahoo.com/bin/query                                22

Table 4. Top inlinks to /index.html

4.4. Validating Markovian property

A preliminary comparison can be made based on the raw outlink counts presented above. There is a remarkable similarity among the top outlinks. Among the top 10 unconditional outlinks in Table 3 from our Markov chain, 9, 6, 8, 7, and 8 of them also appeared among the conditional top 10 outlinks, respectively, for the five inlinks defined above. In addition, the rest of these unconditional top 10 outlinks are among the top 20 outlinks in the conditional groups. This observation has an important practical implication: If we perform statistical testing by first selecting the top 10 frequently used outlinks from the SMU/SEAS homepage following our Markov chain, it would cover all the frequently used (top 20) such outlinks originating from major inlink sources to this page. Statistical testing based on Markov chains would be a viable and effective way to test such systems and ensure their reliability.
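
The ranking comparison described above amounts to counting how many unconditional top-k outlinks reappear in a conditional top-k list. A minimal sketch follows; the helper names are assumptions, the unconditional counts are the first five rows of Table 3, and the conditional counts are hypothetical placeholders, since the detailed conditional data are only available in the online tables.

```python
def top_k(counts, k):
    """Return the k pages with the largest counts (ties broken by name)."""
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [page for page, _ in ranked[:k]]


def top_k_overlap(unconditional, conditional, k):
    """Count unconditional top-k outlinks that are also conditional top-k."""
    return len(set(top_k(unconditional, k)) & set(top_k(conditional, k)))


# First five rows of Table 3.
unconditional = {
    "/cse/index.html": 1258,
    "/ce/seas/index.html": 1145,
    "/ee/index.html": 1140,
    "/disted/index.html": 547,
    "/gradadmissions/index.html": 433,
}
# Hypothetical conditional counts for one inlink (illustration only).
conditional = {
    "/cse/index.html": 310,
    "/ee/index.html": 258,
    "/me/index.html": 90,
    "/ce/seas/index.html": 88,
    "/disted/index.html": 59,
}
overlap_3 = top_k_overlap(unconditional, conditional, 3)
overlap_5 = top_k_overlap(unconditional, conditional, 5)
print(overlap_3, overlap_5)
```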
From these raw outlink distribution data, we can obtain the transition probabilities. However, individual outlink reference counts can be fairly low. Therefore, to make a valid statistical comparison, we only calculated the top 10 outlink probabilities for the following two cases:
1) Conditional outlink probabilities with an empty referral page as inlink to the SMU/SEAS homepage.
2) Conditional outlink probabilities with www.smu.edu/academics.html as inlink to the SMU/SEAS homepage.
Table 5 compares these probabilities against the unconditional outlink probabilities {pij} from the Markov chain in Table 3. As can be seen from this comparison, the outlink probability distributions are similar regardless of the inlink history.
Both our qualitative comparison based on outlink rankings and our quantitative comparison based on top 10 outlink probabilities point to the same conclusion: Markov chains can provide a fairly accurate characterization of web usage, and thus can be used as the basis for statistical web testing and quality assurance.


                                 inlink
outlink (/ = www.seas.smu.edu)   all (pij)   empty   /academics.html
/cse/index.html                  0.165       0.159   0.256
/ce/seas/index.html              0.150       0.168   0.073
/ee/index.html                   0.150       0.153   0.213
/disted/index.html               0.072       0.069   0.049
/gradadmissions/index.html       0.057       0.057   0.055
/gradinfo.html                   0.041       0.040   0.036
/ecommerce/index.html            0.040       0.041   0.045
/me/index.html                   0.037       0.025   0.074
/infodata.html                   0.031       0.032   0.027
/ugradinfo.html                  0.029       0.019   0

Table 5. State transition probabilities for top 10 outlinks from /index.html and selected inlinks
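
One possible way to summarize the columns of Table 5 numerically is through the largest and the average absolute difference between the unconditional and conditional probabilities. These two summary measures are an illustrative choice and are not measures reported in this study, which compares the columns by inspection; the values are copied from Table 5, and because only the top 10 outlinks are listed, the columns do not sum to one.

```python
# Top-10 outlink probabilities copied from Table 5, in the table's row order.
UNCONDITIONAL = [0.165, 0.150, 0.150, 0.072, 0.057, 0.041, 0.040, 0.037, 0.031, 0.029]
EMPTY_INLINK = [0.159, 0.168, 0.153, 0.069, 0.057, 0.040, 0.041, 0.025, 0.032, 0.019]
ACADEMICS_INLINK = [0.256, 0.073, 0.213, 0.049, 0.055, 0.036, 0.045, 0.074, 0.027, 0.0]


def max_abs_difference(p, q):
    """Largest absolute difference between paired probabilities."""
    return max(abs(a - b) for a, b in zip(p, q))


def mean_abs_difference(p, q):
    """Average absolute difference between paired probabilities."""
    return sum(abs(a - b) for a, b in zip(p, q)) / len(p)


for name, column in [("empty", EMPTY_INLINK), ("academics", ACADEMICS_INLINK)]:
    print(
        name,
        round(max_abs_difference(UNCONDITIONAL, column), 3),
        round(mean_abs_difference(UNCONDITIONAL, column), 4),
    )
```

How small a difference must be to count as "similar" is a judgment left open here; any threshold would be a further assumption.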

5. Conclusion and Perspectives


Our answer to the opening question "Can web usage be accurately modeled by Markov chains?" is affirmative, based on web usage patterns observed at the web site of our university. Our empirical validation of the memoryless property for Markov chains supports the application of Markov chains as the usage model in statistical testing and quality assurance. In addition, our approach only requires a small set of tests based on data that can be easily extracted from web server logs. This approach is more practical than a full test of the memoryless property, which would require prohibitively high cost and a large amount of data.
The SMU/SEAS web site shares many common characteristics with web sites for other educational institutions, making our results meaningful to many similar application environments. However, as an empirical study restricted to academic settings, our study also suffers from various limitations. To overcome these limitations, we plan to obtain some public domain web logs, such as those from the W3C Web Characterization Repository at repository.cs.vt.edu, to collect more empirical evidence to further validate the memoryless property. We also plan to continue our collaboration with industrial partners, including Nortel Networks, IBM, and Lockheed-Martin, to use Markov chains for statistical testing and quality assurance for their software systems.
Another limitation of our study is the relatively short 26-day period covered by our web logs, which does not supply a data set large enough to provide strong evidence of conformance to the memoryless property of Markov chains. However, using longer term data would also be problematic, because web changes and evolution would induce new usage patterns that may obstruct our testing of the memoryless property for Markov chains. A good solution to this problem is to seek web sites with heavier traffic so that a larger amount of data for some stable usage patterns can be analyzed to validate related Markov chains. This can also be done in combination with the activities mentioned above using external web logs.
There are various other problems that we plan to address with larger quantities of data from diverse sources, including the evaluation of overall Markov chain structure and the effect of loops, proper grouping of low use frequency pages to form individual nodes, and integration with other software testing and quality assurance techniques. These extensions to our current study would help us find a more effective strategy for web testing to ensure that web performance and reliability are satisfactory to the massive web user population.

Acknowledgments
This research is supported in part by NSF grants
9733588 and 0204345, THECB/ATP grants 003613-0030-1999 and 003613-0030-2001, and Nortel Networks.
We also thank the other members of our research group,
Gunes Koru, Sunita Rudraraju, Li Ma, and Sudipti Mishra,
for their comments and suggestions.

References
[1] B. Behlandorf. Running a Perfect Web Site with Apache, 2nd Ed. MacMillan Computer Publishing, New York, 1996.
[2] C. Kallepalli and J. Tian. Measuring and modeling usage and reliability for statistical web testing. IEEE Trans. on Software Engineering, 27(11):1023-1036, Nov. 2001.
[3] S. Karlin and H. M. Taylor. A First Course in Stochastic Processes, 2nd Ed. Academic Press, New York, 1975.
[4] A. L. Montgomery and C. Faloutsos. Identifying web browsing trends and patterns. IEEE Computer, 34(7):94-95, July 2001.
[5] J. D. Musa. Software Reliability Engineering. McGraw-Hill, New York, 1998.
[6] R. R. Sarukkai. Link prediction and path analysis using Markov chains. In Proc. 9th International World Wide Web Conference, Amsterdam, the Netherlands, May 2000.
[7] M. Spiliopoulou. Web usage mining for web site evaluation. Communications of the ACM, 43(8):127-134, Aug. 2000.
[8] J. Tian. Integrating time domain and input domain analyses of software reliability using tree-based models. IEEE Trans. on Software Engineering, 21(12):945-958, Dec. 1995.
[9] K. S. Trivedi. Probability and Statistics with Reliability, Queuing, and Computer Science Applications, 2nd Edition. John Wiley and Sons, New York, 2001.
[10] J. A. Whittaker and M. G. Thomason. A Markov chain model for statistical software testing. IEEE Trans. on Software Engineering, 20(10):812-824, Oct. 1994.
