Foundations of Web Technology


FOUNDATIONS OF WEB TECHNOLOGY

THE KLUWER INTERNATIONAL SERIES
IN ENGINEERING AND COMPUTER SCIENCE

FOUNDATIONS OF WEB TECHNOLOGY

by

Ramesh R. Sarukkai
Senior Architect, Yahoo! Inc., USA.

SPRINGER SCIENCE+BUSINESS MEDIA, LLC


Library of Congress Cataloging-in-Publication Data

Sarukkai, Ramesh R.
Foundations of Web Technology
ISBN 978-1-4613-5409-3 ISBN 978-1-4615-1135-9 (eBook)
DOI 10.1007/978-1-4615-1135-9

Copyright © 2002 by Springer Science+Business Media New York


Originally published by Kluwer Academic Publishers in 2002
Softcover reprint of the hardcover 1st edition 2002

All rights reserved. No part of this work may be reproduced, stored in a retrieval
system, or transmitted in any form or by any means, electronic, mechanical,
photocopying, microfilming, recording, or otherwise, without the written
permission from the Publisher, with the exception of any material supplied
specifically for the purpose of being entered and executed on a computer system,
for exclusive use by the purchaser of the work.

Permissions for books published in Europe: permissions@wkap.nl


Permissions for books published in the United States of America: permissions@wkap.com

Printed on acid-free paper.


Dedicated to my mother Santha and my father S.K. Rangarajan
Contents

Contributors xv

Acknowledgements xvii

Preface xix

Part 1 Fundamentals

1 Introduction 3
1. WORLD WIDE WEB 3
2. CORE TECHNOLOGY 4
3. WHAT'S COVERED IN THIS BOOK 4
4. ORGANIZATION OF THE BOOK 6

2 Data Markup 11
1. INTRODUCTION 12
2. DATA MARKUP 13
3. EXTENSIBLE MARKUP LANGUAGE (XML) 17
4. EXTENSIBLE STYLE SHEETS 29
5. XPATH 41
6. HYPERTEXT MARKUP LANGUAGE (HTML) 41
7. CONCLUSION 50
FURTHER READING 50
EXERCISES 50

3 Networking 53
1. INTRODUCTION 54
2. LAYERING OF NETWORKS 54
3. LOCATING ENDPOINTS 55
4. TRANSMISSION PROTOCOLS 59
5. CLIENT/SERVER 64
6. HYPERTEXT TRANSFER PROTOCOL (HTTP) 71
7. WEB SECURITY 78
8. PRIVACY 84
9. CONCLUSION 84
FURTHER READING 85
EXERCISES 85

4 Information Retrieval 87
1. INTRODUCTION 88
2. COMPONENTS OF IR SYSTEM 89
3. TEXT PROCESSING 90
4. INDEXING AND SEARCH 96
5. RANKING 100
6. QUERY OPERATIONS 104
7. LATENT SEMANTIC INDEXING 106
8. EVALUATION METRICS 108
9. CONCLUSIONS 110
FURTHER READING 110
EXERCISES 111
Part II Applications 113

5 Web Search and Directory 115


1. INTRODUCTION 116
2. WEB SEARCH 116
3. VARIATIONS IN SEARCHING 125
4. RANKING 128
5. WEB DIRECTORIES 132
6. CONCLUSION 135
FURTHER READING 136
EXERCISES 136

6 Web Mining 139


1. INTRODUCTION 140
2. DATA MINING 140
3. ASSOCIATION MINING 141
4. PREDICTIVE MODELLING 145

5. CLUSTERING 157
6. OTHER DATA MINING PROBLEMS 165
7. EXAMPLES OF WEB MINING 165
8. CONCLUSION 172
FURTHER READING 172
EXERCISES 173

7 Messaging and Commerce 177


1. INTRODUCTION 178
2. MESSAGING APPLICATIONS 178
3. ELECTRONIC MAIL PROTOCOLS 179
4. IM ARCHITECTURE 184
5. COMMERCE APPLICATIONS 187
6. OVERVIEW OF E-COMMERCE FRAMEWORKS 188
7. EXAMPLE ARCHITECTURE 196
8. CONCLUSION 205
FURTHER READING 205
EXERCISES 206

8 Mobile Access 207


1. INTRODUCTION 208
2. MOBILE COMMUNICATION SYSTEMS 208
3. WIRELESS APPLICATION PROTOCOL 211
4. WIRELESS MARKUP LANGUAGES 214
5. GENERATING WIRELESS CONTENT 221
6. SHORT MESSAGING SERVICE 227
7. EMERGING TRENDS 230
8. CONCLUSION 233
FURTHER READING 233
EXERCISES 234

9 Web Services 237


1. INTRODUCTION 238
2. OVERVIEW OF ARCHITECTURE 238
3. UDDI 241
4. SOAP 242
5. PLATFORMS 244
6. EXAMPLE OF A SERVICE 244
7. LIMITATIONS 248
8. CONCLUSION 249
FURTHER READING 250
EXERCISES 250

10 Conclusion 251
1. REVIEW 251
2. SYSTEM DESIGN OVERVIEW 254
3. LIMITATIONS 256
4. THE FUTURE 257
APPENDIX 261
REFERENCES 271
ACRONYMS 283
INDEX 285
List of Figures

Figure 1. Growth of the World Wide Web 4


Figure 2. Information - Structure - Presentation Separation 13
Figure 3. Graphical representation of information structure 14
Figure 4. History of Markup Languages 16
Figure 5. Illustration of XSL 31
Figure 6. Transforming one XML structure to another XML structure 33
Figure 7. Example of HTML form output 47
Figure 8. Client Server Architecture 64
Figure 9. Illustration of a proxy server scenario 65
Figure 10. Simple example of encrypted message transmission 80
Figure 11. Overview of Information Retrieval System 89
Figure 12. Example of a "Similarity Matrix" 93
Figure 13. Example of a prefix tree 98
Figure 14. Document vectors for the two sample documents 102
Figure 15. Documents Matrix representation A 107
Figure 16. Precision versus Recall Graph 110
Figure 17. Overview of Web search system. 117
Figure 18. Web Crawling System 119
Figure 19. Meta-Search Engine 126
Figure 20. Graph Structure used to illustrate the HITS algorithm 131
Figure 21. Web Directory - fixed taxonomy, but automatic classification. 133
Figure 22. Example of Semi-Automatic Taxonomy Generation 134
Figure 23. Web Graph for exercise (7) 137
Figure 24. Single layer neural network 149
Figure 25. Example of a decision tree for classification 151
Figure 26. Example of a Linear Classifier 156
Figure 27. Illustration of clustering 157
Figure 28. Clustered data samples and the two centroids 164
Figure 29. Overview of an e-mail system. 180
Figure 30. Overview of prototype IM system 185
Figure 31. Prototype E-commerce architecture 197
Figure 32. Pricing and Packaging 198
Figure 33. Subscription module for billing 200
Figure 34. Overview of Global System for Mobile Communication 210
Figure 35. WAP Architecture Overview 211
Figure 36. Two approaches to generating wireless markup 222
Figure 37. Transcoding Proxy Architecture 224
Figure 38. XSLT Approach to Wireless Markup Document Generation 225
Figure 39. Overview of SMS Architecture 228
Figure 40. Web services protocol Stack 240

Figure 41. Web service integration procedure 242


Figure 42. Overview of Web Service Usage 245
List of Tables

Table 1. Layers of data abstraction 13


Table 2. Example XML Document 19
Table 3. Example XML document with external DTD 20
Table 4. Contents of the example DTD 20
Table 5. Example defining attributes for an element. 22
Table 6. XML Schema example 25
Table 7. ComplexType in XML Schema 26
Table 8. Example illustrating reference features in XML Schema 27
Table 9. Illustration of XSL specification in an XML document. 31
Table 10. Example stylesheet definition 32
Table 11. Result XML document when the stylesheet in Table 10 is applied to the XML in Table 9 33
Table 12. Example of XPath expressions 41
Table 13. Example of HTML table rendering 46
Table 14. IP Datagram 56
Table 15. IP Datagram containing TCP segment. 60
Table 16. UDP packets encapsulated in IP datagrams 63
Table 17. HTTP 1.0 Client Request... 71
Table 18. HTTP status codes 73
Table 19. Sample HTTP/1.1 Response codes absent in HTTP/1.0 76
Table 20. HTTP/1.1 Header Fields 76
Table 21. Regular Expression Generator for a simple tokenizer 91
Table 22. Example of stoplist words 92
Table 23. An example of N-Gram Stemming 93
Table 24. Example of Entropy Successor Stemming 95
Table 25. Example of inverted index 96
Table 26. Example of text for prefix tree creation 98
Table 27. Sample documents for Vector Space illustration 101
Table 28. Web Crawling Algorithm 120
Table 29. HITS Algorithm 131
Table 30. Authority scores for iterations of the HITS Algorithm 132
Table 31. The Hub scores for iterations of the HITS Algorithm 132
Table 32. Sample data to illustrate association mining 144
Table 33. Sample Web site ratings table 153
Table 34. Sample data to illustrate classification 154
Table 35. Data Samples to illustrate clustering 163
Table 36. Assignment of data samples to clusters after first iteration 164
Table 37. List of Transactions 173
Table 38. Classification Training data 173
Table 39. Sample clustering data for exercise 6 174

Table 40. Collaborative Filtering data for exercise 8 175


Table 41. Example of an SMTP session 181
Table 42. IFX gateway/service provider functional component stack. 195
Table 43. Example of an IFX document. 196
Table 44. WAP Protocol Stack 212
Table 45. "Hello world" WML example 215
Table 46. WML Example illustrating transitions from one card to the next 216
Table 47. Example of anchored text. 217
Table 48. Example of input collection and submission to backend server. 218
Table 49. Example XML Document 226
Table 50. Example XSL stylesheet for generating WML. 226
Table 51. Steps involved in transmission of an SMS message to a mobile
device (GSM) 229
Table 52. Example WSDL document 247
Contributors

Dr. Ramesh Rangarajan Sarukkai is currently a senior architect at Yahoo!
Inc., and has worked at Lernout & Hauspie Inc. and the IBM T.J. Watson
Research Center. He has successfully led many projects to completion, and
developed award-winning products used by millions of users.
He holds M.S. and Ph.D. degrees from the University of Rochester,
Rochester, NY, and a B.E. degree from Visveshvaraya College (UVCE,
Bangalore University), India. Dr. Sarukkai's first paper, on a novel approach
to automatic character recognition based on an independent project during
high school, appeared in the reputed journal Pattern Recognition. Dr.
Sarukkai has continued R&D in many areas such as AI, speech recognition,
information retrieval, networking, wireless and Web technology, and has
published in leading journals and conferences such as Computer Networks, many
IEEE Transactions, Neural Computation (MIT Press), Computers &
Graphics, and Pattern Recognition. In addition, Dr. Sarukkai holds many
patents (awarded and pending) in the above areas, including an early patent
on Web technology co-invented with Hewlett-Packard Labs in 1996. He has
served as a reviewer for various leading journals and conferences, and on
working groups such as the World Wide Web Consortium's (W3C) Voice
Browser Activity.
Acknowledgements

Writing a book that encompasses a wide range of topics requires a lot of
feedback and constructive suggestions for improvements on various fronts.
Without the time and efforts of the reviewers and the numerous colleagues
who spent their valuable time to discuss, read, and edit my manuscript, this
book wouldn't exist in its current form.

Firstly, the anonymous reviewers gave useful feedback on the book
proposal. Next, I would like to thank the following for their time, effort and
valuable feedback on the contents of the book: Prof. Dana Ballard (Univ. of
Rochester) for his ever creative and insightful comments, Dr. Dave Raggett
(W3C/OpenWave) for meticulous editing and excellent points, Prof. Mark
Crovella (Boston Univ.) for constructive suggestions, Dr. Udi Manber
(Yahoo! Inc.) for useful guidance on topic selection and book writing in
addition to technical feedback, Dr. Sanjeev Dharap (Yahoo! Inc.) for many
discussions and comments, Raghuveer Chakravarthi (Yahoo! Inc.) for the
architecture overview and discussions on Instant Messaging, and Kian-Tat
Lin (Yahoo! Inc.) for useful comments on Data/Web mining. I have also
benefited greatly from useful discussions with many at Yahoo! Inc. which
have influenced the contents of the book, including Dr. Anurag Mendhekar,
Madhu Yarlagadda, Ash Patel, Sanjay Rao, Dr. Qi Lu, Venkat
Panchapakesan and other colleagues.

Without the blessings of God, and the strong support of my parents, I
would never be able to achieve anything in my life. I am deeply thankful to
my father Prof. S. K. Rangarajan for his constant guidance, his
encouragement to pursue creative endeavours and never stop learning, and his

positive attitude. My mother Santha has always been supportive and
constantly pushed me to aim higher. I also thank my siblings Sekhar and
Sundar for their advice. Sekhar had many insightful comments and
suggestions on my book. Sundar, the writer in the family, encouraged me to
seriously pursue writing a book, and I thank him for that (esp. his advice:
"your words are just around the corner. Just write them down!"). Last but
not least, without the love, constant support, and encouragement of my
beautiful wife Ramya, this book would never exist. In addition to putting up
with late-night writing schedules, Ramya has given insightful and practical
feedback on the organization and technical aspects of the book, and I am deeply
indebted to her for that.

The editorial staff at Kluwer have been very supportive and helpful with
my many simple formatting and editorial questions: I would like to
especially thank Sharon Palleschi and Susan Lagerstrom-Fife for their
prompt responses and formatting support. I cannot list the many other
people who have shaped or influenced my thoughts and thus the contents of
this book, but I thank them for that, and I apologize for any unintentional
omissions. While all the good in this book is attributable to the feedback,
discussions and support of many people, any error or fault is my own.

The poetic quotations at the beginning of the preface and chapters are
from the work "Fireflies" by Nobel Laureate Rabindranath Tagore.
Preface

"Birth is from the mystery ofnight into the greater mystery ofday"

My idea of writing a book on the concepts that power the Web started in
1999. The huge growth of the Web has fuelled a variety of applications that
are used by millions around the world. Despite the ups and downs in
the dot-com business world, Web technology has solidified over the last decade.
The applications that power the Web derive their strengths from a diverse
range of technologies such as information retrieval and mobile data access.

Most books on the Web focus on programmatic aspects of languages such as
Java and JavaScript, or on descriptions of standards such as the Hypertext Markup
Language (HTML) or the Wireless Markup Language (WML). A book that
covers the concepts behind the infrastructure of the Web would be
indispensable to a wide range of audiences interested in learning how the
Web works, how techniques in Web technology can be applied to their own
problems, and what the emergent technological trends in these areas are. This
motivated me to write a book that covers the "Foundations of Web
Technology", ranging from fundamental areas such as information retrieval
and data markup to applications such as Web search, instant messaging, mobile
access and Web services. I believe that this book will be useful for a
number of years to come, since Web technology has matured considerably,
and the concepts discussed in this book will continue to be applied
universally.

Audience

This book has been written to appeal to a wide range of audiences. For a
person interested in understanding the basic concepts of Web technology,
this book covers the fundamentals and the techniques needed to build Web
applications. For the professional who has worked on specific parts of Web
or related technology, this book provides a broad understanding of the
architecture of different applications on the Web, and how they relate to
each other. The techniques are discussed at both a conceptual and a
practical level, so that the ideas discussed can be translated into real-world
prototypes. The pedagogical style of the book, coupled with the numerous
examples, illustrations and exercises, makes the content accessible to a wide
variety of audiences.

Course Textbook

This book is compelling as a textbook for a course that covers the
foundations of Web technology. Each chapter has a set of exercises that
cover conceptual and theoretical questions as well as projects. The "Further
Reading" section in each chapter is a good starting point for going deeper into
the topic covered in that chapter. This book can also serve as a base for a
seminar course on Web technology. The book is written for any student with
an engineering background, although programming skills and preliminary
coursework in computer science are preferable. This book is suitable for an
engineering student at the senior undergraduate or graduate level.
Prerequisites for such a course are basic undergraduate-level computer
courses (e.g. computer organization, data structures) and math courses
(e.g. calculus, vector analysis).
PART I

FUNDAMENTALS
Chapter 1

Introduction

'Let me light my lamp', says the star. 'And never debate if it will help to remove the darkness'

1. WORLD WIDE WEB

The World Wide Web has grown phenomenally over the last decade.
Ranging from the growth of Web servers across the Web and the rapid adoption
of the Hyper-Text Markup Language, to the availability of information in the
form of hundreds of millions of Web pages that are linked to each other, the
Web has changed the way in which information is accessed, how people
communicate with each other, and how buying and selling activities are
accomplished with electronic commerce. With such rapid growth and
adoption of the Web, technology has been struggling to keep up with
emerging standards, a highly competitive marketplace, and a maturing
technology. Over the last decade, Web technology has transformed into a
unique field of study. While the underlying technologies are distributed
systems, networking, information retrieval and security, the specialization
and application of these technologies to build scalable Web systems makes
the study of Web technology unique.

Web technology is pervasive, compelling, and here to stay. The concepts
are general and extend beyond even today's World Wide Web. The
problems are not unique to the Web, but have been identified in other fields
such as distributed data representation, caching, and even pattern matching
technologies. Despite the ups and downs in the business world, the Web is
an integral part of our life, and its applications are undeniably indispensable
tools to society, in a manner much like the telephone a few decades ago.
Figure 1 shows an estimate of the number of Web documents indexed by
popular Web search engines. It can be seen that the Web has grown from a

few million to billions of documents, and efforts continue to provide
accessibility to these billions of useful documents distributed around the
world.

[Bar chart: number of Web pages indexed by search engines (in millions), for the years 1995, 1997, 1999, and 2001]

Figure 1. Growth of the World Wide Web.

2. CORE TECHNOLOGY

How does one build scalable Web applications? What are the underlying
technologies that make the Web work? What are the issues and problems
that need to be addressed? These are some of the questions that are answered
in this book. Unlike many other fields where research and development are
focussed on a specific problem or field of interest, the Web is derived from a
diverse set of technologies. There is no single area of specialization that will
suffice to build Web applications. Rather, there are many underlying
technologies that make the Web work, each with specialized studies to
develop and tailor algorithms to each Web application.

3. WHAT'S COVERED IN THIS BOOK

The term "Web Technology" covers a wide variety of topics. The


fundamental topics identified and discussed in this book are data markup,
networking, and information retrieval. Data markup refers to the
standardization in representation of data, and methods for the transformation
from one data markup language to another. This is an important base for the
Web since the key to properly functioning distributed, collaborative systems

is a good representation system that can be formally defined, verified and
transformed.

The second fundamental area discussed in this book is networking.
Networking is a very broad subject, and the portions that are most relevant to
the Web are covered here. Concepts such as the TCP/IP protocols, the notion of
clients and Web servers, distributed caching and proxy servers, mechanisms
for achieving security and privacy, and protocols such as the Hyper-Text
Transfer Protocol are covered.

The third fundamental area covered in this book is information retrieval
and text processing. While Web content consists of a large collection of
text and multimedia, current technology is dominated by textual access to
the Web. At the root of such systems is the field of text retrieval, which
encompasses methods for processing, indexing and efficiently searching
large repositories of (textual) documents.

Other topics that may be considered fundamental include data
compression/encryption technology and distributed database systems. Data
compression and encryption are specialized fields, and we felt those
algorithms are beyond the scope of this book. Distributed databases play an
important role in the design and development of Web applications. However,
we believe that the study of these fields in isolation, and their advancement
in the context of other (non-Web related) applications, are applicable to Web
development, and thus they do not merit special mention in this book.

Now let's turn our attention to Web applications. The set of Web
applications discussed in this book includes:
a. Directory & Search
b. Web Mining
c. Messaging & Commerce
d. Mobile Access
e. Web Services

Why did we limit our study to the above areas? Why did we not include
many other applications such as streaming and broadcast services,
personalization, listings, maps, auctions, media, finance, news, and business-
to-business? Each application has its own unique problem-specific issues and
technological hurdles. The applications discussed in this book are chosen to
be representative of Web applications at large.

4. ORGANIZATION OF THE BOOK

This book is divided into two parts: fundamentals and applications. The
fundamentals part covers the basic technology that drives much of the Web
development. The fundamentals part consists of the following chapters:

• Data Markup
The chapter on Data Markup motivates the need for a standard
representation of data. Since the Web is a resource for a vast amount
of information, it is vital that this information be represented in a
format that is useful for exchange between businesses, vendors, or
even just clients. The eXtensible Markup Language (XML) has been defined
for this purpose. Another important aspect of data representation
systems is the need for transforming data specified in one form to
another, and for the presentation of this data. The eXtensible Style Sheets
(XSL) approach to doing this is illustrated with examples. The
structural integrity of XML documents is maintained by defining the
corresponding "Document Type Definition" (DTD) or "Schema".
The Hyper-Text Markup Language (HTML) that fuelled the
widespread adoption of the Web falls into the same category as a
data markup language.

• Networking
Networking is the backbone of the Web. How do two end-points
identify each other on the Web? What protocols do they use to
communicate with each other? How do Web systems ensure security
and privacy between communicating parties? What is the
client/server architecture? What is a "proxy", and how does one
distribute and replicate data over the Web? What are some
communication protocols used in distributed cache systems? These
are some of the issues presented in the chapter on "Networking".
Protocols such as the TCP/IP suite, HTTP, and SSL, and Web security
methods, are discussed in this chapter. Protection of privacy using the
Platform for Privacy Preferences (P3P) is also summarized.

• Information Retrieval
The third fundamental area that is an integral part of many Web
applications is information retrieval and text analysis. While
information retrieval is a broad subject with applications ranging
from text analysis to speech/multimedia indexing, the techniques
used for processing, indexing and retrieving documents from textual
databases are presented in this chapter. The steps involved in text
retrieval systems, such as stopword elimination, indexing and search,
ranking, and query expansion, as well as advanced techniques such as Latent
Semantic Indexing (LSI), are illustrated with examples in this chapter.

The second part consists of Web applications. A subset of the vast
number of Web applications has been chosen as representative, and is
discussed in the chapters of the second part. The Web applications covered
include:

• Web Search & Directory


Web directory and search are the earliest applications that drove
users to the Web. While a lot of the techniques are derived from
information retrieval, this chapter gives an overview of a Web search
system. How do systems crawl the Web to retrieve documents?
How are these gigabytes of crawled data indexed and stored for
retrieval? What are the issues in Web search, and how are they
addressed? How do you build useful Web directories? How do you
increase the relevancy of the returned results? How can we identify
certain Web sites as being more important than others, for instance using
the "PageRank™" algorithm or the "Hub-Authority" algorithm?

• Web Mining
The Web consists of hundreds of millions of documents, with
billions of page views and keyword searches every day. Furthermore,
it is possible to track users' interests, sites visited, goods purchased
online and various other information. Such information can be mined
using data mining techniques in order to determine trends and general
user interests, and to enhance the personalized Web access features offered to
the end consumer. Of course, the most important aspect of Web
mining is ensuring that users' privacy is respected, and that the data
collected with the user's authority is protected from misuse. An
overview of data mining techniques, including association mining,
classification, clustering, and sequence matching, is presented with
examples. How such techniques are applied to the Web is discussed
in this chapter on Web mining. Some of the applications discussed
include server log analysis, link prediction and recommendation
systems.

• Messaging & Commerce


Messaging and communication applications are highly successful
applications on the Web. Although e-mail was prevalent in the initial
stages of the Internet, its widespread adoption and use grew when it
was integrated with the Web, making it easy for people to manage
and use their messaging facilities. Instant messaging and chat are
other prominent messaging applications that handle millions of
users. An Instant Messaging (IM) system is discussed to illustrate the
design issues involved.

Commerce is another major area of application on the Web. E-
commerce opens up a huge opportunity (both for buyers and
sellers) to shop online, find the best bargains, negotiate for the best
pricing on auction sites, build online stores, and aggregate content in
shopping portals. What are the building blocks of e-commerce
platforms? What are the standards for enabling electronic
transactions? Such questions are covered in the chapter on
"Messaging and Commerce".

• Mobile Access
One of the trends that emerged in the last few years is the notion
of mobile access to Web information. The Mobile Web (or "Wireless
Web", as it is sometimes called) is a combination of access to Web
information and the notion of mobility. With wireless devices, users
are able to be away from their computers, yet have access to
information from the Web. How does the integration of mobile and
Web technology work? As an illustration, the Wireless Application
Protocol suite is discussed along with the wireless markup languages.
Other wireless messaging techniques such as the Short Messaging
Service (SMS) are also covered.

• Web Services
The last application discussed in this book is the emerging notion
of "Web Services". Although all the applications discussed in the
earlier chapters are services on the Web, the term "Web services" is
used to refer to a more generalized and formalized notion of
providing services on the Web. In the Web services framework, the
main objectives are remote execution, platform independence and
ease of integration. In order to cater to such requirements, the Web
services architecture utilizes the following abstraction layers: service
definition, service discovery, the transport layer, and the execution
environment. The Universal Description, Discovery and Integration
(UDDI) registry enables the registration of service descriptions.
Service definition is achieved using the Web Services Description
Language (WSDL). At the transport layer, protocols such as the Simple
Object Access Protocol (SOAP) are used to exchange information in a
distributed, de-centralized environment. Examples of Web services,
and issues in the development and adoption of Web services, are
covered in this chapter.

The final chapter is the conclusion, which summarizes the topics
discussed in this book and highlights some future directions in Web
technology. The Appendix lists useful information for the actual
implementation of the exercises and projects. The glossary is useful for
looking up a set of frequent acronyms used in this book.
Chapter 2

Data Markup

Form is in matter, rhythm in force, meaning in person

Abstract: Structural specification of information is of paramount importance to the Web.
Since information exchange and processing is a central aspect of Web
applications, it is essential to have standardized representation and
specification of data, in addition to mechanisms for transforming data from one
representation to another. The eXtensible Markup Language (XML) is a structured
language for defining document structures, and the eXtensible Style Sheet
Transformation (XSLT) is a language that enables the specification of
transformations from one XML language to another. The application of style
sheets to the presentation of documents is also discussed. The Hyper Text Markup
Language (HTML) that fuelled the growth of the Web is presented.

Keywords: Data representation, eXtensible Markup Language (XML), eXtensible Style
Sheets (XSL), eXtensible Style Sheet Transformation (XSLT), data
transformation, business to business (B2B), HyperText Markup Language
(HTML)


1. INTRODUCTION

1.1 Communication of information

Structural description is an important aspect of the communication of
information. Even in the evolution of human communication, various forms
of written and spoken languages exhibit well-defined syntactic rules. For
instance, the earliest form of writing dates back to ancient man: cave
drawings called petroglyphs created over twenty thousand years ago in parts
of Spain. What is fascinating is that the drawings are interpretable even
today. This is possible due to the pictorial nature of the cave drawings.
Communication in this form of writing evolved through pictograms, where
the actual intent was identified by a direct image of the object in question.
As the need for complex forms of writing increased, abstract representations
of objects were invented, and the notion of ideograms emerged. The Sumerians
developed one of the earliest forms of cuneiform writing. Soon notions such
as word writing and syllabic writing emerged (e.g. from the Persians in
600-400 B.C.). Over the centuries, writing systems evolved into the modern
form of alphabetic, cursive writing, and now to communication between
machines across networks through coded digital signals.

An important observation that can be made from the history of human
communication is the notion of clearly defined structure and linguistic
rules, the ability to transfer these rules, and mechanisms for translating from
one writing system to another. Another layer of abstraction is the actual
presentation style of the written material, such as cursive styles of writing.
At some level, such principles apply to modern technology such as the
Web. The World Wide Web consists of a vast amount of information
distributed around the world in different forms. An integral requirement of a
large-scale distributed information resource is the ability to communicate
and present this information from the Web to other systems or users.

1.2 Layers of Data Abstraction

Information transmission across the Internet can be abstracted at various
layers. At the lowest layers, information travels across the wires as analog
signals, which are digitized into binary states (binary digits, or bits). Binary
digits are packaged into groups of eight to form bytes, and operated upon by
microprocessors. Data can be encoded into different formats such as ASCII
(which represents codes for letters in the English alphabet along with
alphanumeric and special characters) and Unicode (which is a representation
for many international languages). At the next layer of abstraction, the
information can include structural aspects of the information being
transmitted. This is illustrated in Table 1 below.

Table 1. Layers of data abstraction

Abstraction Layer (XML)
Coding Layer (ASCII/Unicode)
Bytes
Bits
Signals
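A small Python sketch can make these layers concrete (the use of Python here is purely illustrative and not part of the original text): starting from a string, it shows the Unicode code points, the encoded bytes, and the underlying bits.

text = "Web"
code_points = [ord(c) for c in text]        # abstraction: Unicode code points [87, 101, 98]
data = text.encode("utf-8")                 # coding layer: the bytes b'Web'
bits = " ".join(f"{b:08b}" for b in data)   # bit layer: '01010111 01100101 01100010'
print(code_points, list(data), bits)

The signal layer below the bits belongs to the physical network and is not visible from application code.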

2. DATA MARKUP

2.1 Separation of structure and presentation

The primary motivation of data markup languages is the separation of
information from structure and presentation, as illustrated in Figure 2
below.

[Diagram: information at the centre, with its structure and its presentation as separate concerns]

Figure 2. Information - Structure - Presentation Separation

The separation of structure and presentation from the actual content of
the information is best elucidated with an example. Consider the following
statement:

Albert Tan M.D. lives at 1234 Main Street, New York, New York.

The sentence contains information about "Albert Tan": his professional
degree and his residential address.

The same information can be structurally decomposed as follows:

Name: Albert Tan


Professional Degree: M.D.
Residential Address: 1234 Main Street, New York, New York.

The residential address can be further decomposed into street, city, and
state.

Residential Address:
Street: 1234 Main Street
City: New York
State: New York

The structural hierarchy can be represented graphically as follows:

[Tree diagram: "Information" at the root, with Name, Degree, and Address as children; Address is further decomposed into Street, City, and State]

Figure 3. Graphical representation of information structure.



Next, let us consider the rendering of this information. One format may
be to highlight the name in bold font and underline the address:

Albert Tan M.D. lives at 1234 Main Street, New York, New York.

Another format for this sentence is shown below:

Albert Tan M.D. lives at 1234 Main Street, New York, New York.

It should be apparent from the above examples that information can be
decomposed structurally, and that this information can be rendered differently
based on the required rendering "style".

2.2 Need For Data Transformations

Another important aspect that drives the applicability of data markup is
the need to translate from one data representation to another. This is
especially true in the business-to-business (B2B) world.
Company A manufactures goods and maintains its inventory in a particular format.
Company B acts as an "agent" in selling the goods to the end user, billing the
customer and passing the shipping address on to company A. Companies A
and B need to communicate with each other to ensure that the products are
available, and that the appropriate billing requirements and charges occur. With a
structured data markup mechanism, B can transform A's data into a format it can
process, use the data, and then generate resulting data that A can transform
back into its own format. The power of data transformations is very apparent
in the business-to-business world. We will discuss some of the
commerce standardization efforts in a later chapter.

2.3 History of Markup Languages


[Diagram: markup languages plotted from printer-oriented to data-oriented - RTF (printers), SGML (display), HTML (structure), XML (data) - along a timeline from 1970 to 2001]

Figure 4. History of Markup Languages

Figure 4 summarizes the timeline of some popular markup languages.
The primary applications of early markup languages were formatting document
text for printing and publishing. Towards this end, Microsoft's Rich Text
Format (RTF) allowed the specification of the size and attributes of text within
documents. In 1986, work from IBM resulted in the formal specification of
the Standard Generalized Markup Language (SGML). SGML defined a
systematic method of describing document structure. This was a major step
away from tying the markup to rendering on a specific device, making
it more relevant to the document itself.

SGML introduced the notion of "start" and "end" tags. For instance, a
particular structural component (e.g. a title) is specified by embedding it in
angled brackets, as in <title>. All text following this start tag is classified
as part of that structural component. The end of this component is denoted
by embedding the tag between "</" and ">", as in </title>. In the example
below, the title is specified by the text "Introduction to Information Theory":

<title> Introduction to Information Theory </title>

A few years after the advent of SGML, the Hypertext Markup Language
(HTML) was introduced and fuelled the Web revolution. During the 1990s,
the importance of formal specification of such languages was realized,
which led to the development of the eXtensible Markup Language (XML),
formally published by the World Wide Web Consortium (W3C).

3. EXTENSIBLE MARKUP LANGUAGE (XML)

3.1 What is XML?

The eXtensible Markup Language is a language specification that allows the
definition of other markup languages. In a sense, XML can be thought of as
the mother of all other markup languages. As mentioned earlier, it is
important to separate content from presentation, and XML provides a
mechanism for achieving this separation.

The key aspects of XML are:

• It is a standard language for the definition of other languages.
• XML documents can be checked to see if they are valid documents.
• XML allows the hierarchical organization of information structure.

3.2 Tags

The core components of XML documents are "tags", more often
termed "elements". These basic building blocks allow the specification of
the structure of the presented data, along with additional information
associated with the "tagged" data.

Let us go back to our earlier example:

Name: Albert Tan
Professional Degree: M.D.
Residential Address: 1234 Main Street, New York, New York.

Let us call this whole information structure an "EmployeeRecord".
Now, we can rewrite the above information as follows:


<EmployeeRecord>
<Name> Albert Tan </Name>
<Degree> M.D. </Degree>

<Address> 1234, Main Street, New York, New York </Address>


</EmployeeRecord>

One can notice several aspects of the above XML format of the
information. Each tag is enclosed in "<" and ">" symbols. A start tag is
written directly between "<" and ">", as in <Name>; the corresponding end
tag adds a slash after the "<", as in </Name>. Information is specified
between the relevant start and end tags.
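To make the tag structure concrete, here is a minimal sketch of parsing the EmployeeRecord document with Python's standard xml.etree.ElementTree module (the choice of Python is illustrative, not part of the original text):

import xml.etree.ElementTree as ET

doc = """<EmployeeRecord>
  <Name>Albert Tan</Name>
  <Degree>M.D.</Degree>
  <Address>1234, Main Street, New York, New York</Address>
</EmployeeRecord>"""

root = ET.fromstring(doc)          # the root element: EmployeeRecord
print(root.tag)                    # prints: EmployeeRecord
for child in root:                 # its children: Name, Degree, Address
    print(child.tag, "=", child.text.strip())

Each child element exposes its tag name and the text enclosed between its start and end tags, mirroring the structure described above.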

3.3 Defining Hierarchies

The important concept in XML is the notion of hierarchy or structure.
Note that in Figure 3, we demonstrated a graphical structure of information
that was structurally decomposed into a hierarchy. The hierarchy in the
example shows "Information" to be the root of the tree. "Name", "Degree"
and "Address" are children of the root node. The Address node is further
decomposed into "Street", "City" and "State".

What the hierarchy imposes on the structure of the XML document is
that child elements of a node can only be embedded within the tag for that
node. Thus, "Name" information can be nested within an "EmployeeRecord",
and not outside it. Similarly, Address tags cannot be embedded within the
Degree tags. If the Address tag can be divided into "Street", "City" and
"State" tags, then the address information can be decomposed as follows:

<Address>
<Street> 1234, Main Street </Street>
<City> New York </City>
<State> New York </State>
</Address>

XML allows the definition of such hierarchies, and ensures that the
hierarchy is adhered to.

3.4 Defining Documents using DTD

Until now, we have defined the notion of hierarchy and tags, and how XML
documents adhere to the structure imposed on them. But how does one
impose or define the XML structure? In order to define the structure of
XML documents, a Document Type Definition must be defined and
specified for each XML document. An alternative specification is the
Schema, which is discussed in a later sub-section.

The Document Type Definition (DTD) allows the specification of the
tags that are allowed in the defined XML language, the hierarchy or ordering
of those tags, any special information that needs to be associated
with each tag, and other program or character directives. The DTD must
be specified in or referenced from each XML document, which enables the
validation of the XML document using the appropriate DTD.

We can expand our example to include a DTD as shown in Table 2:

Table 2. Example XML Document

<?xml version="1.0"?>
<!-- Document Definition Starts here -->
<!DOCTYPE EmployeeRecord [
<!ELEMENT EmployeeRecord (Name | Degree | Address)+>
<!ELEMENT Name (#PCDATA)>
<!ELEMENT Degree (#PCDATA)>
<!ELEMENT Address (#PCDATA)>
]>
<!-- Document Definition Ends here -->

<!-- Root tag starts here -->
<EmployeeRecord>
<Name> Albert Tan </Name>
<Degree> M.D. </Degree>
<Address> 1234, Main Street, New York, New York </Address>
</EmployeeRecord>

The first line of the XML document is the XML processing instruction
that specifies the version of XML used. Additionally, other information such
as the character encoding used in the document may be specified. All
declarations start with <! and end with >. The DOCTYPE declaration allows the
document definition to be embedded within the XML document. In this
example, the document is defined to contain the tag EmployeeRecord, which
contains the tags Name, Degree, and Address. Following the document
definition, the actual XML document content follows.

Following the DTD, the actual XML content follows, starting with the
root start tag. In this case, the root element is the "EmployeeRecord". Within
the EmployeeRecord start and end tags, the Name, Degree and Address

elements are embedded. Note that XML is case sensitive, so one cannot
mix lower-case and upper-case start and end tags. Furthermore, some special
characters need to be escaped: < becomes &lt;, > becomes &gt;, & becomes
&amp;, and so on. Comments can be embedded within <!-- and --> as shown
in the example.

The DTD can also be referenced using a Uniform Resource Identifier
(URI; see the chapter on networking), in which case the DTD is not embedded
within the XML document, as shown in Table 3:

Table 3. Example XML document with external DTD

<?xml version="1.0"?>
<!DOCTYPE EmployeeRecord SYSTEM "ER.dtd">
<EmployeeRecord>
<Name> Albert Tan </Name>
<Degree> M.D. </Degree>
<Address> 1234, Main Street, New York, New York </Address>
</EmployeeRecord>

The contents of the Document Type Definition file ER.dtd are
shown in Table 4:

Table 4. Contents of the example DTD

<!ELEMENT EmployeeRecord (Name | Degree | Address)+>
<!ELEMENT Name (#PCDATA)>
<!ELEMENT Degree (#PCDATA)>
<!ELEMENT Address (#PCDATA)>

3.4.1 Defining Elements

The syntax for defining elements in a DTD is illustrated in the example
XML document (see Table 2). One can note that an ELEMENT definition
contains three parts: the first is the ELEMENT directive, the second is the name
of the tag, and the last is the content of the tag. The content can be a rule over
other tags, or can specify certain types of data such as PCDATA. PCDATA

refers to Parsed Character Data, which does not contain unexpanded markup or
entity references (see the next subsection). In our example, the tag
EmployeeRecord is defined to contain (Name | Degree | Address)+. This is
a regular expression that allows one or more tags (of type Name, Degree,
or Address) to be embedded within the EmployeeRecord tag. Other
content types include ANY and EMPTY. The ANY type allows content that is
composed of any mixture of elements in the DTD or character data. EMPTY
refers to tags that do not have any content associated with them.

For instance, the element "Student" can be defined EMPTY as follows:

<!ELEMENT Student EMPTY>

and can be referenced in the XML document as:

<Student/>

or

<Student> </Student>

3.4.2 Attributes

It is often useful to associate information pertinent to certain elements.
Let us go back to our Employee Record example. The Address
element can be associated with other information, such as the date of
residence at that address. In XML, this "attribute" of the Address tag can be
denoted as follows:

<Address ResidenceSince="01/01/2000">
1234, Main Street, New York, New York
</Address>

The name of the attribute denoting the date of residency is
"ResidenceSince". It is clear that this attribute is associated with the tag Address and
the appropriate data encapsulated by that tag. Attributes are listed in the start
tag within the "<" and ">" symbols, and multiple attributes are separated by
spaces:

<Address ResidenceSince="01/01/2000" ResidenceType="Apartment">


1234 Main Street, New York, New York

</Address>

In the above example, the element "Address" has two attributes:
ResidenceSince and ResidenceType. Next, let us illustrate how attributes are
defined in the document definition:

Table 5. Example defining attributes for an element.

<?xml version="1.0"?>
<!-- Document Definition Starts here -->
<!DOCTYPE EmployeeRecord [
<!ELEMENT EmployeeRecord (Name | Degree | Address)+>
<!ELEMENT Name (#PCDATA)>
<!ELEMENT Degree (#PCDATA)>
<!-- Attributes for Element Address added below -->
<!ELEMENT Address (#PCDATA)>
<!ATTLIST Address ResidenceSince CDATA #REQUIRED>
]>
<!-- Document Definition Ends here -->

<!-- Root tag starts here -->
<EmployeeRecord>
<Name> Albert Tan </Name>
<Degree> M.D. </Degree>
<Address ResidenceSince="May-15-1999"> 1234, Main Street,
New York, New York </Address>
</EmployeeRecord>

A new line has been added in Table 5:

<!ATTLIST Address ResidenceSince CDATA #REQUIRED>

The ATTLIST directive specifies that an attribute is being defined. The
second word refers to the element for which this attribute is defined. Following
the element name is the name of the attribute, "ResidenceSince" in this case.
Next, the type of the attribute is defined: CDATA refers to any character
data. The last entry is the default declaration of the attribute: #REQUIRED,
#IMPLIED, #FIXED, or a literal default value.

In the example above, we defined the attribute type for "ResidenceSince"
to be character data (CDATA). Other attribute types are a choice
from a list of values, entity/entities (defined later), ID/IDREF/IDREFS,

NMTOKEN, and NOTATION. An attribute type drawn from a list of choices is useful
when the value of the attribute needs to come from a restricted set of values:

<Name Gender="Male"> Albert </Name>

In the above example, the attribute Gender has been defined for the tag
Name. The value of this attribute can be restricted to the two values "Male" and
"Female" using the following definition:

<!ATTLIST Name Gender (Male | Female) #REQUIRED>

The attribute type ID restricts the value of the attribute to be unique for each
instance of the element. IDREF allows the attribute to refer to a previously
defined ID (although just once), and IDREFS allows multiple references to
previously defined attribute IDs. An example of the use of ID is for
identification purposes:

<Name License="A3345"> Christopher Jones </Name>

If we want the License attribute to be unique for each instance of the
Name element, then this can be defined as:

<!ATTLIST Name License ID #REQUIRED>

NMTOKEN restricts attribute value strings from containing
whitespace, and NMTOKENS can be a sequence of NMTOKENs separated by
whitespace. NOTATION restricts the value of the attribute to previously
declared NOTATIONs. ENTITY and ENTITIES are discussed in the next
subsection, and can also be attribute types.

In the above examples, you will note that the last value is always
"#REQUIRED". This refers to the default declaration of the attribute.
#REQUIRED specifies that the value of the attribute must be explicitly
specified (i.e. it may not be absent). Attribute values can be implied, in
which case the parser decides what value can be the default, and the attribute
may be omitted in the XML document. Implied attributes are specified using
the #IMPLIED directive. Constant attribute defaults are specified using the
#FIXED directive.

<!ATTLIST Name Married CDATA #FIXED "single">

When the attribute "Married" is defined as above, its value in the


following instance of the "Name" tag is defaulted to "single":

<Name> Scott </Name>

Finally, an attribute may instead be given a literal default value in the
ATTLIST declaration, in which case the DTD supplies that value whenever
the attribute is absent.
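Attributes declared this way surface naturally in any XML parser. Below is a minimal sketch using Python's standard xml.etree.ElementTree module (an illustrative choice, not from the text):

import xml.etree.ElementTree as ET

elem = ET.fromstring(
    '<Address ResidenceSince="01/01/2000" ResidenceType="Apartment">'
    '1234 Main Street, New York, New York</Address>')

print(elem.get("ResidenceSince"))   # prints: 01/01/2000
print(elem.get("ResidenceType"))    # prints: Apartment
print(elem.attrib)                  # dictionary of all attributes on the element

Note that a non-validating parser such as this one will not enforce #REQUIRED or apply defaults; that is the job of a validating parser armed with the DTD.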

3.4.3 Character Data Definition

It is often useful to embed text or program code inside XML documents that
one would want the parser to ignore. This can be programming
directives, or just character data containing special characters that cannot be
properly processed by XML parsers. The CDATA directive is useful for
achieving this:

<Name>
<![CDATA[
Laurel & Hardy
]]>
</Name>

In the above example, the text "Laurel & Hardy" is not processed by the
XML parser (thus the & need not be escaped to "&amp;"). Another use of
CDATA sections is to embed programs (e.g. scripting) that allow
more programmatic control of the XML data.

3.4.4 Entities

Entity definitions in a DTD act as "global macros". Entities are best
illustrated by the following example:

<!ENTITY CCode "CSDDAE">

<CompanyCode> &CCode; </CompanyCode>

In this example, &CCode; is replaced by the defined value "CSDDAE".
Entities can also be included from externally defined documents and
embedded.

3.5 Defining Documents using Schemas

DTDs are derived from SGML, and have several shortcomings. Firstly,
DTD syntax inherits a somewhat obscure use of special characters, such as the #
used in defining the document structure. SGML was mainly developed for
text markup, and thus DTDs also cater to text markup applications. In
particular, DTDs are not easily applicable to XML application areas such as
object serialization, databases, and inter-process communication. Attribute
typing in DTDs is also very restricted. Schemas were defined to address the
shortcomings of DTDs, and the schema itself is written in XML.

The advantage of XML Schema is best illustrated with the following
example. The DTD for an element is as shown:

<!ELEMENT Address (#PCDATA)>

The schema for the same element can be defined as:

<xsd:element name="Address" type="xsd:string"/>

The first thing to note is that the schema is itself XML, thus eschewing
the need for another parser and system for interpreting the DTD language
syntax. The other important aspect of schemas is strong typing. In DTDs,
typing is restricted to EMPTY, ANY, element content, and mixed element
content and text. It is often useful to add other restrictions on the typing, like
enforcing integer constraints or limiting the string length. XML Schema allows
such generalized type specifications, as shown in Table 6:

Table 6. XML Schema example

<xsd:simpleType name="WeeklyWorkHours">
<xsd:restriction base="xsd:positiveInteger">
<xsd:minExclusive value="0"/>
<xsd:maxExclusive value="40"/>
</xsd:restriction>
</xsd:simpleType>

<xsd:attribute name="EmployeeTimeSheet"
type="WeeklyWorkHours"/>

While entities in DTDs enable global replacement of groups of elements,
they fall short when it comes to building larger "objects" from simpler ones.
ComplexType definitions in XML Schema allow one to do that, as shown in Table 7:

Table 7. ComplexType in XML Schema

<xsd:schema targetNamespace="http://www.example.com/IPO"
xmlns="http://www.w3.org/2001/XMLSchema"
xmlns:ipo="http://www.example.com/IPO">

<xsd:complexType name="FullName">
<xsd:sequence>
<xsd:element name="FirstName" type="xsd:string"/>
<xsd:element name="LastName" type="xsd:string"/>
</xsd:sequence>
</xsd:complexType>

<xsd:complexType name="EmployeeName">
<xsd:complexContent>
<xsd:extension base="ipo:FullName">
<xsd:sequence>
<xsd:element name="MiddleName" type="xsd:string"/>
</xsd:sequence>
</xsd:extension>
</xsd:complexContent>
</xsd:complexType>

XML Schemas also support modularity, in the sense that the XML
schema definition can control whether an archetype can be 'exported' to
other schemas. In the previous example, we derived EmployeeName
from FullName and added the element MiddleName. Thus, we can see that
there is an "object-oriented" ability to "inherit" structural representations.
XML Schemas also allow the grouping of attributes.

Elements in XML Schema can also have additional constraints, ranging
from restrictions on the number of occurrences to the manner or sequence
in which the child elements can occur. For instance, the sequence tag indicates that
all the child elements must occur in that order. Other options include the any
or anyAttribute tags, which allow any of the elements or attributes, and the
choice element, which allows selection of one of the elements. Elements can be
defined once and referenced in other places for inclusion within more complex
structures. An example where the element Age is defined and referenced for
inclusion in the element CustomerRecord is shown below in Table 8:

Table 8. Example illustrating reference features in XML Schema

<xsd:element name="Age" type="xsd:integer"/>
<xsd:element name="CustomerRecord">
<xsd:complexType>
<xsd:sequence>
<xsd:element name="Name" type="xsd:string"/>
<xsd:element ref="Age"/>
</xsd:sequence>
</xsd:complexType>
</xsd:element>

Furthermore, since schemas are themselves XML, schemas can be transformed
from one XML format to another using the XML transformation techniques discussed
in later sections. Thus we can see that XML Schema is a powerful
method of defining document structures in XML, giving greater
flexibility and functionality than DTDs.
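As a sketch of schema validation in practice, the widely used third-party lxml library for Python exposes XML Schema support (the library choice and the tiny one-element schema are illustrative assumptions, not from the text):

from lxml import etree

schema_doc = etree.XML("""
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <xsd:element name="Age" type="xsd:integer"/>
</xsd:schema>""")
schema = etree.XMLSchema(schema_doc)

print(schema.validate(etree.XML("<Age>42</Age>")))         # True
print(schema.validate(etree.XML("<Age>forty-two</Age>")))  # False: not an integer

The strong typing is visible here: the same well-formed markup passes or fails depending only on whether its content matches the declared type.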

3.6 Well-formed Documents

An important notion in XML is that of a "well-formed" document. XML
documents are said to be well-formed if the start and end tags in the
document match, and they are properly nested.

<EmployeeRecord>
<Address>
1234, Main street
</Address>
</EmployeeRecord>

The above XML snippet is well-formed since the start and end tags
match and they are properly nested. An example of improperly nested
XML is shown below:

<EmployeeRecord>
<Address>
1234, Main Street

</EmployeeRecord>
</Address>

Another case is one where the start and end tags are simply mismatched:
<EmployeeRecord>
</Address>

The following is also not well-formed since XML tag names are case
sensitive:
<ADDRESS>
</address>
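A parser can be used to check well-formedness mechanically. The sketch below (Python's standard library, used here only for illustration) feeds the three kinds of example above to a parser and reports the outcome:

import xml.etree.ElementTree as ET

snippets = ['<A><B>ok</B></A>',            # well-formed
            '<A><B>bad nesting</A></B>',   # improperly nested
            '<ADDRESS></address>']         # case-mismatched tags
for s in snippets:
    try:
        ET.fromstring(s)
        print("well-formed:    ", s)
    except ET.ParseError as err:
        print("not well-formed:", s, "--", err)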

3.7 Valid Document

It is easy to construct a well-formed XML document that does not conform to the
DTD associated with it. A simple example is an XML
document that uses tags that are not even defined in the DTD/Schema. Such
XML documents may be well-formed, but they are not "valid" documents in relation
to the document definition associated with them. This is
because the DTD/Schema specifies what elements are allowed in the XML
page, what the attributes for each element are, and additional syntactic
constraints.
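Validity checking requires a validating parser that has the DTD at hand. Here is a minimal sketch using the third-party lxml library for Python (an assumed dependency, chosen purely for illustration), validating one conforming and one non-conforming document against the EmployeeRecord DTD:

from io import StringIO
from lxml import etree

dtd = etree.DTD(StringIO("""
<!ELEMENT EmployeeRecord (Name | Degree | Address)+>
<!ELEMENT Name (#PCDATA)>
<!ELEMENT Degree (#PCDATA)>
<!ELEMENT Address (#PCDATA)>
"""))

ok  = etree.XML("<EmployeeRecord><Name>Albert Tan</Name></EmployeeRecord>")
bad = etree.XML("<EmployeeRecord><Salary>100</Salary></EmployeeRecord>")
print(dtd.validate(ok))    # True
print(dtd.validate(bad))   # False: Salary is not declared in the DTD

Both documents are well-formed; only the first is valid.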

3.8 XML Namespaces

What happens when we merge two XML documents whose elements are
named the same? XML namespaces are defined to address this problem of
replicated naming in XML documents. For instance, the element 'table'
might have the elements 'length' and 'width' embedded within it in a furniture
XML Schema, whereas the element 'table' might have the elements 'stock' and
'value' embedded within it in a financial-data XML Schema. If we merge the
two XML documents into one, then how do we parse and validate the
merged document? We can associate a namespace with each individual
element. For instance, for the furniture table element, we can have:

<table xmlns="http://www.example.com/furniture">
<length> 4 feet </length>
<width> 5 feet </width>
</table>

and the table for the financial list could be represented as:

<table xmlns="http://www.example.com/finance">
<stock> IBM </stock>
<value> 150 </value>
</table>

Note that both elements are named "table", but the namespace associated with
each element distinguishes the element definition. An alternate
representation is to prefix the namespace as follows:

<f:table xmlns:f="http://www.example.com/finance">
<f:stock> IBM </f:stock>
<f:value> 150 </f:value>
</f:table>

4. EXTENSIBLE STYLE SHEETS

4.1 What are Style Sheets?

In earlier sections, we presented the notion of "presentation". Presentation refers to the rendering of XML documents on devices such as displays and printers. XML allows structuring of data. The eXtensible Stylesheet Markup Language (XSL) is a language defined in XML that enables the presentation or transformation of XML documents from one XML structure to another.

The XML content for an EmployeeRecord is shown below:

<EmployeeRecord>
<Name> Albert Tan </Name>
<Address> 1234, Main Street, New York, New York </Address>
</EmployeeRecord>

When viewed on a display, the above information may need to be rendered as follows:

Albert Tan

1234, Main Street, New York, New York



The name information is displayed in bold font. The address information is presented in italics font. This information can be represented as:

<Bold> Albert Tan </Bold>

<Italics> 1234, Main Street, New York, New York </Italics>

The display software (e.g. editor, browser) interprets the above tags and renders them appropriately. Thus, in the above example, the name "Albert Tan" will automatically be presented in bold since it is embedded in the "Bold" start and end tags. If another device or viewing software interprets a different set of tags, then the same EmployeeRecord XML document can be converted appropriately.

Can we take the original XML EmployeeRecord document and apply the rules of rendering to automatically generate the XML representation with the font directives? That's exactly what the eXtensible Stylesheet Markup Language (XSL) enables us to do. XSL is a language defined in XML that allows the description of how XML documents can be transformed from one definition to another.

4.2 XSL Architecture

A single XML representation of data can be rendered in a variety of ways by transforming from one XML format to another. This powerful generalization allows sharing a unified, common abstract representation of data, while enabling different styles of presentation. This is schematically shown in Figure 5, where a single XML data source is presented differently using two style sheets (A & B).

Figure 5. Illustration of XSL: the XML document <Name> Albert </Name> is rendered in two different styles via Style Sheet A and Style Sheet B.

An XML document uses the xml-stylesheet directive to indicate the stylesheet to be used while rendering it. We can rewrite the XML document in Table 3 as shown in Table 9. The additional line is the xml-stylesheet tag, which specifies which style sheet to use. The presentation style is specified in the stylesheet file example.xsl (shown in Table 10).

Table 9. Illustration of XSL specification in an XML document.


<?xml version="1.0"?>
<!DOCTYPE EmployeeRecord SYSTEM "ER.dtd">

<!-- XSL stylesheet specified below -->
<?xml-stylesheet type="text/xsl" href="example.xsl"?>

<EmployeeRecord>
<Name> Albert Tan </Name>
<Degree> M.D. </Degree>
<Address> 1234, Main Street, New York, New York </Address>
</EmployeeRecord>

Table 10. Example stylesheet definition.


<?xml version="1.0"?>

<!-- Define this document to be a stylesheet and the namespace -->
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

<!-- Apply stylesheet to whole document -->
<xsl:template match="/">

<!-- Loop over each EmployeeRecord element -->
<xsl:for-each select="EmployeeRecord">

<!-- Indicate that the following should be in bold font -->
<Bold>
<!-- Insert the data of the Name element -->
<xsl:value-of select="Name"/>
</Bold>

</xsl:for-each>

<!-- End of template tag -->
</xsl:template>

<!-- Ending stylesheet tag -->
</xsl:stylesheet>

The xsl:stylesheet element indicates that the document shown in Table 10 is a stylesheet (also called a template). The xmlns attribute specifies the namespace. Next, we note that the match attribute is set to "/" in the xsl:template element, which indicates that the stylesheet transformations have to be applied to the whole XML document.

Next, the xsl:for-each element allows looping over all the selected "EmployeeRecord" elements in the XML document to which this stylesheet is applied. Note that the style specification is embedded in this stylesheet itself: the <Bold> and </Bold> tags are the tags of the XML document into which we want the XML in Table 9 to be transformed. Lastly, embedded within the <Bold> and </Bold> tags is the selection of the actual value of the Name element of the selected EmployeeRecord.

When the template in Table 10 is applied to the XML document of Table 9, the resulting XML document is as shown in Table 11 below.

Table 11. Result XML document when stylesheet in Table 10 is applied to XML in Table 9.

<Bold>
Albert Tan
</Bold>

Another way of looking at the source XML document in Table 9 and the
resultant XML document in Table 11 is shown in Figure 6. The source XML
document has a root element EmployeeRecord which can contain the Name
and Address element blocks. Contained within the Name elements is the
PCDATA "Albert Tan". The transformed XML document has a root element
"Bold" which contains the PCDATA that is contained under the Name
element in the original XML document, "Albert Tan" in this case. It is
important to observe that the interpretation or rendering of the XML elements themselves is inconsequential to the transformation process. If another
XML language is required to present the document in another style, then
another stylesheet can be applied to the original XML document to derive
the new transformed XML document.

Figure 6. Transforming one XML structure to another XML structure: the content "Albert Tan" under the Name element of the source EmployeeRecord becomes the content of the Bold element in the result.



4.3 Elements of XSL

The transformation from the source to the output XML can be done using XSL. Since XSL is a stylesheet language that is itself defined in XML, the transformation can be specified using the elements and attributes that are allowed in XSL. In this section, the elements of XSL version 1.0 are outlined.

4.3.1 Root element: xsl:stylesheet

The xsl:stylesheet element indicates that the document is an XSL stylesheet or template definition. The attributes for this element include id, version and the namespace. The set of templates in this stylesheet is applied to transform the original XML to the output XML. The elements that can be contained in the xsl:stylesheet include xsl:import, xsl:include, xsl:strip-space, xsl:preserve-space, xsl:output, xsl:key, xsl:decimal-format, xsl:namespace-alias, xsl:attribute-set, xsl:variable, xsl:param, and xsl:template. These elements are described in the following subsections. Elements that are directly contained in the xsl:stylesheet element are termed top-level elements.

4.3.2 Templates and Patterns

• xsl:template

At the heart of the XSL framework is the notion of templates and patterns. A template allows the matching of patterns or rules in order to transform from the source XML to the destination XML. The example in Table 10 showed how an xsl:template can be defined to extract the data from certain elements of the original XML document and embed them within tags defined in the output document namespace, in order to generate the resultant XML shown in Table 11. The match attribute of the xsl:template identifies the portions of the source XML document to which the transformations need to be applied.

Any XML document can be represented in the form of a tree. Each node in the tree can be specified using a set of conditions that identify those nodes. For example, a simple pattern is to match any element using the pattern expression "*" as in:

<xsl:template match="*">

</xsl:template>

Another example is the pattern "EmployeeRecord/Name", which selects any Name element with a parent EmployeeRecord element.

• xsl:apply-templates

The required template can be applied to all the children of the selected elements using xsl:apply-templates. The xsl:apply-templates element has the attributes 'select' and 'mode'. If the 'select' attribute is not specified, then all the children of the current node are processed.

In the following example, all children of the 'Name' elements under 'EmployeeRecord' are processed with the template:

<xsl:template match="EmployeeRecord/Name">
<Bold>
<xsl:apply-templates/>
</Bold>
</xsl:template>

In the following example, the template is applied only to the 'FirstName' element of the Name element contained in EmployeeRecord:

<xsl:template match="EmployeeRecord/Name">
<Bold>
<xsl:apply-templates select="FirstName"/>
</Bold>
</xsl:template>

• xsl:value-of

The PCDATA contained in a specific element can be extracted and inserted at a specific location in a template using the xsl:value-of tag. In Table 10, the xsl:value-of tag is used to extract the data in the Name tag, and this data is embedded within the bold tags in the template.

<xsl:value-of select="Name"/>

• xsl:key

With the xsl:key tag, implicit cross-referencing can be achieved within documents. The xsl:key element allows the definition of a key and value associated with a particular element.
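
A minimal sketch (the key name and lookup value here are illustrative): a key is declared at the top level of the stylesheet and later dereferenced with the key() function:

<xsl:key name="emp-by-name" match="EmployeeRecord" use="Name"/>

<xsl:template match="/">
<xsl:value-of select="key('emp-by-name', 'Albert Tan')/Address"/>
</xsl:template>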

4.3.3 Looping, conditionals and sorting

• xsl:for-each

Looping over nodes that meet given criteria can be done using the xsl:for-each XSL tag. The criteria for the nodes are specified by the expression in the select attribute of the xsl:for-each tag.

<xsl:for-each select="EmployeeRecord/Name">
<Bold>
<xsl:apply-templates/>
</Bold>
</xsl:for-each>

The xsl:for-each in the above example enables looping over the nodes of
the XML tree that match the select criteria, i.e. are 'Name' elements that are
direct children of 'EmployeeRecord' elements.

• xsl:if

Conditional logic can be applied using the xsl:if element. The condition
to be checked is specified in the 'test' attribute of the xsl:if element.
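
As a minimal sketch (assuming the EmployeeRecord structure of Table 9), the following emphasizes the name only when a Degree element is present:

<xsl:template match="EmployeeRecord">
<xsl:if test="Degree">
<Bold> <xsl:value-of select="Name"/> </Bold>
</xsl:if>
</xsl:template>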

• xsl:choose

xsl:choose enables the selection of a single specific item among a set of possibilities. The elements xsl:when and xsl:otherwise can be contained in the xsl:choose element. Each of the xsl:when elements has a 'test' attribute that determines whether that xsl:when element is chosen. If none of the xsl:when elements is matched, then the xsl:otherwise is processed.

<xsl:choose>
<xsl:when test="conditionl">

</xsl:when>
<xsl:when test="condition2">

</xsl:when>

<xsl:otherwise>

</xsl:otherwise>
</xsl:choose>

• xsl:sort

Another powerful element in XSL is xsl:sort. This element allows sorting of nodes based on the specified primary and secondary keys. xsl:sort has attributes for specifying the order, language, data-type, and case-order.

<xsl:template match="/">
<xsl:apply-templates select="EmployeeRecord">
<xsl:sort select="Name"/>
</xsl:apply-templates>
</xsl:template>

4.3.4 Including stylesheets

• xsl:import
An XSL stylesheet can import another XSL stylesheet using the xsl:import element. The xsl:import element has an href attribute that specifies the URI of the stylesheet to import into the current stylesheet. Elements contained in the xsl:stylesheet need not be in any order, except for xsl:import, which must precede all other elements under the xsl:stylesheet element.

• xsl:include
It is also possible to include external stylesheets using the xsl:include element. The href attribute of the xsl:include specifies the URI of the external stylesheet to include. One may recall that xsl:import is also a mechanism for importing external stylesheets into the including stylesheet. The difference between include and import is as follows: when a stylesheet is included, its top-level elements are treated as if they appeared directly in the including stylesheet. Stylesheets that are imported using xsl:import, and the elements they contain, have lower precedence than those of the importing stylesheet.

4.3.5 Output Formatting

• xsl:strip-space
xsl:strip-space enables the specification of those elements for which
whitespace needs to be stripped. The element list is specified in the attribute
"elements".

<xsl:strip-space elements="Name"/>

This will strip the whitespace in the PCDATA sections of the element
"Name".

• xsl:preserve-space
xsl:preserve-space is the opposite of xsl:strip-space. xsl:preserve-space allows the specification of those elements for which the whitespace should be preserved. The element list is specified in the attribute "elements".

Example:
<xsl:preserve-space elements="Name"/>

• xsl:output
With the xsl:output tag, the format of the transformed XML result is specified. Attributes of xsl:output include method, version, encoding, omit-xml-declaration, standalone, doctype-public, doctype-system, cdata-section-elements, indent and media-type.

For instance, the output method can be "xml" or "text". The default output method is "xml", except when the root node of the result document has an element child whose expanded name is "html", and any text preceding that first element contains only whitespace; in that case, the default method is "html".

<xsl:output method="xml" indent="yes"/>

The cdata-section-elements attribute lists those elements whose data should be emitted as CDATA sections. It may be recalled (from earlier sections) that CDATA denotes character data that will not be processed by the XML parser.

If we have the following definition:

<xsl:output cdata-section-elements="Name"/>

When the above output directive is applied to the following XML:

<Name> Laurel &amp; Hardy </Name>

then the output XML will be:

<Name> <![CDATA[Laurel & Hardy]]> </Name>

The indent attribute specifies whether the output XML should be indented, depending on whether the value of the attribute is set to "yes" or "no".

4.3.6 Defining variables, elements and attributes

• xsl:variable

Variables can be created using the xsl:variable element. The attributes of xsl:variable include name and select.

<xsl:variable name="test"/>

The variable can be defined as a top-level variable, or defined within a template. In the latter case, the scope of the defined variable is limited to the scope of its parent element.
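
As a sketch (the variable name and value are illustrative), a value can be assigned through the select attribute and referenced later with the $ prefix:

<xsl:variable name="heading" select="'Employee List'"/>
<xsl:value-of select="$heading"/>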

• xsl:element
Elements can be added to the output using xsl:element. The attributes of xsl:element include name, namespace and use-attribute-sets. The required 'name' attribute specifies the computed name of the element.

• xsl:attribute

The xsl:attribute element allows the specification of attributes associated with literal elements or elements added using the xsl:element tag.

<xsl:element name="test">
<xsl:attribute name="a"> abc </xsl:attribute>
</xsl:element>

The xsl:attribute-set element allows a named set of attributes to be defined and reused.
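
For instance (a sketch; the set name and attribute are illustrative), a named attribute set can be defined once with xsl:attribute-set and applied through the use-attribute-sets attribute:

<xsl:attribute-set name="emphasis">
<xsl:attribute name="class">bold</xsl:attribute>
</xsl:attribute-set>

<xsl:element name="test" use-attribute-sets="emphasis"/>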

• xsl:text

Text nodes can be added using the xsl:text element. This allows the
specification of PCDATA to be enclosed within the xsl:text element.

<xsl:text>
Laurel &amp; Hardy
</xsl:text>

• xsl:processing-instruction

Processing instructions can be created using the xsl:processing-instruction element. The required 'name' attribute specifies the name of the processing instruction node.

<xsl:processing-instruction name="xml-stylesheet">href="test.css"</xsl:processing-instruction>

will create the following processing instruction:

<?xml-stylesheet href="test.css"?>

5. XPATH

In our description of XSLT in the previous section, the portions of the XML to which the XSLT transformations or patterns were applied were determined by the 'select' attribute. For instance, in Table 10, we have select="Name", which applied the pattern to the block in the element 'Name'. XPath is the language for specifying the path for finding and specifying information in XML documents.

In XSL, the context of the path starts with the document currently being processed. Thus "/" indicates the root node of the document. Some examples of path expressions are summarized in Table 12.

Table 12. Example of XPath expressions


Expression    Meaning
E             Selects all "E" child elements of the current node
/E            Selects the "E" element starting at the root node
//E           Selects all "E" element nodes anywhere in the document
S//E          Selects all "E" element nodes that are under "S"
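
For instance, applied to the XML document of Table 9, the expression "//Name" selects the Name element anywhere in the document, so the following (a sketch) inserts the employee name into the output:

<xsl:value-of select="//Name"/>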

6. HYPERTEXT MARKUP LANGUAGE (HTML)

6.1 HTML and the Web


The ingenuity of Tim Berners-Lee in deriving a language from SGML
and extending it to incorporate the connectivity available through the
Internet gave birth to the popular HTML as a means of representing
information on the World Wide Web. The key component of the Hyper-Text
Markup Language is the definition of hyper-text links that allow the
transition from one hyper-text document to another. This is referred to as the
anchor tag. A protocol for transfer of hyper-text documents on the Internet
was defined and called "Hyper-Text Transfer Protocol (HTTP)", and is
discussed in a later chapter.

The Hyper-Text Transfer Protocol (HTTP) grew out of a project at CERN in 1989. The goal of the project was easy access to information on particle physics research distributed across the network of groups working on the project. By 1990, HTML was a very popular mechanism for transferring information across the Web. Many institutions, including the NCSA at the University of Illinois, Urbana-Champaign, worked on projects related to HTML and browsers that could render HTML documents, leading to the development of the Mosaic browser.

6.2 Hypertext Links

Hypertext links allow the specification of other documents to transition to, upon the selection of a specific subset of words (for instance by clicking with a mouse) from the current document. This is best illustrated with a simple example:

The current issue of Garden Magazine discusses the issue of garden pests.

The text "current issue of Garden Magazine" is anchor text that specifies
a transition to the current (hypertext) issue of Garden Magazine. This is
displayed to the user by highlighting in some distinguishing manner, such as
underlining as shown above. When the user clicks on this text, the next
document is automatically retrieved and displayed to the user by the
browser. This is specified using the anchor tag, named "A" for Anchor. For instance, the example shown above can be written using the
anchor tag as follows:

The <a href="http://www.garden.com/gardenmag.html"> current issue of Garden Magazine </a> discusses the issue of garden pests.

The HREF attribute specifies a Uniform Resource Locator (URL), which is an addressing scheme for identifying resources on the World Wide Web. A part of the URL determines which machine on the Internet to request this Web page from. Such machines are called "Web servers". Domain names, Web servers, URIs and URLs are discussed further in the next chapter.

6.3 Elements of HTML

HTML is a well-specified, well-documented language, and we will try to give an overview in this section. The reader is referred to the "Further Reading" section for more information. Some of the key elements of HTML 4 are summarized below.

• HTML tag

Every HTML document should start with a DOCTYPE declaration that specifies what version of HTML is being used. For instance:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0//EN"
"http://www.w3.org/TR/html4/strict.dtd">

The root HTML tag is <html>; thus every HTML document must be embedded within the <html> and </html> tags.

• Head and Body

The HTML document can be divided into the head and body parts. The
HEAD tag defines the head portion of the document and includes elements
such as TITLE (to specify the document title), BASE (URL base for
resolving relative addresses), SCRIPT (used for scripting), and other information, such as meta information used for purposes like search engine indexing. The second part is the body of the HTML page and
is embedded within the BODY tags. An example of the HEAD and BODY
tags is shown below:

<html>

<head>
<title> Example HTML Page </title>
<meta name="description" content="This is a test page">
</head>

<body>
Hello World
</body>

</html>

• Paragraph, breaks, and headings

One of the most common needs in document markup is to break up text into paragraphs. (All the elements in this subsection and the following subsections are pertinent only to elements that are children of the BODY element.) The paragraph tag (P) allows us to do this. Alignment of text in the

paragraph can be specified using the ALIGN attribute for the paragraph
element (left, right, or center):

<body>
<p align=center> Hello World
</body>

Breaks may be added into the document by using the <br> tag. Heading elements are specified using the H tags: H1 refers to heading level 1, H2 to heading level 2, and so on. Headings are automatically rendered in appropriate default fonts and sizes, but can be modified using style sheets (see the later part of this chapter). An example of two headings:

<h1> Major Heading </h1>

<h2> Minor Heading </h2>

• Character font and sizing

Numerous style and text-level formatting elements are available in HTML. Examples of font style elements include boldface type (B tag), italics (I tag), strike, small, big, subscript, underlining tags and so on. An example of how these tags can be used is shown below:

<p> Hello <b> World </b>

This will be displayed by the browser as:

Hello World

Tags such as EM, STRONG, CITE, and CODE allow phrase-level markup. EM and STRONG place emphasis on the embedded text, CITE is used for citing references, and CODE is used to embed computer code fragments. Often these are automatically formatted while rendering, but the primary intention of such tags is to serve specific purposes.

Text-level elements such as FONT allow more control over the font size, color and typeface of the text. Style sheets (discussed later) are preferable to explicitly controlling text fonts and sizes in the document. BASEFONT enables setting a particular font size, and the FONT tag can be used to set a particular font, size and color. For example,

Hello <font size=15> World </font>



This will be rendered as:

Hello World
• Links

In our earlier example, we showed how the anchor tag (A) is used to associate portions of text with specific URL transitions. The HREF attribute refers to the destination document. An example of anchoring text:

Click on <a href="http://www.w3c.org"> W3C </a> if you want to go to the World Wide Web Consortium page.

The text which is embedded inside the A start and end tags now has special significance. If that text is clicked, then the browser transitions to the page specified by the HREF attribute value, in this case http://www.w3c.org. The embedded text is generally underlined or highlighted in some manner in order to distinguish the anchor text from other text. The hypertext link can also refer to points within the same HTML document. This is achieved by having labelled entry points in the HTML page, and referencing those label points in the HREF attribute value, as shown below:

<a href="#NextSection"> Click here </a> to go to the next section.

<p id="NextSection"> Welcome to the next section </p>

Another mechanism for achieving this is to define the label "NextSection" using <a name="NextSection"> Welcome to the next section </a>. In either case, upon clicking on the "Click here" text, the browser skips to the region of the text "Welcome to the next section".

Another important concept is the notion of base URLs and relative URLs. This is covered in later chapters, but it is meaningful to introduce the concepts here since they are relevant to the development of HTML pages. The base URL refers to the base address name. Thus "http://www.example.com/" is the base URL when a page such as "http://www.example.com/home.html" is served by a machine on the network. If we want to have a link that gets the page next.html from the same machine, then instead of specifying the whole path "http://www.example.com/next.html", we can just specify the relative difference in the address, i.e. "next.html". This concept of the naming of a URL and what it means will become clearer in the next chapter.
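
As a sketch, assuming the current page was served as http://www.example.com/home.html, the relative link below resolves against that base:

<a href="next.html"> Next page </a>

and the browser requests http://www.example.com/next.html when it is clicked.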

• Tables

The TABLE element enables the creation of tables in HTML pages. Rows are specified using the table row (TR) tag, data cells within each row are specified by the table data (TD) tags, and table headers are specified using the TH tags. Text within each table data cell can be aligned, emphasized, and formatted by embedding it in appropriate elements. The TABLE element has the attributes FRAME, BORDER, BORDERCOLOR and so on, that enable formatting of the table. An example of a table in HTML is shown below:

<table>
<thead>
<tr> <th> Employee ID </th> <th> Gender </th> </tr>
</thead>
<tbody>
<tr> <td> 123456 </td> <td> Male </td> </tr>
<tr> <td> 99999 </td> <td> Female </td> </tr>
</tbody>
</table>

This will be rendered by the browser as follows:

Table 13. Example of HTML table rendering.

Employee ID    Gender
123456         Male
99999          Female

• Images

Images can be embedded in HTML pages using the IMG tag. An example of an image tag is as follows:

<body>
<p> Enjoy the following picture <br>
<img src="dog.gif">
</body>

In this example, the image referred to by the name "dog.gif" is retrieved and
displayed following the text "Enjoy the following picture". Images are
inlined when embedded using the IMG tag. Other attributes that control the
position, size and border of the images include the ALIGN (to control
alignment of text with respect to the image), WIDTH, HEIGHT (to specify
the size of the image), and BORDER (to specify border of image) attributes.

• Forms

Forms allow the collection of input from the user and enable the submission of that input back to the Web server. An example of a simple form is shown below:

<form action="processrequest">
<p> phone number:
<input name="phone" size="10">
<p> <input type=submit>
</form>

The above HTML snippet is displayed as shown in Figure 7.

Figure 7. Example of HTML form output: a "phone number" text field followed by a Submit button.

When the user enters the 10 digit phone number and clicks on the submit button, the data is submitted back to the server, along with the "processrequest" action being requested. The INPUT tag specifies various types of input such as text fields, checkboxes (where you can check or uncheck an input selection), and radio buttons (where only one can be selected). The SUBMIT type generates a submit button that can be pressed to submit the form data to the server. The form's method attribute indicates what protocol to use for the submission (such as GET and POST, discussed in the next chapter). Other features of form fields include hidden fields, which are

not displayed to the user, and password fields where the characters are
replaced by * when echoed back to the user.
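
A sketch of these input types (the field names are illustrative, not from the earlier example):

<form action="processrequest" method="post">
<p> <input type="checkbox" name="subscribe"> Subscribe to updates
<p> <input type="radio" name="size" value="small"> Small
<input type="radio" name="size" value="large"> Large
<p> <input type="password" name="pin" size="4">
<input type="hidden" name="session" value="abc123">
<p> <input type="submit">
</form>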

6.4 Style Sheets

In the earlier section, we discussed XSL in detail, and mentioned that the styling of XML elements can be controlled using style sheets. Cascading style sheets (CSS) were originally introduced to simplify the styling of HTML documents. In an earlier example on controlling text fonts, we illustrated how text styles can be controlled individually. Often it is useful to separate the presentation of the text from the actual content, as discussed in earlier sections. Style sheets achieve this separation by allowing the association of specific text styles with individual element tags. For instance, if we want to associate the header H1 with rendering using a font size of 20 and a color of black, then we can add the following style to the HTML document:

<style type="text/css">
h1 { color: black; font-size: 20px; font-weight: bold; }
</style>

The HTML page can then use <H1> without embedding any font or style specifications. Style sheets can also be included from external sources by using the link tag:

<link rel="stylesheet" href="thisstyle.css">

Many other features are supported by style sheets, including background color, default body text size, font, color, page margins, indents and whitespace.
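
For example, a small external stylesheet (a sketch that could be saved as the thisstyle.css referenced above) exercising some of these features might read:

body { background-color: white; color: black; font-size: 12pt; margin-left: 1in; }
p { text-indent: 2em; }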

6.5 Relation to XML

Is HTML an XML language? During its initial phases of development, HTML was derived from SGML, and was not an XML language. Thus, it did not have a document type definition or an XML schema with which to verify the validity of an HTML document. This resulted in ad hoc and poor usage of HTML tags without a clearly enforced document tag structure.

But later versions of HTML converged towards an XML-like structure, so that HTML versions can be specified in an XML syntax, namely specification using a DTD or schema. In particular, XHTML is a standard recommended by

the W3C, and specifies an XML version of HTML. Standard HTML can be cleaned and converted into XHTML using tools such as Raggett's HTML Tidy. The key enhancements in XHTML are those present in XML: the notion of well-formed documents, enforcement of end tags for non-empty tags, lower-casing of tags and attributes, the requirement that attribute values be quoted, the requirement that empty elements have a start and end tag or an ending />, and so on.
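
As a sketch of these rules, a loosely written HTML 4 fragment and its XHTML equivalent might look as follows:

<P ALIGN=center>Hello<BR>

<p align="center">Hello<br/></p>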

6.6 Scripting

Another important part of HTML is the ability to include scripting. The HTML tags we have discussed so far do not change the displayed HTML content dynamically. Scripting allows the dynamic generation of HTML documents. What do we mean by dynamic? A script can be thought of as a simple computer program that can generate HTML code, thus allowing the possibility of generating different Web pages on the fly depending on dynamic factors like the user's input selection. Scripting can be done on the client side or the server side. We will briefly illustrate client-side scripting in this section.

In HTML, the SCRIPT tag indicates to the browser that the embedded content needs to be executed. An example of a scripting language is JavaScript, as shown below:

<html>
<head>
<script language="JavaScript">
<!-- Example script
document.write("<BODY> <H1> Welcome to the test page </H1> </BODY>");
-->
</script>
</head>
</html>

Once the browser processes this page, it executes the script on the client
side, and generates the page:

<html>
<head> </head>
<body> <h1> Welcome to the test page </h1> </body>
</html>

Scripting is especially useful for checking form input validity, and for event handling, such as handling form submission, mouse movement over a button, or a button click.
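
A minimal sketch of such client-side validation (the function and field names are illustrative), tied to the phone-number form shown earlier:

<form action="processrequest" onsubmit="return checkPhone(this)">
<p> phone number: <input name="phone" size="10">
<p> <input type=submit>
</form>

<script language="JavaScript">
<!-- Reject submission unless the phone field holds exactly 10 characters
function checkPhone(f) {
  if (f.phone.value.length != 10) {
    alert("Please enter a 10 digit phone number");
    return false;
  }
  return true;
}
-->
</script>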

7. CONCLUSION

In this chapter, we have motivated the need for data representation and
transformation. This enables the separation of document content from
presentation. XML defines a standard framework with which general data
markup languages can be defined. The XSL transformation provides a
general approach for the transformation of documents from one XML
representation to another. The Hyper Text Markup Language is a text
markup language that allows transitioning from one document to another
through links. The key aspect of XML is the structured representation of
documents using Document Type Definition or Schema, which can then be
used to validate the XML document. Key elements of HTML are discussed
with examples. Data markup is an integral part of the Web since a vast
amount of information needs to be communicated across the Web from
business to consumer or business to business.

FURTHER READING

The World Wide Web Consortium is a useful source of standards relating to XML (http://www.w3c.org/xml). In addition, many books such as [Mercer 2001] detail the XML concepts discussed in this chapter. Hyper Text Markup Language (HTML 4) is well covered with examples in [Raggett et al 1998]. XSLT is covered extensively in [Kay 2001], and Cascading Style Sheets are discussed in detail in [Meyer 2000]. XHTML is extensively covered in [Boumphrey et al 2000].

EXERCISES

1. What are the three main advantages of eXtensible Style Sheets?

2. What are the components of an XML system?

3. Detail the differences between data, representation, presentation and transformation with examples.

4. Which of the following are well-formed XML?


a. <FRIEND> John Smith </friend>
b. <Friend>
<FirstName> John </FirstName>
<LastName> Smith </LastName>
</Friend>
c. <Friend> <Name> John Smith </Friend> </Name>

5. Write a DTD that constitutes the following hierarchy:

• Each Car Dealership has many brands of cars
• Each brand of car has a name and model type
• Model type has a date field and a model name field.
• Each brand of car has an attribute that specifies whether it is a "used" car or a "new" car.
a. Draw a structural graph that represents this hierarchy.
b. Write the DTD for this structure.
c. Write an XML document that conforms to the DTD defined in (b) and has at least three car types at two dealerships.

6. Convert the DTD in (5) into a schema.

7. An online car vendor processes car listings in the following structure:

a. Name of dealer.
b. Car brand name and type
c. Price and count of available cars

Write a DTD to reflect this document structure.

8. With examples, demonstrate how a style sheet can transform the data specified by the DTD in (5) into data in the XML form defined in (7).

9. Write an HTML page that displays the data available for the online vendor in (7).

Chapter 3
Networking

To the blind pen the hand that writes is unreal, its writing unmeaning

Abstract: What are the underlying protocols used by the Internet? The Internet Protocol addressing scheme for representing endpoints is discussed, along with the TCP/IP protocol suite. The purpose of distributed name servers, and the notions of URL/URI, are presented. The notion of Web servers, Web proxies, and clients, and how they communicate with each other using the HyperText Transfer Protocol, is discussed. Secure communication is an important aspect of communication on the Web, and Secure Sockets Layer, Public Key Infrastructure, and certificate authority notions are illustrated.

Keywords: Internet Protocol, IP, DNS, TCP/IP, URI, URL, HTTP, Web servers, clients, Web proxy, SSL, HTTPS, digital certificates, Public Key Infrastructure (PKI).


1. INTRODUCTION

The Internet was born out of the ARPANET project, funded by the US Department of Defense to study the inter-linking of data packet technologies across a network of computers. This project was initiated in 1973, and the research from ARPANET evolved into today's Internet. Over the last three decades, many technologies and standards have made the Internet possible, such as the Transmission Control and Internet Protocols (TCP/IP). The power of the Internet was the ability to connect two different computers on a distributed network. Soon applications such as the File Transfer Protocol (FTP) and mail emerged and were adopted quickly. The flexibility of the Internet was the ability to develop arbitrary communication applications, which led to the emergence of the World Wide Web. The Internet Protocol (IP) names endpoints with four-byte location identifiers. The Distributed Name Server architecture enabled referencing endpoints using an alphanumeric naming scheme.

In this chapter, the Transmission Control Protocol and Internet Protocol (TCP/IP) suite is discussed. How end points are identified, and how IP addresses are mapped to domain name addresses, is discussed. Next, the Hyper Text Transfer Protocol (HTTP) that drove the growth of the Web is presented. How the Web is distributed across clients and servers, and how proxy servers can help build a distributed server base, is examined. Security of communication is covered, and the notions of public and private keys, certificate authorities and the Secure Sockets Layer (SSL) are detailed.

2. LAYERING OF NETWORKS

The mechanism of transmitting a data segment from one computer to another involves a number of steps, ranging from adding data packet identifiers, counts, and error checksums to adding routing information and source and destination addresses. This complex series of operations on the data is achieved by each component applying "transformations" to the data as it travels from the source to the destination computer. A common approach to handling this complexity is the decomposition of networking protocols into layers. The International Standards Organization's Open Systems Interconnect (ISO/OSI) model defines a set of seven layers:

• Layer 1: Physical
• Layer 2: Data Link
• Layer 3: Network
• Layer 4: Transport
• Layer 5: Session
• Layer 6: Presentation
• Layer 7: Application

The physical layer defines the cable or physical medium of transport. The
data link layer represents the actual format of the data on the network, such
as inclusion of checksum of payload data being transmitted, and source and
destination addresses. The physical and logical transport to the destination is
handled in the data link layer using a network interface. The Network layer is
another layer of abstraction above the data link layer, and protocols such as
Internet Protocol (IP) are popular standards used in this layer (discussed later
in this chapter). The transport layer maps the user-buffer to be transported
into "data packets" of network size. Additional transmission control logic is
implemented in this layer, and examples of protocols (such as TCP and UDP)
are also discussed in a later section. The data format sent over the connection
is specified in the session layer. Conversion of data from local to external
data formats is achieved in the presentation layer. Lastly, the application
layer provides services (such as mail access, file transfer) to end-users.

3. LOCATING ENDPOINTS

3.1 IP

Every interface point on the Internet is uniquely identified with its address, called the Internet address (or IP address). IP addresses are 32-bit addresses written as a sequence of four bytes separated by ".", such as 128.0.0.0. Internet addresses are separated into five classes: A, B, C, D, and E. Class A addresses are for large networks, class B for medium-sized networks, class C for small networks, class D for multicast addresses, and class E generally for experimental use. It is important to note that IP addresses can also be assigned in a dynamic manner. For instance, many services and corporate networks often economically use a pool of IP addresses that are dynamically assigned to users. Each transmission at the

network layer using the IP protocol requires the packaging of data into "IP
packets" or "IP datagrams". The IP datagram consists of a 20-byte basic
header followed by additional optional information (options), and the actual
data that is to be transmitted. This is summarized in Table 14. The 8-bit
protocol field specifies the protocol: a value of 17 is UDP, value of 6 is TCP
etc. The header checksum is a checksum computed on the header only. The
source and destination IP fields specify the source and destination addresses.

Table 14. IP Datagram

Version | Header Length | Type of Service | Total length of packet (in bytes)
16-bit identification | Flags | Fragment offset
Time to Live | Protocol | Header Checksum
Source IP Address (32-bit)
Destination IP Address (32-bit)
Options (if any)
DATA

An important aspect of the Internet Protocol (IP) is its lack of connection state and the absence of reliable delivery at the destination. In fact, the time to
live field determines how long the IP packet will be alive in the network. IP
packets are transmitted from source to destination using hardware called
routers. Routers use various algorithms to determine how to pass this IP
packet from source to destination. In the simplest case, the source and
destination are in the same network, and the IP datagram can be sent directly
from source to destination. If not in the same network, then the IP datagram
is sent to the nearest router to be forwarded. Each router has a table that
indicates where to route a packet if one is received. If the destination address
is the same as the current address, then the packet is delivered to the
appropriate subsystem. If not, it is routed to the appropriate address as
determined by the table. If there is no information on how the packet is to be
routed, then the packet is discarded. It is important to note that delivery of IP
packets can fail due to a number of reasons such as router failure, or
congestion. The checksum stored in the IP packet header is that of the IP
header: thus, any errors in the payload data will go undetected at the IP level.

The four-byte addresses defined about twenty years ago by IP (version 4) are not enough. With the rapid growth in IP address usage, it is predicted that we will run out of the IP addresses needed by more computers in the next few years. In order to overcome this limitation, new standards of IP addressing are being defined under the IP-next generation (IPng) initiative. The current standard is IPv6, which allows 128-bit IP addresses, an address length four times that of the 4-byte address space. In addition to the increased address space, IPv6 includes additional security by providing packet encryption and source authentication. IPv6 also supports auto-configuration modes, and is designed for more real-time applications such as video streaming.

3.2 DNS

It is often easier to reference endpoints using alphabetic names than numeric addresses. The Domain Name System (DNS) was developed for this purpose. DNS allows domains to be named using alphabetic strings, which then get mapped to numeric IP addresses, and provides email routing information. DNS stores this information in the form of a hierarchical, distributed database. A name server performs the task of maintaining and translating alphanumeric domain names to IP addresses.

The top-level domains of DNS include com, edu, gov, net and org. Each of these top-level domains is dedicated to specific organizations: edu covers educational institutions, gov covers U.S. governmental agencies, and com covers commercial organizations. The top-level domains are followed by secondary and higher-level domains. The domain name is derived by going from the high-level domains down to the top-level domain, for instance, cs.rochester.edu. Domain names that end with a period are called Fully Qualified Domain Names.

A zone is a sub-tree of DNS that is managed separately. For instance, all names under rochester.edu may be administered separately, and called a zone. Once the authority for a zone has been designated, it is the responsibility of the zone administrator to provide name servers for that zone. The name servers can be distributed and fault-tolerant by having a primary name server for that zone, and secondary name servers which get their information from the primary name server periodically.

When a name server does not contain information regarding a particular domain name address, it contacts another name server to retrieve that information. Thus, the DNS is a distributed system. Furthermore, DNS servers cache information received regarding previously requested domain names. Communication between domain servers takes the form of DNS messages that get sent back and forth to exchange information. This communication can use different protocols, but UDP is most commonly used (see the later section for a description of UDP).

A DNS message can include a set of questions or a set of answers. Each question takes the form of a query name (which is the name of the domain whose IP address is required), a query type and a class. The query type indicates what the query is for; for instance, type 'A' indicates the query is for an IP address. A query response or answer also includes a domain name, type and class. In addition, answers include a time-to-live, a resource data length, and the actual resource. Time-to-live indicates how long this information can be cached. The resource can be an IP address or other information depending on the type.

One of the drawbacks of the DNS is that it is intrinsically tied to the English alphabet, and was not designed with other international languages in mind. Furthermore, the convenience of using alphanumeric addresses is not without a computational cost. Each request requires that the domain name be mapped to the IP address. This is achieved by contacting the domain server that is the authority on the requested domain. Since subnets can be handled by different name servers, multiple name servers may need to be accessed before the IP address is retrieved. Repeated invocations of such mappings can significantly slow down access to Web documents. Caching of IP addresses at various name servers is generally done in order to speed this process, but the danger is that the cached value may hold an old, incorrect IP address for that domain (till the entry expires). Finally, the design inherently involves "information propagation delays" between name servers.

3.3 URL/URI

URI is an acronym for Uniform Resource Identifier, which is used to identify any resource on the World Wide Web. A URL, or Uniform Resource Locator, is a subset of URI that expresses an address which can be mapped to a network address. The URL can consist of an (optional) prefix, a scheme, a protocol part, optional user/password information, a domain name, a port number and a path. For instance, in the URL http://www.yahoo.com/home.html, the scheme is the Hypertext Transfer Protocol (http), a user name or password is not needed for this protocol, www.yahoo.com is the domain name, and /home.html is the path. Other schemes supported by the URL convention are the File Transfer Protocol (ftp), gopher, email, Usenet news, and telnet.

Ports are numbers that enable different applications on the same computer to communicate separately. Thus, ports can be viewed as entry points for logical connections. Port numbers generally range from 0 to 65535, and are divided into three types:

• Port numbers 0-1023 are well-known ports
• Port numbers 1024-49151 are registered ports
• The remaining port numbers are dynamic or private ports.

Some common examples of well-known port numbers are 80 (used for HTTP), 21 (used for FTP), 22 (for SSH), and 25 (for SMTP).

4. TRANSMISSION PROTOCOLS

4.1 TCP/IP Suite

TCP/IP, or the Internet protocol suite, is a protocol suite that allows computers connected in a computer network to communicate with each other. The TCP/IP suite spans different networking layers, including the application, transport, networking and link layers.

At the application layer, the TCP/IP suite has utilities such as telnet, ftp (file transfer protocol), SMTP (Simple Mail Transfer Protocol) for e-mail, and so on. For transport, UDP and TCP are two methods of transmitting data. At the network layer, IP is commonly used, while other protocols such as the Internet Control Message Protocol (ICMP) and IGMP (Internet Group Management Protocol) are also defined. At the link layer, a protocol such as ARP (Address Resolution Protocol) is defined to map IP addresses to the data link layer.

4.2 TCP

The Transmission Control Protocol (TCP) was designed to reliably transmit packets from source to destination. TCP requires that a connection be established between the client and server before data can be transmitted. TCP is reliable since it maintains packet delivery information. In order to verify that the transmitted packets are received at the receiver, TCP packets are numbered, and missing packets are detected at the receiver and retransmitted by the sender if no acknowledgement is received within a fixed interval of time. TCP was designed with the underlying goal of infrequent, reliable transmission of large blocks of data.

TCP packets are on top of IP, so TCP header and data information are
embedded within IP packets as shown below:

Table 15. IP Data

IP Header | TCP Data

A TCP segment contains the following:


• 16-bit source and destination port numbers
• 32-bit sequence number
• 32-bit acknowledgement number
• Header length
• 6 bit flags
• 16-bit window size
• 16-bit checksum
• 16-bit urgent pointer
• options and data (if any)

Each TCP segment contains the 16-bit source and destination port
numbers. Sequence numbers are also transmitted in every packet. At the
onset of a new connection, the SYN flag is set to 1 and the sequence number
field is used as the starting sequence number. On following transmissions,
the sequence number represents the byte number in the transmission of data
for this connection. Since TCP is full-duplex, the numbering is different for
the simultaneous flow in each direction of the connection. The
acknowledgement number field is used to specify the next sequence number that should be present in a valid acknowledgement. The checksum field is computed on the TCP header and data only. Options include other
information that the client and receiver want to exchange such as the
Maximum Segment Size that the receiver can handle.

The 6-bit flags are:

• URG - The urgent pointer is valid. This is used to send emergency data.
• ACK - The acknowledgement number is valid.
• PSH - Pass this data up as soon as possible.
• RST - Reset the connection.
• SYN - Synchronize sequence numbers.
• FIN - Finished sending data.

The first stage of TCP is a three-way handshake:

- The client sends a SYN TCP segment with the destination port number and its initial sequence number.
- The server sends its own SYN TCP packet, while sending an ACK with the client's SYN number + 1.
- The client now acknowledges this SYN with an ACK carrying the server's sequence number + 1.
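
Schematically, with x and y denoting the arbitrary initial sequence numbers chosen by each side (a sketch, not an exact packet trace):

Client                              Server
SYN, seq=x              -------->
                        <--------   SYN, seq=y, ACK=x+1
ACK=y+1                 -------->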

Since TCP is a full-duplex connection, it requires four TCP packets to close both ends of a TCP connection: one FIN each way, followed by the corresponding ACK. It can be noted that TCP provides a "half-close" connection where the connection is open only for the flow of data from one end to the other, and not vice-versa.

Various factors influence the performance of TCP:

a) Retransmission delay - The amount of time that the sender waits for an acknowledgement before retransmitting the packet. If this threshold of time has passed without an acknowledgement, then the packet is assumed to be lost. However, delays over the Internet are unpredictable, and this delay can vary across different subnetworks.
b) Socket interval - Sockets are typically unused for a period of time after they have been closed. Setting this interval (also termed the CLOSEWAIT_INTERVAL) to a low value would result in delayed packets from the previous connection being received incorrectly. If this delay is set too high, then that socket cannot be used for that time.
c) Receive window - The receive window refers to the amount of data that may be sent without an acknowledgement being received. For example, if the receive window is 64 Kbytes, then the sender can send at most 64KB without receiving an acknowledgement from the receiver.

Other factors that influence TCP performance are maximum segment size
that can be transmitted, and the keepalive interval. The latter corresponds to
the maximum time that the two ends of a TCP connection should be
maintained even though there is no data flowing between them. The
Maximum Segment Size (MSS) is the largest data size that the TCP segment
can send. As mentioned earlier, this can be configured as an option in the
first SYN TCP packet. A large MSS is generally better for large data
transmissions (unless there are fragmentation issues).

How does TCP do flow control? The key aspect of TCP is that the
receiver acknowledges receipt of a data packet to the sender. If the sender
fails to receive a threshold of packet acknowledgements from the receiver,
then the sender assumes failure and resends the data packets. While this is a
valid solution in some cases, it amplifies the problem in other situations. For
instance, if a link is congested, and the router drops packets, then the receiver
will not send acknowledgements for those dropped packets. The source in
turn will detect that the acknowledgements have not been received and resend the data packets, making the congestion worse. Two variables, the congestion window and the receiver window, are used in conjunction to control the
rate of transmission. A TCP must not send data with a sequence number
greater than the sum of the highest acknowledged sequence number and the
minimum of the congestion and receiver window variable values. Thus
packet drop can be used as an indicator of congestion, and the delay and
absence of ACKs can be used to actually reduce the number of data
transmissions to the receiver. However, such congestion control schemes
have drawbacks: estimating congestion by retransmission may not be
meaningful in certain cases such as wireless networks.

Let us summarize what TCP offers in addition to the IP layer. TCP directs
packets to different ports on the destination for different processes to use.
TCP verifies the validity of the data being transmitted by computing the
checksum on the message. TCP uses ACK's and re-transmit procedures to
detect possible delivery failures, and guarantee delivery of the packet, in
addition to sequential ordering of data segments. By "guaranteed", we do not
mean that using TCP will never fail. TCP can deliver the data only as long as
the connection holds up. For instance, in cases where clients are assigned
dynamic IP addresses, TCP can break down if the connection goes down,
even if the client immediately reconnects.

4.3 UDP

The User Datagram Protocol (UDP) is also built on top of IP at the transport
layer. However, unlike TCP, UDP is connectionless and does not support
reliable delivery of the packet from the source to the destination. Since UDP
does not have the overhead of reliable delivery that TCP has, UDP has better
performance. Typically, when UDP is used, the application layer is
responsible for reliable transmission of UDP packets. UDP is a datagram
oriented protocol. UDP applications send UDP packets to the destination,
and there is no acknowledgement at the transport layer. Thus, it is easy to see
why UDP is unreliable: a packet may be sent, but there is no way of knowing
if it was actually received, and if it needs to be resent. UDP packets are built
on top of IP, including a UDP header and UDP data.

IP Header | UDP Data

A UDP packet consists of the following fields:


• 16-bit source and destination port numbers.
• 16-bit UDP length
• 16-bit UDP checksum
• Data

The UDP checksum is computed over the UDP header and data, while the IP datagram checksum covers only the IP header. The checksum is optional in UDP, unlike TCP where it is required. If the checksum verification fails at the receiver, then the packet is silently discarded.

5. CLIENT/SERVER

5.1 Architecture

We have seen how "endpoints" are defined for the Internet. Communication between endpoints results in the flow of information across the Internet. A basic structure of the Web is the presence of special software called "servers" at some specific endpoints. "Servers" serve information to other requesting endpoints called "clients". The client-server architecture is not specific to the Web, but is applicable to any distributed system. Figure 8 illustrates the client-server architecture.

Figure 8. Client-Server Architecture: multiple clients (Client 1, Client 2) issue requests to a Web server.

The server typically listens for incoming requests at a port. A client that needs information served by that server connects to the server and requests the appropriate information. This request/response protocol is generally HTTP or HTTPS for the Web (these protocols are discussed in later sections). The server processes the request, and returns the information back to the requesting client. This server "processing" can include checking a database, contacting another server, and computation using algorithms to provide the desired functionality. For the Web, the server generally needs to handle requests from a large number of simultaneous clients.

5.2 Proxy Servers

It is often necessary to distribute the load of client requests across a number of servers. The simplest way of achieving this is to replicate the servers, and have a mechanism for redirecting requests from clients to the different individual server replicas. If there are common resources (such as databases) that limit the amount of replication, then these servers should be able to access such shared resources.

Another alternative is the notion of "proxy servers". Proxy servers, as the name suggests, are servers that do not have any original information. Rather, they act as an intermediary between the client and the server that actually possesses the information requested by the client. Proxy servers can store a copy of the response to the client request or forward the request to the original server¹.

¹ We use the terms "origin/original/source server" to refer to the server that is the origin for the requested resource.

[Figure 9. Illustration of a proxy server scenario: client A sends its
request to proxy Y, which forwards it to the original server X; the
response returns along the same path.]

Figure 9 shows an illustration of the flow in the proxy-server scenario.
Client A connects to the proxy server Y and sends its request. The proxy
server forwards this request to the original server X. This can also be
achieved by redirecting/forwarding the HTTP request. X processes this

¹ We use the terms "origin/original/source server" to refer to the server that is the origin for
the requested resource.

request and sends the result back to proxy Y. Proxy Y, in turn, returns this
response to client A. Clearly, nothing is gained if this is repeated for every
request, since ultimately every client request needs to be processed by the
original server. In practice, a number of the client requests can be stored or
cached for a certain period of time. This eschews the need to forward the
request to the original server every time, thus reducing the request load on
the original server. Thus, in the example above, if client A resends the same
request later to proxy Y, then proxy Y can check its local "cache", and return
the earlier saved response if it is still valid. This eliminates the need to contact
the original server for this client request. The price that one pays for using a
proxy is the added complexity involved in maintaining a copy of the data
and ensuring the validity of that data.

The main advantages of proxy servers are:


a) Better client load distribution.
b) Can have additional security by isolating clients from the original
server.
c) Performance gain through effective caching at the proxy.

Extensive research has been done on analysis of web server workloads (e.g.
[Barford & Crovella 1999]). In general, statistical properties of web client
workloads tend to exhibit high variability. Statistical properties include
properties such as file sizes, transfer times, and request inter-arrival times. A
commonly cited model to represent the relationship of documents and their
frequency of use is the Zipf distribution. Zipf's law [Zipf 1949; Mandelbrot
1983] was originally used to model the relationship between a word's
popularity in terms of its rank and its frequency of use.

If P is the frequency of use, and p is a measure of the popularity rank of the
document, then Zipf's law states that:

P ∝ p^(−β), where β is typically close to 1
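
To make the distribution concrete, the toy sketch below (entirely our own
illustration) assigns Zipf frequencies with β = 1 to N documents and asks
what fraction of requests fall on the k most popular ones, which is roughly
the fraction a proxy cache holding those k documents could serve.

# Toy Zipf model: the frequency of the document at popularity rank p
# is taken to be proportional to 1/p (i.e. beta = 1).
N, k = 10000, 100
weights = [1.0 / p for p in range(1, N + 1)]
top_k_share = sum(weights[:k]) / sum(weights)
print("Top %d of %d documents draw %.1f%% of requests"
      % (k, N, 100 * top_k_share))

With these numbers the hundred most popular documents draw roughly half of
all requests, which gives some intuition for the caching gains discussed
next.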


How useful is caching at the proxy? Experimental results vary, with
researchers citing performance gains from 30-60%, depending on the cache
page replacement algorithms used. Some of the commonly used page
replacement algorithms include Largest File First (LFF) and Least Recently
Used (LRU). Prefetching proxies and the application of models such as Markov
chains [Sarukkai 2000] to server client-request traces are discussed in a
later chapter on Web mining.
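
As a concrete illustration of the LRU policy just mentioned, here is a
minimal Python sketch (our own, not drawn from any particular proxy
implementation):

from collections import OrderedDict

class LRUCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()

    def get(self, url):
        if url not in self.entries:
            return None                       # cache miss
        self.entries.move_to_end(url)         # refresh recency on a hit
        return self.entries[url]

    def put(self, url, response):
        self.entries[url] = response
        self.entries.move_to_end(url)
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict least recently used

A real proxy cache would additionally honour the expiry information carried
in the HTTP headers before returning a hit.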

5.3 Caching protocols

In the previous section, we discussed the idea of a single proxy server.
But we can imagine a network of proxies that communicate with each other,
and exchange information in addition to communicating with the
original/source servers in order to enhance the efficiency and diminish
latency in completing the resource request. A number of standards have been
proposed for communication between caching proxies such as Internet Cache
Protocol (ICP), Cache Array Routing Protocol (CARP), HyperText Cache
Protocol (HTCP), and cache digests.

The Internet Cache Protocol (ICP) stemmed out of the Harvest project
[Bowman et al 1994]. ICP is a simple message transfer protocol tailored
towards information exchange between caches residing on different servers.
In a Web cache hierarchy, each cache establishes relationships with some of
the other caches. An example of a hierarchy is a parent/sibling hierarchy in
which caches at the same level are considered to have a sibling relationship,
whereas caches at a higher level (as defined by the hierarchy) have a parent
relationship. ICP can enforce algorithms by which requests to the server are
processed. An example procedure is to see if the requested document exists in
the server's cache; if not, query the sibling caches, and if that also fails,
forward the request to the parent. If, at some point, the document has still
not been retrieved, the request needs to be forwarded to the original server.
ICP is independent of more refined protocols such as HTTP, and is purely for
inter-cache communication.

The HyperText Caching Protocol (HTCP) expands on ICP by enhancing the
level of interaction between the caches. For instance, HTCP allows full
request and response headers to be used in cache management, and includes
monitoring of a remote cache's addition or deletion of information. HTCP
also supports hints on "web objects" such as availability and location.

In the Cache Array Routing Protocol (CARP), the URL space is divided
among an array of loosely coupled proxy servers. Hashing URLs to a
particular cache server in the cache array avoids redundant caching of
the same documents, and improves hit rates. The two components of CARP
are a proxy array membership table, and a hashing function plus a routing
algorithm for mapping the requested URL.
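
The flavour of hash-based routing can be sketched as follows. This is a
simplification of actual CARP, which combines URL and member hashes with
load weights; the host names here are invented for illustration.

import hashlib

# Proxy array membership table (hypothetical hosts).
proxy_array = ["proxy1.example.com", "proxy2.example.com",
               "proxy3.example.com"]

def route(url):
    # Hash the URL and map it deterministically onto one array member.
    digest = hashlib.md5(url.encode("utf-8")).hexdigest()
    return proxy_array[int(digest, 16) % len(proxy_array)]

print(route("http://www.testsite.com/index.html"))

Because the mapping is deterministic, every request for a given URL is
routed to the same member of the array, so no document is cached twice.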

In the cache digest approach, proxy servers exchange information about
the documents that are cached in that proxy. Cache digests are generated by
applying a hash function (such as MD5) on the URL (plus the HTTP retrieval
method). Cache digests are pulled from peer servers and maintained in
memory. When a client sends a URL request, the digests are checked to see
if there is a match, and the nearest proxy server with a matching digest is
queried for the resource.

5.4 Server Farms and Load-balancing

With millions of requests every second made to large-scale web sites, how
do you distribute these requests across many servers and balance the request
load on each server? How do you ensure fault-tolerance by having replication
of servers in case of server failures? Proxies can alleviate the situation, but
are only a part of the solution. This is because a proxy only acts as an
intermediary between the client and the original server, but can be a point of
failure in itself. Furthermore, as mentioned earlier, proxies also have their
disadvantages such as added complexity, and increased latency (e.g. in case
of a cache miss; i.e. document not found or expired in proxy cache).

One of the earliest approaches to splitting the load of requests is called
"mirroring". Mirroring replicates the content of a web site, commonly
hosted in different domains on different servers. Mirrors are also generally
geographically in different locations, and can speed up access times to local
users significantly. The main drawback of mirroring is twofold: replication of
the web server data, and ensuring consistency in the replicas. For web sites
with a large amount of information, replication and ensuring that changes on
one copy are propagated to another can be a time-consuming task.

Even with mirroring, load balancing at a single point of entry into a group
of web servers is essential. Load balancing of web servers is typically done
in two methods: DNS round-robin, and hardware based routers/switches. In
the DNS round-robin approach, the domain name server responds to
translation requests with a rotation of IP addresses of different hosts in a
round-robin fashion. In this manner, a series of requests is distributed
across many servers. There are several limitations to the DNS round-robin
approach. For one, intermediate name servers may cache the mapped IP
address, in which case the round-robin technique fails to load-balance
effectively. The second drawback is that the true solution should take into
account the "load" on a server, rather than just cycle through them. For
instance, a particular server may have high request processing times, while
another server may be relatively free. In the DNS round-robin approach,
there is no way of sending more requests to the server with the lesser load.
The second approach to load balancing is hardware based. Enhanced routers
can inspect and modify all IP addresses to map logically to a set of servers.
Switches provide load-balancing for a cluster of server machines within the
same subnet, in addition to other features such as isolation of data flow. The
disadvantage of the hardware approach is that the hardware can fail or can
itself become a performance bottleneck.
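
The rotation itself is trivial to express, as the sketch below shows (the
private addresses are made up); the hard problems are precisely the caching
and load-awareness issues just described.

import itertools

# Hypothetical pool of server addresses behind a single hostname.
servers = itertools.cycle(["10.0.0.1", "10.0.0.2", "10.0.0.3"])

def resolve(hostname):
    # Each translation request is answered with the next address in turn.
    return next(servers)

print([resolve("www.example.com") for _ in range(4)])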

Let us now summarize the above ideas. In order to have a fault-tolerant
and robust system, redundancy of servers is essential. There are a number of
heuristics that can be used to determine the actual number of servers in a
group, or "server farm". A simple approach is to use one more server than
the number of servers required for the anticipated load (called the N+ 1
approach). In order to distribute the load of incoming requests from clients,
this server farm resides behind a load-balancing scheme such as DNS round-
robin or hardware based switches. In case of a failure of some servers,
requests from some/all clients can go to the remaining functional servers thus
providing a reliable service. Things can get complicated when the servers
start storing state or session information for a client on the server side. Thus,
a client's request may be "transparently" assigned to server A, and server A
stores some session state information for that particular client. If the next
request from the same client requires that state information, then the request
should go back to the same server A, or the stored state information should
be accessible by other servers behind the load-balancing rotation. Such issues
should be kept in mind when designing redundant, fault-tolerant server
systems.

5.5 Server Implementations

One of the early mechanisms used in the development of Web servers is
the Common Gateway Interface (CGI). CGI provides the ability to process
dynamic web requests by associating specific requests with specific
processing programs. In this protocol, a request of the form
http://www.testsite.com/cgi-bin/app?test=true will be automatically parsed
and interpreted by a web server (with CGI support) as a request to the
program called "app" along with the specified parameters, in this case
test=true. Generally the application is in the form of a scripting language
handled by the system, such as Perl. Such a request results in the spawning of
a separate process where this program is executed, and the results returned by
the server to the client. After the request is processed, the process is
destroyed.

While CGI is a powerful concept, its main disadvantage is the
overhead of process creation and destruction. If a server is serving a hundred
clients, this overhead becomes significant enough to possibly result in poor
performance. An alternative to processes is the use of threads. In this
scenario, the server creates a pool of threads. Each request is handled by a
separate thread. While this eliminates the process creation overhead, the
overhead of thread management is introduced. Furthermore, since threads
share the same program space, appropriate locking mechanisms need to be
implemented in order to access shared resources. Java servlets are based on
this model of multithreading at the server, although the protocol is not CGI
based; rather, classes are dynamically loaded on the server from compiled
Java source code.

When comparing threads with processes, processes are in several ways
attractive. Since each client request is generally independent of other client
requests, there is no need for a shared program or memory space.
Furthermore, it is conceptually easier to associate each client request with a
separate process than with a single thread of a multi-threaded system. The main
disadvantage of the process approach is the overhead of process creation and
destruction. An intermediary solution that has the benefits of processes,
while reducing the overhead of process creation/destruction is the reuse of
processes for multiple client requests. In this approach, a process pool is
created and maintained. Each client request is handled by one of the
processes from the process pool. However, after the client request is
processed, the process is not destroyed. Rather, it is added back to the pool of
free processes in order to handle the next request. This approach has the
advantage of processes while eliminating the cost associated with repeated
process creation and destruction. Although the overhead of process pool
management is incurred, this has proven to be an effective practical solution.
FastCGI is an example of such an implementation. mod_perl is another
variant that embeds a Perl interpreter inside the web server in order to speed
up CGI processing.
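
As a concrete sketch of the thread-per-request model (the port number and
echo behaviour are our own choices, not tied to any production server),
Python's standard socketserver module can be used:

import socketserver

class EchoHandler(socketserver.StreamRequestHandler):
    # Each incoming connection is handled in its own thread.
    def handle(self):
        line = self.rfile.readline().strip()
        self.wfile.write(b"echo: " + line + b"\n")

if __name__ == "__main__":
    # ThreadingTCPServer spawns one thread per connection; the
    # process-per-request variant is ForkingTCPServer.
    with socketserver.ThreadingTCPServer(("", 8080), EchoHandler) as srv:
        srv.serve_forever()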

Other alternatives include Active Server Pages (ASP), which enables
scripting on the HTML page to be executed on the server side in order to
generate dynamic web content. PHP is another scripting language similar to
Perl; its interpreter is embedded within the web server, allowing PHP code
inside HTML pages, and it is supported by the popular Apache web server
(http://www.apache.org/). The handlers can also be written as compiled
C/C++ code and invoked in place of the above alternatives. There are many
other alternatives to the above-mentioned
methods, and description of the many scripting languages and server
platforms can be found elsewhere.

6. HYPER TEXT TRANSFER PROTOCOL (HTTP)

6.1 What is HTTP?

Hyper Text Transfer Protocol (HTTP) is a protocol to transfer
hypermedia information across a distributed network such as the Internet.
The first version (HTTP/0.9) was defined by Tim Berners-Lee in 1990. The
goals of HTTP were to have a simple text-based protocol of communication
between clients and servers in a distributed network in order to exchange
hypermedia documents such as HTML. HTTP is a protocol that is built on
top of the underlying TCP/IP layer.

HTTP allows clients to send a request in order to retrieve a URL, and
servers to respond with the contents of that URL. The simplest scenario of
HTTP is as follows: a client opens a connection with the server and sends a
request; the server receives the request, returns the appropriate data back
to the client, and the connection is closed. This procedure of request and
response continues for different clients, or repeatedly from the same client,
as needed.

The first thing to note is that the protocol is stateless: no information is
retained by HTTP across requests for future use. Thus, state information
needs to be explicitly maintained by the client and/or the server. HTTP
requests consist of three parts: request line, request header, and entity
body, as shown in Table 17.

Table 17. HTTP/1.0 Client Request

HTTP REQUEST              Example
Request Line              GET /home.html HTTP/1.0
Request Header Fields     User-Agent: Mozilla/4.0
                          Accept: text/*
                          (followed by a blank line)
Entity Body               [Entity Body]

The request line consists of a command or method, the requested URL, and
the version of HTTP. An example of an HTTP request line is shown below:

GET /home.html HTTP/1.0

A simple way to test this on UNIX machines is to telnet into port 80 of the
appropriate web server and type in the above command followed by a
carriage return. Thus, if you want to access the URL
http://www.w3.org/home.html, then you would connect to www.w3.org at
port 80, and type in the GET request shown previously followed by a
carriage return. In response, the server would process the request and return
the appropriate HTTP status code and requested document. If you do a
"GET / HTTP/1.0", you will get the default home page for the site. A
sample response is shown below:

HTTP/1.0 200 OK
Date: Mon, 1 Jan 2002 23:59:59 GMT
Content-Type: text/html
Content-Length: 123

<HTML>
<BODY>
<H1> Welcome to my homepage </H1>
</BODY>
</HTML>
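
The same exchange can be scripted rather than typed into telnet. The
following Python sketch issues an HTTP/1.0 GET over a raw socket (the host
is just an example; the Host header, while optional in HTTP/1.0, is included
because most servers today expect it):

import socket

host = "www.w3.org"
request = "GET / HTTP/1.0\r\nHost: %s\r\n\r\n" % host

sock = socket.create_connection((host, 80))
sock.sendall(request.encode("ascii"))

# Read until the server closes the connection (HTTP/1.0 behaviour).
chunks = []
while True:
    data = sock.recv(4096)
    if not data:
        break
    chunks.append(data)
sock.close()
print(b"".join(chunks).decode("latin-1")[:300])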

The GET command indicates that the client is requesting the URL
content from the server. Two other commands allowed in HTTP/1.0 are
POST and HEAD. POST is used when the client wants to submit data to the
server. HEAD is used when the client needs header information about the
URL resource. HEAD is useful when the client needs to verify whether its
copy of the requested resource is outdated. If the date of modification has
not changed since the resource was previously accessed and cached, then the
lightweight HEAD request saves the client from fetching the whole document
again.

The first line of the response contains the HTTP version and the status of
the request. For instance:

HTTP/1.0 200 OK

This is followed by the actual data or resource requested. Some of the
common status codes returned in HTTP are summarized in Table 18. In
general, 1xx codes are informational, 2xx indicate success, 3xx redirection,
4xx client errors, and 5xx server errors.

Table 18. HTTP status codes


STATUS CODE MEANING
200 OK
201 Created
202 Accepted
204 No Content
301 Moved Permanently
302 Moved Temporarily
304 Not Modified
400 Bad Request
401 Unauthorized to access
403 Forbidden
404 Not found
500 Internal Server Error
502 Bad Gateway
503 Service Unavailable

Header fields play an important role in HTTP. HTTP/1.0 added header
information that enabled specification of the type of the document. The
underlying format was based on the Multipurpose Internet Mail Extensions
(MIME) standard. Other fields act as modifiers on the request/response
semantics. For instance, the "User-Agent" field informs the server of the
application that is making the request:

User-Agent: Mozilla/1.0

The "Last-Modified" field is specified in the response header to indicate the


last time that the retrieved document was modified (in Greenwich Mean
Time). This is useful for caching at the client or at proxy servers.
Last-Modified: Fri, 1 Jan 2001 23:59:59 GMT

The "Content-Type" and the "Content-Length" give information about the


type and length of the body of the HTTP message. This is useful for the
receiving application to read the specified amount of data, and to interpret it
appropriately. For instance, if the "Content-Type=image/jpeg" then the
browser can render it appropriately. A summary of header fields are
specified in Table 20.

6.2 HTTP 1.1

One of the limitations of HTTP/1.0 is the absence of persistent connections.
Consider the case where an HTML document consists of both the HTML text
and embedded image SRC tags. Using HTTP/1.0, the client needs to make
multiple separate connections with the server in order to get the HTML
document and the images embedded in that document. Since there is an
overhead to establish and close a TCP connection, it is inefficient to make
multiple connections for a single document. Furthermore, since connections
remain in the CLOSE_WAIT state for some duration, multiple connections
would leave a lot of sockets in the CLOSE_WAIT state, potentially becoming
a bottleneck on the server.

A solution to this in HTTP/1.1 is to allow multiple requests over a single
connection. The persistent connection is the default in HTTP/1.1 servers: the
connection is kept open after the server processes a request and returns the
data requested. The client can then send another request on the same
connection. The connection is kept open until the client sends a "Connection:
close" in the header.

[Krishnamurthy et al 1999] discuss the additional features in HTTP/1.1,
including extensibility, caching directives, security, and content negotiation;
we discuss the key points below. HTTP/1.1 supports additional commands:
OPTIONS, PUT, DELETE, TRACE, and CONNECT. For instance,
OPTIONS enables the client to query the capabilities of the server without
requesting a resource. HTTP/1.1 extends the response status codes and adds
the 'Warning' field to support additional features. Table 19 lists some of the
return codes that are absent in HTTP/1.0. HTTP/1.0 provided a simple
caching mechanism using the 'Expires', 'If-Modified-Since' and
'Last-Modified' header fields. Thus, the origin server would return a
document with an 'Expires' header field. If a cache needs to check the
validity of a document, it sends the 'If-Modified-Since' header field with
the value specified in the cached document's 'Last-Modified' header field at
the time of caching. The server, in turn, responds with a 304 (Not Modified),
in which case the cached copy is valid. If the server responds with a 200
(OK) response, then the document returned replaces the cached version.
One of the issues with the 'If-Modified-Since' header field in HTTP/1.0 is
the need for synchronized absolute timestamps, which can result in errors if
there are clock synchronization errors. HTTP/1.1 introduced the notion of a
validator string called the 'entity tag' that the cache and origin server
exchange with each other in order to match different entity tags returned in
the server response headers. The 'If-None-Match' header field in HTTP/1.1
allows the client to present one or more entity tags to the server, and the
server responds with information on which entity tag is currently valid (if
any). Furthermore, HTTP/1.1 added the new 'Cache-Control' header that
enables an extensible number of cache-control directives to be specified.
Thus, another solution to the clock skew issue entailed by the HTTP/1.0
'Expires' header field is to introduce relative expiration times (e.g. using
the 'max-age' directive). Additionally, limited privacy features are provided
with the 'private' and 'no-store' directives to prevent the storage of some or
all of the response data.
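
An entity-tag validation exchange can be sketched as follows (the host name
is only an example, and whether an ETag is returned depends on the server):

import http.client

conn = http.client.HTTPConnection("www.example.com")

# First fetch: remember the validator the server returned.
conn.request("GET", "/index.html")
resp = conn.getresponse()
body, etag = resp.read(), resp.getheader("ETag")

# Revalidation: present the entity tag; 304 means the copy is still fresh.
headers = {"If-None-Match": etag} if etag else {}
conn.request("GET", "/index.html", headers=headers)
resp = conn.getresponse()
if resp.status == 304:
    print("cached copy is still valid")
else:
    body = resp.read()            # 200: replace the cached version
conn.close()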

Another important area of improvement in HTTP/1.1 is bandwidth
optimisation. A client can request only a part of a resource by using range
requests. Thus, a client can include a 'Range' header that specifies a range of
bytes, and the server can respond with that range of data and a 206 (Partial
Content) response code. Range requests are useful in many instances, such as
reading an initial part of an image to determine geometry for layout,
completing a previously interrupted transfer, or even reading the tail of a
growing object. HTTP/1.1 also delineates content encodings, which are
end-to-end data format encodings, from transfer encodings, which are
hop-by-hop codings used for a message. HTTP/1.1 also supports pipelining,
which allows multiple client requests to be sent on a single TCP connection;
the order of the server responses should match the order of the client
requests. HTTP/1.1 also introduces the notion of 'chunked' transfer coding
that allows the sender to break the message body into separate chunks. When
the entire message has been sent, the end is indicated by sending a
zero-length chunk. The use of chunking is indicated by the
'Transfer-Encoding: chunked' directive. HTTP/1.1 also allows hostname
specification using the 'Host' field, which enables binding of multiple
hostnames to the same IP address.
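
Decoding a chunked body is straightforward, as the sketch below shows (our
own illustration; trailers after the final chunk are ignored for simplicity).
Each chunk is a hexadecimal length line followed by that many bytes of data,
and a zero-length chunk ends the message.

def decode_chunked(raw):
    body, pos = b"", 0
    while True:
        eol = raw.index(b"\r\n", pos)
        size = int(raw[pos:eol], 16)       # hexadecimal chunk size
        if size == 0:                      # zero-length chunk: end of message
            return body
        body += raw[eol + 2 : eol + 2 + size]
        pos = eol + 2 + size + 2           # skip the CRLF after the chunk data

print(decode_chunked(b"5\r\nHello\r\n7\r\n, World\r\n0\r\n\r\n"))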

Table 19. Sample HTTP/1.1 response codes absent in HTTP/1.0

Return Code     Description
206             Partial Content
303             See Other
305             Use Proxy
307             Temporary Redirect
402             Payment Required

The header fields supported by HTTP/1.1 are listed in Table 20. In the area
of security, HTTP/1.1 enhanced the challenge/key security mechanism
provided in HTTP/1.0 by adding digest authentication and proxy
authentication, so as to avoid transmission of the username and password in
the clear (discussed in a later section on web security). Lastly, HTTP/1.1
also improved on the content negotiation features provided in HTTP/1.0 by
providing server-driven and agent-driven negotiation mechanisms.

Table 20. HTTP/1.1 Header Fields.

Request Header Fields: Accept, Accept-Charset, Accept-Encoding,
Accept-Language, Authorization, Expect, From, Host, If-Match,
If-Modified-Since, If-None-Match, If-Range, If-Unmodified-Since,
Max-Forwards, Proxy-Authorization, Range, Referer, TE, User-Agent.

Response Header Fields: Accept-Ranges, Age, ETag, Location,
Proxy-Authenticate, Retry-After, Server, Vary, WWW-Authenticate.

Entity Header Fields: Allow, Content-Encoding, Content-Language,
Content-Length, Content-Location, Content-MD5, Content-Range,
Content-Type, Expires, Last-Modified, extension-header.

6.2.1 State Information

It is important to note that HTTP is "stateless". Let us take the following
example: client A connects to server X and requests a document U. The
request and response use HTTP/HTTPS. Thus, after the response is sent, the
connection between the client and server is closed, and the server retains no
information about the client or its request. If client A requests another
document from server X, X treats the next request as an independent one.
"State" information corresponds to information stored from previous
requests.

Since HTTP is "stateless", how do client and servers' keep state


information. The first solution is to store the state on the server side. This
requires that the client or session be identifiable in some unique manner, and
the stored information be persistently stored on the server side. Examples
78 Foundations of Web Technology

could include passing around a hidden field variable that is passed with every
request back to the server. Another approach is to use the IP address of the
client. However, IP addresses can be spoofed, multiple clients can share an
IP address (e.g. though a Internet Service Provider), or IP addresses can be
dynamically assigned to a client. A second solution is the client side storage
of "cookies". A cookie is just a name-value pair that is restricted to a domain.
For instance, if a cookie called "mystate" is issued in the domain
".yahoo.com", then only servers in the .yahoo.com domain can set or modify
the cookie. The client software (such as a web browser) can submit such
cookies whenever a request to a server in that domain is sent. Since the
cookie is passed back and forth between the client and server, state
information can be maintained. The cookie can be set by sending the cookie
name and value as a part of the HTTP header. Cookies on the client side are
usually managed and stored on the local machine by the Web browser.
Whenever requests to that URL are sent from the client, the cookie is also
passed along in the HTTP header, and this can be processed by the server and
reset by the server to be passed in the next request. Since cookies allow the
storage and transfer of information about an user's previous requests2, it is
important to ensure that the security and the privacy of the user is not
sacrificed. Combination of cookies, IP and session identifiers are generally
now used for maintaining state information.
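
At the header level the round trip looks like this (the names and values are
invented for illustration): the server plants the cookie in its response, and
the browser presents it on every later request to that domain.

Server response:
HTTP/1.0 200 OK
Set-Cookie: mystate=abc123; domain=.yahoo.com; path=/
Content-Type: text/html

Subsequent client request:
GET /page.html HTTP/1.0
Cookie: mystate=abc123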

7. WEB SECURITY

7.1 Basics

Web security is an important part of Web technology. The Web is a
distributed medium for information exchange around the world. In this
process of communication, the following aspects of web security are
important:
• Authentication: You are who you say you are.
• Confidentiality: No one else can read my message.
• Integrity: Is this really the message you sent?

• Non-Repudiation: Can the sender disclaim having sent the message?

² Cookies, in many ways, ended privacy on the Web. Browsers do allow options for limiting
or even disabling cookie submission, but many applications rely on cookies for proper
functioning.

Authentication enables the verification of the identities of the parties
exchanging information across the Web. Confidentiality refers to the privacy
of the data, so that only the sender and receiver can comprehend the data
sent. Integrity involves verification that the data has not been modified after
it was sent by the sender. Non-repudiation means that the sender or receiver
cannot disclaim that the message was sent or received. This can extend to
other aspects such as repudiation of origin, submission, delivery, or receipt.
In all cases, a trusted third party needs to be involved in the process.

A simple method of authentication is the "challenge and response"
method. This is the scenario where the client enters a user name and a
password. The server matches this information with its stored information to
validate the identity of the client. A drawback of this approach is the need for
the server to store all the password information. A second issue is the need
for the password to be transmitted from the client to the server. Digest
authentication is an extension of the basic authentication scheme. In Digest
authentication, the password is not passed over the network. Instead, a hash
function over the user identification, password, URI, HTTP request method,
and a randomly generated number used only for that session (called a "nonce")
is computed and passed across the network from client to server. The server
uses the same information to compute its hash, and if the hashes match, then
access is allowed by the server. Again, the server needs to maintain
passwords.
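
The flavour of the computation can be sketched as follows. This is a
deliberate simplification (the real HTTP digest scheme composes nested MD5
hashes in a prescribed way), and all of the values are invented.

import hashlib

def digest_response(username, password, method, uri, nonce):
    # Hash the credentials together with the request and the one-time
    # nonce; the password itself never crosses the network.
    material = ":".join([username, password, method, uri, nonce])
    return hashlib.md5(material.encode("utf-8")).hexdigest()

# Client and server each compute the hash and compare the results.
print(digest_response("alice", "secret", "GET", "/home.html", "4f7a99"))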

At the root of security is "encryption". Encryption refers to transforming
the original message into a different one in such a manner that it is very
difficult for parties other than the sender and receiver to decipher the
message. Encryption techniques are highly specialised and form a separate
field of study (called cryptography). A popular encryption algorithm is RSA,
invented by Rivest, Shamir, and Adleman. Encryption algorithms use
parameters called "keys" that determine how the transformation occurs, and
allows the encryption and decryption of messages. Figure 10 shows a simple
scenario where a message is encrypted and transmitted from the sender to the
receiver. In this case, the client encrypts the message with a key, and sends
the encrypted message to the server. If someone eavesdrops on the
communication, they will receive the encrypted message, but will be unable
to decipher it without the key. The server needs the key in order to decipher
the encrypted message to retrieve the original message. The limitation of this
approach is the need to make sure that the key is securely shared.

[Figure 10. Simple example of encrypted message transmission: the client
encrypts a message with a key, sends the encrypted message across the
network to the server, and the server uses the key to decrypt it.]

This approach is also useful when one party wants to send information
back and forth to itself. Since HTTP is stateless, it is often useful for the
server to pass state information to the client and have the client submit it
back to the server for future use. In such a scenario, the server can encrypt
this information with keys that reside on the server side, and can later
decrypt the information using the same keys. Thus, the key is securely stored on
the server side and does not need to be transmitted to the client.

7.2 Digital Certificates

In practice, public-key encryption techniques use a pair of keys. The two
keys are complementary in the sense that either key can be used to decrypt a
message encrypted with the other key. However, one key cannot be derived
from the other. In a sense, the encrypted message can be thought of as a lock
with two keys, either of which can open what the other has locked. One key
is known as the 'public' key and is accessible publicly, while the other is
known as the 'private' key and is not accessible by anyone other than the
owner. Thus, when a sender wants to transmit information to another party,
the message is encrypted using the sender's private key. Upon receipt of the
message, the receiver can successfully decrypt the message (using the
sender's public key) only if it has not been modified. This assures the
receiver of the integrity of the transmitted message.

While the use of public and private keys ensures that the message has not
been modified, it does not by itself establish that the message is actually
from who it is supposed to be from. The key property used here is that of
'asymmetric keys': if one key is used for encryption, then only the other can
be used for decryption. This notion of asymmetric keys is very powerful in
ensuring not only the integrity of the data that is being transmitted, but also
the authentication of the sender. Let us now see how asymmetric keys help
in enabling a secure transfer of information. This is best illustrated with an
example.

Let us assume that encryption and decryption with asymmetric keys is like
having a lock that can be locked with one key but only unlocked with the
other. B wants to send some valuables to A. First A locks a box with his
private key and ships it to B. B receives the box and unlocks it with the
public key of A. B puts another box inside this one, locked with the private
key of B. Then he locks the outer box with the public key of A, and ships it
back to A. Now A unlocks the outer box with his private key, unlocks the
inner box with the public key of B, and gets the valuables that B sent him.
When the valuables were transmitted from B to A, a thief would need the
private key of A in order to break into the boxes. Furthermore, if someone
else had stolen the box that A sent to B and locked it with their own private
key, then A would detect this, since A would use the public key of B to
unlock the inner box.

This is exactly what happens with public and private keys. Digital
certificates are essentially keys stored in files. Digital certificates are
produced by an established certificate authority. Digital certificates also
serve the purpose of vouching for a third party. Let us consider the case
where party C wants to send data to B. If C sends its public key to B, how
does B know for sure that it is from C? B trusts A, but not C yet. If A
encrypts C's public key using its private key and sends it to B, then B can
trust the validity of C's public key. Certificate authorities such as VeriSign
do this, and offer different grades of digital IDs.

7.3 SSL and HTTPS

HTTP has been extended to provide a secure means of data
communication between client and server. This is achieved with the use of
the Secure Sockets Layer (SSL). SSL lies between the HTTP layer and the
TCP/IP layer. Using HTTP on top of SSL is also called HTTPS. URLs that
communicate using the HTTPS protocol begin with 'https:'. The sequence of
steps involved in HTTPS is outlined below:
a) The client connects to the server using the https protocol.
b) The server signs its public key with its private key and sends it back to the client.
c) The client decrypts the data sent by the server using the server's public key in order to
authenticate the server.
d) The client verifies that a trusted certificate authority signed the key.
e) The client generates a new symmetric key for the session.
f) The client encrypts the symmetric key using the public key of the server and sends it
to the server.
g) The server decrypts using its private key to get the symmetric key, and uses the
symmetric key for encrypting future data for that session.

The first step in an HTTPS connection is the authentication of the identity of
the server. This is done by having the server sign its public key using its
private key, and send it back to the client. The client then uses the server's
public key to decrypt the signed key. Since the server's public and private
keys form a matched pair, the client has verified (with a successful
decryption) that the server is indeed what it claims to be. The client then
verifies that the key is signed by a trusted certificate authority. If not, then
appropriate action is taken, such as closing the connection. If the certificate
authority is trusted, then the client generates a key that is to be used for
encrypting the data for the rest of the session. This is a "symmetric key",
and is sent to the server encrypted using the public key of the server. A
symmetric key is one that can be used both for encrypting and decrypting,
unlike asymmetric key pairs, where if one key is used for encrypting, then
only the other can be used for decrypting. Symmetric keys are simply
computationally less intensive for encrypting and decrypting. The server
decrypts the symmetric key using its own private key, and uses it for the rest
of the session. Thus SSL provides a secure means of authenticating
endpoints and achieving confidential transmission of data across the network.
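
From the application's point of view, all of this handshaking happens inside
the SSL layer. A minimal Python sketch of a secure fetch is shown below
(the host name is just an example); the wrap_socket() call performs the
certificate verification and session-key negotiation described above.

import socket, ssl

context = ssl.create_default_context()    # loads trusted CA certificates
with socket.create_connection(("www.w3.org", 443)) as tcp:
    # The handshake happens here: the server is authenticated and a
    # session key is negotiated before any application data flows.
    with context.wrap_socket(tcp, server_hostname="www.w3.org") as tls:
        tls.sendall(b"GET / HTTP/1.0\r\nHost: www.w3.org\r\n\r\n")
        print(tls.recv(300).decode("latin-1"))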

It is important to note that SSL is a point-to-point security measure. In
other words, SSL ensures integrity and authenticity from one point (for
example, the client) to another point (the server). What happens if the
communication goes through a proxy server, for instance? Such a scenario is
called end-to-end, and it is clear that SSL can only ensure security from the
client to the proxy, and from the proxy to the origin server. An important
issue is how we verify that the message is from the source (and not modified
by an intermediary such as the proxy). Furthermore, the weak point is often
not in the communication itself (i.e. sniffing of packets), but at the
end-point (such as breaking into a database). XML Signatures
[http://www.w3.org/Signature] present an alternate means of achieving
digital signatures using XML. The main advantage of XML signatures is the
ability to selectively sign parts of a document. This is often useful when
different parts of the document are authored by different parties, or only
parts of the document need to be protected.

7.4 Non-Repudiation

As we mentioned earlier, non-repudiation³ refers to the ability of the
infrastructure to ascertain the following:
• Non-repudiation of Origin
• Non-repudiation of Delivery
• Non-repudiation of Submission
• Non-repudiation of Transport
The non-repudiation of origin ensures that the origin of the message is
verifiable in the signed message. Non-repudiation of delivery ensures that the
service will sign a proof of message delivery. Non-repudiation of submission
means that the service will sign a proof of message submission. Lastly, non-
repudiation of transport implies that the service guarantees that the message
has been transported by a delivery authority to the intended receiver. Many
approaches for ensuring non-repudiation have been proposed, including
time-stamping of messages, storage of evidence for submission and delivery, and
key and identity certification. In all cases, a Trusted Third Party (TTP) needs
to be involved in the process of non-repudiation.

³ Repudiation means "refusal, especially by public officials, to acknowledge a contract or
debt". When we discuss systems with non-repudiation, we refer to mechanisms by which
the system can protect the parties against such repudiations.

8. PRIVACY

A vast amount of information about users is available in web servers. We
have seen two instances where this can be controlled, namely HTTP privacy
directives, and cookie control in browsers. Privacy is an important part of
the relationship between a customer and a site. The information collected
from the client/user can be used in a variety of interesting and useful ways
(discussed in the chapter on Web Mining), but it is imperative that the user
knows about the site's usage policy, and agrees with it.

At the basic level, any serious web site must publish a privacy policy
statement and abide by it. The Platform for Privacy Preferences (commonly
called P3P) is an emerging standard that provides a formal mechanism for
specifying privacy policies and communicating them to the user or a
privacy policy agent. For instance, a P3P agent can be built into a browser,
and can aid the user by popping up an alert (for instance) if the privacy
policy setting for the user has been violated.

P3P 1.0 [WWW P3P] discusses the details of the P3P version 1.0
standard, and we will highlight the basic points. Every web site must have a
standard location where its privacy policy is accessible (such as
/w3c/p3p.xml), or the policy may be embedded in the link tag of the HTML
document. P3P 1.0 describes in detail the various tags allowed in the P3P
XML file, and how they are interpreted. The goal is to allow P3P agents to
extract the P3P XML description and present it or integrate it with the
user's privacy requirements. Additionally, privacy attributes may be
specified in HTTP responses from web servers. The policy file can specify
details such as how user requests and cookies are used, when the privacy
policy expires, and which zone the policy applies to.

9. CONCLUSION

The growth of the ARPANET led to the development of the Internet, which
is the infrastructure of the World Wide Web. The basic notion in any
network is that of an "endpoint". Communication occurs between
endpoints. In the case of the Internet, the endpoint is specified as a 4-byte
IP address. Resources on the Web are identifiable using the URL format.

The Domain Name System (DNS) maps the URL to a specific IP address.
TCP/IP protocol suite provides the transmission and control formats and
protocols for exchange of data packets in the Internet.

The client/server architecture is one of the fundamental mechanisms by
which the Web operates. The HTTP protocol is used for the
communication between clients and servers. In order to ensure security
and privacy of communication between clients and servers, encrypted
means of transmission is achieved using the HTTPS protocol.

FURTHER READING

[Stevens 1994] and [Wright et al 1995] are comprehensive books
covering the TCP/IP protocol suite. [Tanenbaum 1996] is a good textbook on
fundamentals of networking. [Krishnamurthy & Rexford 2001] is a
comprehensive book covering HTTP, web client/servers, and proxies. Web
security concepts are well-covered in [Garfinkel et al 2002]. [ICP 1997],
[HTCP 1999], [CARP 1998], and [Cache-Digest] are good resources for
descriptions of inter cache communication protocols. Limitations of PKI are
summarized in [Ellison & Schneier 2000], and XML standards for security
interoperability is discussed in [Kobielus 2001].

EXERCISES

1) Explain the following with examples:


(a) IP address
(b) URI
(c) URL
(d) DNS

2) Contrast TCP and UDP. Which has the lesser transmission overhead?
Which method guarantees delivery? How?

3) Design and implement a simple server that listens on port 80. Whenever a
client connects to the server, the server waits for the client to write the string
"hello", upon which the server responds with the string "Got the message.
Goodbye." to the client and closes connection.

4) Write a simple client program that interoperates with the server in exercise
(3).

5) Make the server in exercise (3) multi-threaded, able to handle a
maximum of 10 clients.

6) Make the server in exercise (3) multi-process, able to handle a
maximum of 10 clients. Contrast the performance with the server in exercise
(5).

7) Convert the server in exercise (5) or (6) to a simple HTTP server that
handles the HTTP commands "GET" and "HEAD".

8) Implement a proxy server that caches previous responses by storing them
on local disk. The stored data expires after 5 minutes.

9) Contrast the HTTP/1.1 protocol with its predecessors.

10) Discuss how secure communication is achieved between the client and
server using HTTPS.

11) The apache server (with cgi-support) can be downloaded and installed
from www.apache.org. Write a CGI script in Perl that accepts two integers
and computes the lowest common factor of the two integers. Design an
HTML page to submit the two values, and format the result from the CGI
Perl script using HTML.
PART II

APPLICATIONS
Chapter 4

Information Retrieval

Mistakes live in the neighbourhood of truth and therefore delude us

Abstract: What are the techniques used for the representation of information contained in
a large collection of data? How is a query matched with the stored information
in order to retrieve relevant documents? The architecture and techniques
behind text information processing are presented in this chapter. Algorithms
for document analysis (such as latent semantic indexing (LSI)), query
expansion, and notions such as document spaces are discussed with examples.

Keywords: Text analysis, document spaces, indexing, search, query expansion, ranking,
precision-recall graph, latent semantic indexing, TF-IDF measures.


1. INTRODUCTION

Information retrieval is the study of methods for representing
information, and mechanisms for locating specific parts of the stored
information in response to a query. How do you find a sports article in a
newspaper? Probably by just going to the sports section and picking an
article. How do you pick an article on baseball? Perhaps by locating the
keyword baseball in sports articles. What about identifying an article
discussing the future of tennis? The complexity of the identification and
retrieval of documents increases with the amount of data and the constraints
on the query. It requires analysis of the document, and an "understanding" of
the content of the article. Imagine having a book with a million articles, and
being asked to find articles that are relevant to gravitational physics. What
are the steps involved in designing methods for computers to process huge
amounts of data, and extract relevant subsets of the data in response to a
query from the user? How does one condense vast amounts of data into
useful compact forms? The field of information retrieval addresses such
questions, including visualization, analysis and processing of information.
Although text processing is primarily covered in this chapter, information
retrieval covers other types of data such as audio, visual and multimedia
data.

Information retrieval (IR), as the name suggests, is the study of
techniques used for the representation and indexing of information. While
the problem is general and related to many areas such as artificial
intelligence, the primary focus of this field is the representation and
retrieval of information (such as text retrieval).

The World Wide Web is composed of millions of hypertext documents
and multimedia data. It is crucial to be able to represent all or a subset of
this information in a manner that is useful for fast retrieval. A classic
application of the information retrieval paradigm is the Internet search
engine. The Internet search engine is essentially a text retrieval system that
retrieves information from the World Wide Web in response to query
keywords or phrases. Of course, there are important issues that arise in the
context of web search that are absent in classical text retrieval systems, and
these will be discussed in detail in a later chapter.

2. COMPONENTS OF IR SYSTEM

[Figure 11. Overview of an Information Retrieval System: documents are
transformed into a database representation; a user's query is processed
against the database, and the retrieved documents are ranked and output to
the user.]

Let us now turn our attention to the basic block diagram of a generic
information retrieval system. Much of the discussion will revolve around text
retrieval, since that is one of the most common problems in information
retrieval. It should be noted that there are many other non-textual problems
that fall into the category of information retrieval. For example, a user may
show a picture and ask the system to retrieve "similar images". This would
involve a lot of image processing and visual pattern recognition, but can be
posed in the information retrieval paradigm.

Figure 11 shows a block diagram representation of an IR (abbreviation
for information retrieval) system. The left portion of the diagram (everything
to the left of the database box) corresponds to the "information
representation phase". The set of documents (which is the information users
will want to access) is transformed through a series of algorithms (an
excellent edited collection of papers is [Frakes & Baeza-Yates 1992]), and a
final representation of the information is stored in a database or some other
stored form which can be easily and quickly accessed in response to a
user's request.

The other modules (above, and to the right of, the database box)
correspond to the actual "information retrieval phase". The user requests
some information by giving the system a query. The system parses the
query, performs some transformations on it, and retrieves relevant
information from the database. Typically, for large collections, the
information retrieved is large, and this retrieved data should be further
processed in order to determine its relevance and ranking with respect to the
user's original request.

The transformation operations allow the system to clean up the data. The
usual sets of transformations include tokenization, and document/term
operations. Term operation methods include stemming, stoplists, query
expansion with a thesaurus, truncation, weighting and so on. These methods
are discussed in the following sections.

Two notions fundamental to information retrieval are indexes and document
spaces. Many approaches in information retrieval represent documents as
vectors, whose individual components correspond to words. The vector
representations can be simple (such as whether a word is present or not) or
can indicate a probability or weighting of the presence of that word in a
specific document. Thus, each document is a point in this vector space, also
termed the "document space". Vector operations such as clustering and
projections may then be applied to these document vectors. Each individual
word is one dimension in this vector space and is called an 'index'. During
the representation phase, documents are mapped to document vectors, and
representations of groups of documents are extracted using various
algorithms. During retrieval, the query keywords are used as 'indexes' to
search for relevant documents in this vector space model.

3. TEXT PROCESSING

3.1 Tokenization

The first step of information data processing is to determine the basic
units or features that will be used for retrieval. In the case of text retrieval,
tokens or words can be simplistically defined as any sequence of letters
delimited by white space. However, we need to address the issues of
punctuation (including hyphenation), numerals, acronyms, Internet
addresses, formatting directives (such as "<H4> </H4>" in HTML) and so
on. While the problem of tokenization is not difficult, it must be carefully
designed since it affects the performance of the whole system.

Document analysis and retrieval methods work using a set of basic units
or tokens. A token can be arbitrarily defined depending on the context of the
information analysis and retrieval task. In text analysis, words are good basic
units upon which to base analysis and indexing. Tokens can also be defined
at sub-word or multi-word levels. In other tasks such as image retrieval,
basic units could be lines or boundaries in the image.

A standard approach to tokenization is to write a lexical analyser using a
tool such as lex (in UNIX). Alternatively, lexical analysers can be hand
coded, or specified by finite state machines/regular expressions or
context-free grammars. An over-generative regular expression for a simple
tokenizer is shown below:

Table 21. Regular expression generator for a simple tokenizer.

<NUMERALS> = [0-9]
<ALPHABETS> = [a-z|A-Z]
<PUNCTUATION> = [ ' | : | ; | " | . | ? | ! ]
<TOKEN> = [ <ALPHABETS>* [<NUMERALS>|<PUNCTUATION>]* ]+

In reality, regular expressions for tokenizers are much more complicated,
since the variability and special cases of the data must be properly handled.
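
For instance, a rough Python counterpart of the pattern in Table 21 might
look like the following sketch (a toy only; a production lexer would handle
far more special cases):

import re

# Letters optionally followed by digits or selected punctuation,
# loosely mirroring the over-generative pattern in Table 21.
TOKEN = re.compile(r"[A-Za-z]+[0-9'.?!:;]*")

def tokenize(text):
    return TOKEN.findall(text)

print(tokenize("Web technology, chapter 3."))   # ['Web', 'technology', 'chapter']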

3.2 Stoplists

The terms or tokens produced by the tokenizer will be used as features


(indices) for the information retrieval system. Thus it is important to filter
out redundant, useless or inaccurate terms that tend to lower the performance
of the information retrieval system. For example, consider the word "the" in
English. It is the most frequently occurring word. It is redundant and does
not convey any unique information about the document itself. However, the
word "church", for instance conveys meaning by itself. For example, if we
were to classify documents according to whether the text pertains to religion
or not, words such as "church" could be useful indicators or terms to make
such a decision. It is useful to discard words such as "the" that do not convey
any special meaning from the IR process. In fact, words that are highly
frequent are often not good discriminators for the categorization of text.
Such lists of words that are excluded in deriving the actual representation of
the information are called "stoplists", and are discarded during the initial
stages of processing.

Another utility of the stoplist is to discard words that do not pertain to a
special topic of interest. For example, if the information retrieval system is
being built to cover information on the art of gardening, then words such as
"bicycle" are irrelevant, and should generally be discarded from the IR
indexing terms. It is common to filter out the stoplist words at the lexical
analysis stage itself. [Francis & Kucera 1982] present a stoplist of 425 words
extracted from the Brown Corpus.

Table 22. Example of stoplist words


a, an, and, any, another, as, at, be, before, both, but, by, can, cannot,
could, come, did, do, does, each, ever, every.

3.3 Stemming

Stems refer to morphological roots of words that are useful for indexing.
A stemming program clusters words into morphological classes. For
example, the words "musical" and "music" can be stemmed to the root
"music". Of course, some information is lost, but such a reduction allows IR
systems to build compact systems with fewer index terms. Reducing the
number of index terms has many advantages: better trainability, reduction in
the size of index files, and faster search/retrieval.

Affix removal is one method of stemming wherein the prefixes and
suffixes are removed, leaving the stem. Affix removal may be implemented
in a variety of ways: for instance, the program can search for the longest
matching string and choose that as the stem. Longest-match stemmers
are typically iterative. A popular example of such stemming algorithms is
the Porter algorithm, which uses condition/action rules defined on the stem
and suffix.

Porter's algorithm [Porter 1980] defines conditions on the stem, the
suffix, and the rules. Conditions on the stem include checking whether a
vowel is contained in the stem, whether the stem ends in a double consonant
or a specific letter, and so on. Furthermore, every stem has a "measure"
associated with it. This is calculated by matching the stem against the
pattern (C)(VC)^m(V). Here, C refers to any sequence of consonants, V
refers to any sequence of vowels (including y after a consonant), and terms
in parentheses are optional.

An example rule from Porter's algorithm would be:

If (m > 0) then replace the suffix "eed" with "ee".

For instance, "agreed" would be mapped to "agree".

N-gram stemmers find the unique n-gram sequences contained within a


word. A similarity measure between the unique n-grams is used to cluster
words into stemmed groups.

Table 23. An example ofN-Gram Stemming.


Word Digrams Unique Digrams
Graph {gr,ra,ap,ph} {gr,ra,ap,ph}
Graphics {gr,ra,ap,ph,hi,ic,cs} {gr,ra,ap,ph,hi,ic,cs}
Graphs {gr,ra,ap,ph,hs} {gr,ra,ap,ph,hs}

The similarity between two words 'w1' and 'w2' can be computed using
Dice's coefficient as follows. Let CD(w1, w2) be the number of unique
digrams common to w1 and w2, and UD(w) the number of unique digrams
in w. Then:

D(w1, w2) = 2 CD(w1, w2) / ( UD(w1) + UD(w2) )

This can then be represented as a matrix and clustered to produce
groupings of words to determine stems. For the example shown previously,
the (symmetric) similarity matrix can be computed to be:

            Graph    Graphics    Graphs
Graph       1        8/11        8/9
Graphics    8/11     1           8/12
Graphs      8/9      8/12        1

Figure 12. Example of a "similarity matrix".
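
The numbers in Table 23 and in the matrix above can be reproduced with a
few lines of Python (our own sketch):

def digrams(word):
    # The set of unique digrams (adjacent letter pairs) in a word.
    return {word[i:i + 2] for i in range(len(word) - 1)}

def dice(w1, w2):
    d1, d2 = digrams(w1), digrams(w2)
    return 2 * len(d1 & d2) / (len(d1) + len(d2))

for pair in [("graph", "graphics"), ("graph", "graphs"),
             ("graphics", "graphs")]:
    print(pair, dice(*pair))   # 8/11, 8/9 and 8/12 respectively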



Researchers have also applied linguistic knowledge to build "successor
variety stemmers" [Hafer & Weiss 1974]. The successor variety method
estimates the entropy at every character position, and uses points where the
entropy change is high in order to determine stems.

The probability of a letter 'j' occurring after a given prefix of letters
'c1c2...cn' is defined as the following conditional probability:

P(j | c1c2...cn) = C(c1c2...cnj) / Σk C(c1c2...cnk)

Here, C(.) counts the occurrences of a letter sequence in the word list. The
entropy at that position for the string 'c1c2...cn' can be defined as
follows:

H(c1c2...cn) = − Σj P(j | c1c2...cn) × log2 P(j | c1c2...cn)

The concept of the successor variety stemmer is best illustrated with an
example. Assume that the set of words { use, users, uses } is given. The
stemming analysis using the successor method proceeds as follows.

Pick a word, say 'users', and estimate probabilities and entropies at each
letter position.

2nd position: P(s | u) = 1 ; H(u) = 0

3rd position: P(e | us) = 1 ; H(us) = 0

At the 4th position, there are 2 options { r, s }:

P(r | use) = 1/2 ; P(s | use) = 1/2 ;

The entropy of 'use' can be calculated to be H(use) = 1

At the 5th position, there is one option { s }:

P(s | user) = 1 ; H(user) = 0

The above data is summarized in Table 24.

Table 24. Example of Entropy Successor Stemming.

Letter Position    Prefix    Entropy
2                  u         0
3                  us        0
4                  use       1
5                  user      0

Thus, we can observe that there is an entropy increase at the fourth
position, suggesting that the term "use" may be a stem. Further rules are
applied in determining whether a term is a viable stem or not. For example,
[Hafer & Weiss 1974] use the frequency counts of the terms to determine if
the term is chosen as a stem.
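
The entropy computation of Table 24 can be sketched as follows, assuming the three-word list above (the guard for prefixes with no followers is our own addition):

import math

def successor_entropy(prefix, words):
    # Letters that follow 'prefix' among the words in the list
    followers = [w[len(prefix)] for w in words
                 if w.startswith(prefix) and len(w) > len(prefix)]
    if not followers:
        return 0.0
    total = len(followers)
    counts = {c: followers.count(c) for c in set(followers)}
    return 0.0 - sum(n / total * math.log2(n / total) for n in counts.values())

words = ["use", "users", "uses"]
for i in range(1, 5):
    prefix = "users"[:i]
    print(prefix, successor_entropy(prefix, words))
# prints: u 0.0, us 0.0, use 1.0, user 0.0 -- matching Table 24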

3.4 Thesaurus

A thesaurus is a collection of words with synonym-related words. For
instance, in a thesaurus presented by Time, the entry for the word "clique"
is as follows:

Clique:
Syn: cabal, camarilla, camp, circle, clan, coterie, in-group, mob, ring

The thesaurus can be used both to expand the user's query and perhaps
even to reformulate or reweight the terms. If the search retrieves too many
items, the thesaurus can be used to limit the query to "important" terms. On
the other hand, if the search does not retrieve enough documents, then the
thesaurus can be used to expand the query and thus widen the search. In the
above example, the query of "clique" can be expanded by searching for the
synonym words specified by the thesaurus. However, there is a risk of
deviating from the specificity of the original query. Expanding the search to
include "ring" for an original query of "clique" may return unrelated
results with "ring" in a different context or sense (e.g. "wedding ring").

A thesaurus can be constructed automatically using a variety of methods.
Typically a database of thousands or millions of words is used to construct
the thesaurus. The first step is the construction of the vocabulary. This
involves stemming, and frequency analysis of the stemmed terms. Some of
the earlier text retrieval work suggests that words in the mid-frequency
range are good indexing terms (i.e. not too frequent or infrequent). Another
approach is to evaluate terms based on their discriminability: the
discrimination value with and without the word is measured, and the terms
with positive discrimination value are retained in the thesaurus. Frequently
occurring phrases are also extracted and added to the thesaurus. Some other
researchers have also used co-occurrence values to construct phrases. In the
last step of thesaurus construction, the vocabulary is organized in the form of
a hierarchy based on the statistical similarity distances between terms.

4. INDEXING AND SEARCH

An important aspect of information retrieval is to extract indices that will
be used to represent documents, and to retrieve relevant documents
efficiently. In earlier sections, we noted how indices were extracted by
removal of stopwords, and application of stemming. In this section, we will
see how documents are stored for efficient access using these indices. An
important aspect of indices is that they should not only be relevant to a topic
but also allow discrimination of this topic from other topics.

4.1 Inverted Files

Inverted files are a form of representing indices in a document or a group
of documents. Essentially, an inverted file or inverted index consists of a list
of index terms and, for each index term, the list of word locations at which it
occurs within a document or a set of documents. For instance, Table 25
illustrates an example of an inverted index of the paragraph:

"Beauty and symmetry co-exist in every botanical specimen discovered
so far. Added to the geometric patterns, the unique blends of colors that
flowers bloom in adds credence to the theory that beauty and symmetry are
an integral part of creation....."

Table 25. Example of inverted index.

Words        Occurrence
Beauty       1, 32, 45, 56
Symmetry     3, 34, 95, 120
Botanical    7, ...

Word "beauty" occurs in the lSI, 32nd , 45 th , 56th positions, "symmetry"


occurs in 3Td , 34th , 67 th position of the document and so on. Stop words are
generally excluded and eliminate a large portion of the storage required for
the inverted index. The set of words (or vocabulary) is also reduced by
stemming and other text processing methods discussed earlier. Generally,
the occurrence is specified as "block" locations, where a block is a segment
of text, or document filenames, and this reduces the size of the index file, but
the exact location of the word in a specific document cannot be identified.

Now, let's see how the inverted file helps in the retrieval of relevant
documents. If a user searches the text database with the keywords "beauty,
symmetry" in close proximity to each other in the text, then the first step is
the lookup of the keywords in the vocabulary of the text. Next, the
occurrence list for each index term is retrieved for analysis. Lastly, operators
and constraints are applied to produce a result. In our example, the occurrence
list for "beauty" is {1, 32, 45, 56}, and the occurrence list for "symmetry" is
{3, 34, 95, 120}. Enforcing a proximity constraint (e.g., the two words
should be within a distance of 5 of each other) results in the occurrence list
{1, 32} for "beauty" with "symmetry" within 5 words. Since the
vocabulary for large text databases can be in the thousands, and the number
of occurrences can be large, the inverted index stored on disk is divided into
two parts: a sorted vocabulary file with information on the location of each
index term in the inverted index, and the inverted index file itself. Merging
two inverted files requires a sorted merge of the vocabularies, in addition to
merging the occurrence lists of duplicate words. A drawback
of inverted files is the assumption that the text is a sequential combination of
words. Searching for sequences of words, such as phrases, involves a
complicated match of the occurrence lists for each of the individual query
words.
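
The example above can be traced with the following Python sketch, which builds a positional inverted index and applies the proximity constraint; the whitespace tokenizer and the abbreviated stoplist are simplifying assumptions:

from collections import defaultdict

STOPWORDS = {"and", "in", "the", "to", "of", "that", "so", "far", "every"}

def build_inverted_index(text):
    # Map each non-stopword to the list of word positions where it occurs
    index = defaultdict(list)
    for pos, word in enumerate(text.lower().split(), start=1):
        word = word.strip(".,")
        if word and word not in STOPWORDS:
            index[word].append(pos)
    return index

def proximity_search(index, w1, w2, window=5):
    # Positions of w1 that have an occurrence of w2 within 'window' words
    return [p for p in index.get(w1, [])
            if any(abs(p - q) <= window for q in index.get(w2, []))]

index = build_inverted_index(
    "Beauty and symmetry co-exist in every botanical specimen discovered so far.")
print(index["beauty"], index["botanical"])            # [1] [7], as in Table 25
print(proximity_search(index, "beauty", "symmetry"))  # [1]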

4.2 Tree Structure

An alternate representation is to construct a prefix tree using the text
data. Index points are chosen at specific points as desired, and a prefix tree is
generated using these index points. The tree is constructed by traversing
each character starting at an index point, and adding a node for each
character. At the leaf node, the actual text position of the sequence of
characters from the root of the tree to the leaf is identified.

Table 26 shows a sample sentence with the character position of each
word in the sentence. The word "essence" starts at character position 4,
"happiness" lies at character position 28, and so on. The asterisk-marked
positions represent index points. The tree is created by starting at each index
position, and traversing one character at a time. If there is an arc labelled
with the character at that position, then the traversal moves to the next node.
If there is no arc labelled with the character seen at that position, then a new
node is created, and the label of the arc between the two nodes is set to the
character at the current position. The leaf nodes store the word position of
the word that's generated by traversing from the root node of the tree down
to the leaf node. The tree can be compressed by merging nodes with single
outgoing arcs. Such a tree is called a prefix tree since the tree stores the
prefix of each word at the indexed positions.

Table 26. Example of text for prefix tree creation.

Character position:   1    4*    15    28*    42*      (* = index point)
Text: The essence of life lies in happiness and health.

Thus, the tree is created by starting at character position 4. The first arc
from the root node will have the label "e", the second arc from that node has
the label "s", and so on. The actual number of labels can be limited to a
specific length of characters (or fewer if the word is shorter than the
limit). Next, the character sequence for the word "happiness" at character
position 28 is added to the tree by starting at the root node, traversing
down the tree, matching each character with the label on the arc, and
creating a new arc if the character is absent. The resulting tree is shown in
Figure 13.

Figure 13. Example of a prefix tree.



Searching a prefix tree consists of traversing the tree from the root node
with each character of the query word until a set of matching leaf nodes is
reached, where the character positions of the prefix are stored.
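
A dictionary-based Python sketch of the uncompressed prefix tree follows; the "$positions" sentinel key used to store leaf data is an implementation convenience, not part of the formal definition:

def trie_insert(root, word, position):
    # One arc per character; the text position is stored at the leaf
    node = root
    for ch in word:
        node = node.setdefault(ch, {})
    node.setdefault("$positions", []).append(position)

def trie_search(root, prefix):
    # Walk the arcs for each query character, then collect stored positions
    node = root
    for ch in prefix:
        if ch not in node:
            return []
        node = node[ch]
    positions, stack = [], [node]
    while stack:
        n = stack.pop()
        for key, child in n.items():
            if key == "$positions":
                positions.extend(child)
            else:
                stack.append(child)
    return positions

root = {}
for pos, word in [(4, "essence"), (28, "happiness"), (42, "health")]:
    trie_insert(root, word, pos)
print(trie_search(root, "h"))    # positions 28 and 42
print(trie_search(root, "hea"))  # [42]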

4.3 Signature Files

The idea behind signature files is to divide the document into blocks, and
to associate with each block a signature that identifies which words may be
present in that block. This signature is in the form of a bitmap that's
generated using a hashing function applied to each word in the block. In the
example below, each block of text is enclosed in square brackets.

[The new version of the software was just released.] [Windows XP is the
latest version of software released recently.] [What software release lies
ahead?]

Assume that the hashing function generates the following bitmaps:


version: 010010
software: 001100
windows: 100001

In order to create the signature file, each text block is hashed using the
hashing function to generate a signature. Typically, the bitmaps of the words
of the block are bitwise-ORed to generate the signature for the block; the
signature bitmap for the first block of the example is thus [011110]. The
signature file then consists of a sequence of bitmap signatures, each with an
index pointer to the corresponding text block in the original document.

Searching using signature files is performed by first generating a bitmap
by applying the hashing function to the query words. Next, this bitmap is
compared with each signature in the signature file, and when a match is
found, the text block corresponding to the signature is retrieved. It may be
noted that the search is sequential, and the complexity is linear with respect
to the length of the document. The second observation is that the hashing
function determines the efficiency of the signature approach. Since the
signature is a combination of the words within the text block, there is a
possibility of false indications of words that are actually not present in that
text block. Such false matches can be minimized by appropriately defining
the hashing function.
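
The scheme can be sketched as follows; the toy hash that sets two bit positions per word is an assumption, since any hashing function that spreads words over the bitmap would serve:

def word_signature(word, bits=6):
    # Toy hash: derive two bit positions from the character codes
    h = sum(ord(c) for c in word)
    return (1 << (h % bits)) | (1 << ((h // bits) % bits))

def block_signature(block, bits=6):
    # Bitwise-OR of the signatures of all words in the block
    sig = 0
    for word in block.lower().split():
        sig |= word_signature(word, bits)
    return sig

def search(blocks, query, bits=6):
    q = word_signature(query, bits)
    # A block can contain the word only if all query bits are set in its
    # signature; matches must still be verified against the block text
    return [b for b in blocks if block_signature(b, bits) & q == q]

blocks = ["The new version of the software was just released",
          "What software release lies ahead"]
print(search(blocks, "version"))
# Block 1 is a true match; with this toy hash block 2 also surfaces as a
# false match, illustrating the false-indication problem discussed above.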

5. RANKING

5.1 Need for Ranking

While retrieval of documents relevant to a particular query is important,
it is often not sufficient. In large collections of documents, common queries
tend to match many documents. An important step is the ranking of these
matched documents. Thus, some documents may be ranked with a high
degree of relevancy to the query terms, while other matches may be less
relevant.

5.2 Term and Document Frequencies

Ranking is commonly done using statistical techniques. A popular metric
used for ranking is the TF-IDF measure. TF refers to "Term Frequency", and
IDF refers to "Inverse Document Frequency".

Given a document Di and a word wj, TF and IDF are defined as follows:

TF(Di, wj) = C(wj, Di) / C(Di)

IDF(wj) = Dn / D(wj)

Here C(wj, Di) is the number of occurrences of word wj in document Di,
and C(Di) is the number of words in document Di (note: there are numerous
variants here; one method is to count only the "index words" as opposed to
all the words). Dn is the total number of documents, and D(wj) is the number
of documents that contain the word wj. The actual weighting function
W(Di, wj) can be defined in many ways. For instance, a common metric
used is:

W(Di, wj) = TF(Di, wj) × log(IDF(wj))

The ranking metrics of all the terms in the user's query can be weighted
and combined to give a total TF-IDF metric that can be used for ordering
and selecting the retrieval results. Intuitively, TF tells us how frequently a
term appears in a certain document, while IDF tells us in how many documents
the term occurs. Thus, if a term appears frequently in a particular
document, and infrequently in other documents, then it is a good index to
that document.

5.3 Ranking Using TF-IDF

Ranking can be done by comparing the query to each document, or
collectively to a group of documents. Given a set of documents and a
vocabulary, each document can be represented as a vector of TF-IDF values
for each term. The query vector can be matched with the vector
representation of each document, and the documents ranked on the basis of
the cosine correlation of the two vectors. This approach is generally called
the vector-space model approach.

It is interesting to note that the query can itself be a document, from
which a TF-IDF vector can be constructed, and related documents retrieved.
Other approaches to document ranking include probabilistic methods. It is
also possible to rank the best fitting retrieved documents using different term
weightings or different models. Furthermore, if the documents are
unclassified, then unsupervised clustering can be performed on the vector
space representation in order to produce similarity relations between sets of
documents.

5.4 Illustration
Table 27. Sample documents for Vector Space illustration.

Mystery Document 1: The command is simply an executable program. The
shell reads typed commands and executes programs. UNIX has a file system
arranged as a hierarchy of directories. This manuscript is divided into
several chapters.

Mystery Document 2: Holmes stretched out his hand for the manuscript and
flattened it upon his knee. "You will observe, Watson, the alternative use of
the long s and the short. It is one of several indications which enabled me to
fix the date."

Let us assume that the indexing terms extracted from the sample
documents shown above are: {UNIX, manuscript, Holmes, Watson}

Note: In the example below, we count only the indexing terms.

TF(Doc. 1, "manuscript") = Y2
TF(Doc. 2, "manuscript") = 1/3
lDF("manuscript") = 2/2 = 1

Thus,
W(Doc. 1,"manuscript")
= W(Doc. 2, "manuscript") = 0;
TF(Doc. 1, "UNIX" ) = Y2
TF(Doc. 2, "UNIX" ) = 0
lDF("UNIX") = 2/1 = 2
W(Doc. 1, "UNIX" ) = Y2

[Assuming logarithm to base 2.]


W(Doc. 2, "UNIX" ) = 0

Similarly,
TF(Doc. 1, "Holmes" ) = 0
TF(Doc. 2, "Holmes" ) = 1/3
lDF("Holmes") = 2/1 = 2
W(Doc. 1, "Holmes" ) = 0
W(Doc. 2, "Holmes" ) = 1/3

Thus the vector space representations for the two documents are shown
in Figure 14.

              Doc. 1    Doc. 2
UNIX          1/2       0
manuscript    0         0
Holmes        0         1/3
Watson        0         1/3

Figure 14. Document vectors for the two sample documents.



Now, let us examine what happens when a user queries with the keyword
'Watson'. The query can be represented in the form of a vector I (note that
the last component corresponds to the word "Watson"):

I = [0 0 0 1]

The dot product between the input vector I and a document vector
gives a measure of the match between the query and the document. Thus, the
dot product between document 1 and the input query is 0, whereas the dot
product between document 2 and the input query is 1/3. This measure is also
called the cosine correlation distance, and is commonly used to measure
similarity between vectors. Thus, in this example, document 2 will be
retrieved in response to the query word "Watson". In general, the query words
can be used to construct the input vector to be used in matching.
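
The whole illustration can be reproduced with the sketch below; the document strings are abbreviated stand-ins for the Table 27 text, and only the four indexing terms are counted, as in the worked example:

import math

def tf_idf_vectors(docs, vocabulary):
    n_docs, vectors = len(docs), []
    for doc in docs:
        words = doc.lower().split()
        # count only the indexing terms, as in the illustration above
        n_index = sum(words.count(t.lower()) for t in vocabulary)
        vec = []
        for term in vocabulary:
            tf = words.count(term.lower()) / n_index
            df = sum(1 for d in docs if term.lower() in d.lower().split())
            vec.append(tf * math.log2(n_docs / df))
        vectors.append(vec)
    return vectors

docs = ["command shell unix manuscript directories",
        "manuscript holmes watson knee date"]
vocabulary = ["UNIX", "manuscript", "Holmes", "Watson"]
doc_vectors = tf_idf_vectors(docs, vocabulary)
query = [0, 0, 0, 1]  # the single keyword "Watson"
scores = [sum(q * d for q, d in zip(query, vec)) for vec in doc_vectors]
print(scores)  # [0.0, 0.333...]: Doc. 2 is retrieved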

5.5 Discussion

Furthermore, it is possible for the query itself to take the form of a
document. In this case, the query can be interpreted as a sample document
that the user provides in order to retrieve similar documents.

It is interesting to note that the vector space representation of documents
allows us to cluster in this vector space. Thus, we can automatically group
together "similar" documents. By now, it should be obvious to the reader
that word order information is discarded in the above example.
Researchers have studied the use of N-gram word sequences in addition to
individual stems. The problem is that the already high dimension of these
vector spaces increases much more with word sequences or phrases.
Furthermore, when single words are used in constructing the vector
representation to be used for clustering documents, the "similarity" between
documents is purely in terms of word frequencies, and not really in terms of
"meaning". Nevertheless, the vector space model has proven to be very
useful in automatic information retrieval and is a popularly used method.

6. QUERY OPERATIONS

6.1 Purpose

It is often difficult for a user to formulate the right query to retrieve
specific information. In order to refine the query that the user inputs, query
operations process and modify the query so as to improve retrieval
performance. Two common query operations are query expansion and term
reweighting. In query expansion, the user query is processed and expanded
based on the relationships of words in the query to other words in the
documents. In the term reweighting approach, the terms in the expanded
query are reweighted so as to improve retrieval performance.

It is important to note that query expansion can be achieved using
various sources of information [Frakes & Baeza-Yates 1993]. The first is
feedback from the user, also called "relevance feedback". In such an approach,
the user's implicit or explicit approval of a retrieved document is used to modify
the query expansion and reweighting parameters. The second approach is
the utilization of global information from the whole document collection;
an example of the global approach is the extraction and utilization of a
similarity or statistical thesaurus. The third approach is a local analysis of
the query itself, or of the documents retrieved by the initial query, the results
of which are applied to modify the query.

6.2 Relevance Feedback for Term Reweighting

Relevance feedback is one of the popular methods of query expansion.
The idea is to use feedback from the user on which documents are
relevant to the query and which are not, in order to weight the query terms so
as to bias the retrieval towards the relevant documents. Typically, a set of
documents is retrieved and shown to the user (based on the user's query).
Next, the user marks some of them as being relevant. These are analysed, and
the terms that are pertinent to these documents are weighted so as to increase
their importance relative to other terms. Typically, the weights of the terms in
documents that are marked relevant are increased (by a constant normalized
by the cardinality of the set of relevant documents), and the weights of the terms
in documents that are deemed non-relevant are decreased (again normalized by
the number of non-relevant documents). This procedure is parameterised by
weighing the terms in the original query, terms in the relevant documents, and
terms in the non-relevant documents using different constants that can be
iteratively tuned. Another approach to term reweighting using relevance
feedback is by estimating the probability that a term is present in a
relevant/non-relevant document. Such methods generally assume term
independence and binary document weighting (i.e. a document is relevant or
not), and variant methods to overcome these limitations have been researched.
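
The constant-weighted scheme just described corresponds to the classical Rocchio formulation; a minimal sketch, assuming TF-IDF vectors and illustrative values for the three tunable constants:

def rocchio(query_vec, relevant, non_relevant, alpha=1.0, beta=0.75, gamma=0.15):
    # new query = alpha*query + beta*mean(relevant) - gamma*mean(non-relevant)
    new_q = [alpha * q for q in query_vec]
    for docs, coeff in ((relevant, beta), (non_relevant, -gamma)):
        for vec in docs:
            for i, x in enumerate(vec):
                new_q[i] += coeff * x / len(docs)
    return [max(w, 0.0) for w in new_q]  # clamp negative weights to zero

query = [1.0, 0.0, 0.0]
relevant = [[0.8, 0.4, 0.0]]
non_relevant = [[0.0, 0.0, 0.9]]
print(rocchio(query, relevant, non_relevant))  # [1.6, 0.3, 0.0]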

6.3 Clustering methods for term associations

In clustering approaches to query expansion, documents are analysed for
clusters in the document space. This translates to groups of words with some
relationships between them. For instance, association clusters of words could
be extracted by detecting frequently co-occurring stems inside documents.
The cluster analysis can be global (i.e. on the whole document space), or
local (i.e. on a subset of the document space). A common method of extracting
associations is to use a correlation matrix between the stems. Experimental
results seem to indicate that local clustering approaches tend to perform
better than global clustering procedures. A simple example of local
clustering is clustering using only the documents retrieved in response to
the user's query.

6.4 Automatic Thesaurus expansion methods

In the global expansion methods, information from the whole set of
documents is collectively used in order to generate weighted relationships
between words. An example is the automatic generation of a thesaurus. In
the simplest case, the terms in the set of documents are analysed for frequent
co-occurrences in the document space: for instance, the frequency of co-
occurrence within the same documents. Another approach is to use the
notion of a concept space, where each term is indexed by the documents in
which it occurs. The similarity between terms in this concept space can be
used to extract related terms for query expansion (a similarity thesaurus).
After constructing either a local or a global model, the query expansion
procedure can be summarized as follows:

• (if Local Analysis) Retrieve documents for query.
• (if Local Analysis) Analyse retrieved documents and expand query.
• (if Global Analysis) Analyse query with global models and expand with related weighted terms.
• Retrieve documents for expanded query.

6.5 Example

As an example of query expansion, let the user query be "quotes".
Clearly, this is quite a generic indexing term, and it is difficult to predict
what the user is exactly looking for. A direct retrieval based on this word is
likely to result in a large collection of unrelated documents. Query expansion
allows the information retrieval system to include "related items" in order to
expand the query. In this example, "quotes" might have the following set of
related items: "quotes" → {"stocks", "punctuation"}. A search on this
expanded query will likely produce focused documents on stock quotes. Of
course, if the user is not looking for either "stocks" or "punctuation", then
that's a case where query expansion hurts retrieval performance.

7. LATENT SEMANTIC INDEXING

7.1 Motivation

In an earlier section, we studied the application of TF-IDF measures to
estimate the match between a document vector and a query vector. Since
each document is a vector of dimension N, where N is the vocabulary
size, M documents can be represented as an N x M matrix. If the
vocabulary size is 1000, and there are 1000 documents, then we have a
matrix with 1 million entries. Typically, the vocabulary and the number of
documents are larger, thus increasing the number of parameters greatly.
For such a large number of parameters, a very large amount of data is
required to reliably estimate them. Furthermore, since the query vectors are
quite sparse, it is difficult to measure the impact of a group of index words
on the relevance to the set of documents.

7.2 Latent Semantic Indexing

Latent Semantic Indexing (LSI) [Deerwester et al 1990] is a technique
for projecting both document and query vectors into a lower dimensional
vector space, in an attempt to capture the most significant dimensions. Consider
the following example document vectors:

Document 1: [1 0 0 1 1 0 0 0]
Document 2: [1 1 1 1 1 0 0 0]
Document 3: [1 0 1 1 1 0 1 ... 1]

The idea behind Latent Semantic Indexing is to project these document
vectors into a lower dimension vector space. It is important to maintain the
'separability' between the documents in this lower space. The lower
dimension projected vector is called a "latent semantic vector". It is latent
since it is a 'hidden' space. The term semantic indicates that the projection
captures some type of "semantic" information in the group of words that
make up the document.

7.3 Constructing the Document Matrix

Latent semantic indexing applies factor analysis to the problem of
document indexing. The first step in the LSI (Latent Semantic Indexing)
approach is to construct a document representation matrix, by combining the
vectors shown in Figure 14, as shown in Figure 15.

              Doc. 1    Doc. 2
UNIX          1/2       0
manuscript    0         0
Holmes        0         1/3
Watson        0         1/3

Figure 15. Document matrix representation A.

Thus each row corresponds to a term, and each column to a document.
Each element of the matrix A[w, d] is the TF-IDF measure of the term 'w' in
the document 'd'. Clearly, with a large number of terms and documents, this
matrix is huge and typically sparse. Thus, in a sense, the problem is over-
parameterised. In order to extract the "important features" from this matrix,
factor analysis can be done and only the top few eigenvectors retained.
Singular Value Decomposition (SVD) can do this, as shown in the equation
below.

The matrix A is decomposed into three matrices U, S, and V:

A = U S V^T

If the dimension of the matrix A is (N x T) (i.e. N index words, and T
documents), then U is the (N x R) matrix of left singular vectors, S is a
diagonal (R x R) matrix, and V is the (T x R) matrix of right singular
vectors. The matrix S is by definition positive definite, and the matrices U
and V are unitary. Note that the resultant matrix is an approximation when
R << N, T.
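
With a numerical library the decomposition and a rank-R projection can be sketched in a few lines. The matrix below is the one from Figure 15; folding the query into the latent space through U is one common convention:

import numpy as np

# Term-document matrix A (rows: UNIX, manuscript, Holmes, Watson)
A = np.array([[0.5, 0.0],
              [0.0, 0.0],
              [0.0, 1/3],
              [0.0, 1/3]])

U, s, Vt = np.linalg.svd(A, full_matrices=False)

R = 1  # keep only the top singular vector
A_approx = U[:, :R] @ np.diag(s[:R]) @ Vt[:R, :]  # rank-R approximation of A

# Documents (and queries) can be compared in the R-dimensional latent space
docs_latent = np.diag(s[:R]) @ Vt[:R, :]
query = np.array([0.0, 0.0, 0.0, 1.0])  # the keyword "Watson"
query_latent = query @ U[:, :R]
print(A_approx, docs_latent, query_latent)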

The term "Latent Semantic Indexing" was coined to indicate that the
method tries to extract the most important (in a mathematical sense)
"semantic" information represented in the matrix A. LSI is not only useful in
terms of efficiency of computation and storage, but also improves the
retrieval quality in many cases by automatically filtering "noisy information"
(as a by-product of eigen value ordering). LSI may also assist in solving the
problem of synonymy: different terms describe the same underlying
concepts. By projecting into the most prominent sub-spaces in a non-linear
fashion, some of these difficult to extract relationships may be automatically
captured. Traditional vector space models assume term independence, but in
reality many terms are related to each other, and show strong associations.
Such term dependencies are often minimized in the projected space.
Drawbacks of LSI include huge storage requirements and computational
costs, in addition to assumption of normally distributed data.

8. EVALUATION METRICS

8.1 Evaluating Retrieval Methods

It is not easy to measure the quality of an information retrieval system.
On the one hand, determining whether a set of documents matches a query is
based on human judgment, and is usually subjective and unreliable. On the
other hand, the goal of an information retrieval system is to retrieve
documents that satisfy users' requests.

8.2 Precision

Precision measures the fraction of documents retrieved by the
information retrieval system that are relevant. This measure is quantified by
defining the following sets:

R - set of documents relevant to a query Q
A - set of documents that are retrieved by the system.

Precision is defined as:

Precision = |R ∩ A| / |A|

Precision measures the ratio of the number of relevant documents retrieved
by the system to the total number of retrieved documents. High precision
indicates that a large fraction of the documents retrieved by the system are
relevant to the query.

8.3 Recall

A system can achieve high precision by retrieving only a small
number of documents that it is most confident about. In such a case, the
precision of the system may be high, but many relevant documents are
missed, reducing the utility of the retrieval system. "Recall" is measured to
capture how completely the relevant items are retrieved.

Recall is computed as follows:

Recall = |R ∩ A| / |R|

Thus, recall is the ratio of the number of relevant documents retrieved to
the total number of relevant documents.
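
Both measures reduce to simple set arithmetic; a short sketch with hypothetical document identifiers:

def precision_recall(relevant, retrieved):
    hits = set(relevant) & set(retrieved)
    return len(hits) / len(retrieved), len(hits) / len(relevant)

p, r = precision_recall({"D1", "D2", "D3", "D4", "D5"},
                        {"D3", "D4", "D5", "D8"})
print(p, r)  # 0.75 0.6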

8.4 Precision Recall Graph

Quantitative measures allow the comparison of different techniques. The
most common quantitative method of reporting information retrieval
performance is the recall and precision graph. Recall is the ratio of the number
of relevant documents retrieved in response to the user's query divided by the
total number of relevant documents in the database. Precision is the ratio of the
number of relevant documents retrieved divided by the total number of
documents retrieved in response to the user's query. Ideally, a perfect
information retrieval system should have both recall and precision values at 1.
In reality, increasing precision decreases recall and vice-versa. Thus the
precision-recall curve is typically as shown in Figure 16. IR algorithms are
typically evaluated and compared on standard databases such as the TREC
databases available from the National Institute of Standards and Technology
(NIST).

(Plot of precision on the vertical axis against recall on the horizontal axis;
precision falls as recall increases.)

Figure 16. Precision versus Recall Graph

9. CONCLUSIONS

In this chapter, the reader has been introduced to the field of information
retrieval. The problem of information retrieval is to retrieve documents that
are relevant to a user's query. The various steps involved in the design of an
information retrieval system have been described.

The early steps of filtering textual data include tokenization, stemming,
and thesaurus construction in order to define indexing terms. Using the
indexing terms, the documents are stored in a format appropriate for fast
indexing. The document vector space model is commonly used, and the TF-
IDF measure has been very successful for indexing. Many outstanding issues
remain in all of the steps of IR systems. The performance of IR on large
databases is still to be improved, and relevance feedback from the user is an
important and difficult problem.

FURTHER READING

[Frakes & Baeza-Yates 1993] and [Baeza-Yates 1999] are excellent


books that cover the techniques in information retrieval. [Francis & Kucera
1982] is an interesting book on frequency analysis of the English language.
[Deerwester et al 1990] covers the technical aspects of latent semantic
indexing.

EXERCISES
Take two sections from your favourite newspaper (e.g. sports and
business). Pick two representative paragraphs as text data for exercises 1-4.

1) For the two paragraphs chosen, list the words and frequencies of
words that occur in the paragraphs. Study the distribution of words, and see
if you can identify index words that can distinguish the two paragraphs.

2) Apply stop-word elimination, and build an inverted index into the two
paragraphs.

3) For each of the index words, compute the TF-IDF measure with each
of the paragraphs, and represent as a matrix of section versus index words.

4) Use the matrix computed in (3) to determine the section that matches
best for each of the index words used as a query. Repeat for pairs of
index words.

5) Discuss the idea behind Latent Semantic Indexing. What are the pros
and cons?

6) Illustrate how a thesaurus can aid in information retrieval.

7) The set of documents relevant to a query Q is {D1, D2, D3, D4, D5}.
The set of documents retrieved by an IR system is {D3, D4, D5, D8}.
Compute the precision and recall.

8) Design and develop a prototype information retrieval system that can take
as input a set of categorized text documents, and build a set of index words. In
response to a query, the system should be able to use either the TF-IDF
measure or an inverted index to retrieve the best matching relevant documents.

9) The IR system in (8) can be used in "classifier mode". Given a new
document, extract the index words from the document. Use these index
words as a query to find the "class/category" of the closest matching set of
documents. Add the new document to that class. Repeat this over a set of
documents. Study the data and results to see if documents are mapped to the
"correct" categories.
Chapter 5

Web Search and Directory

The world is the ever-changing foam that floats on the surface of a sea of silence

Abstract: Search and directory are some of the earliest applications of the World Wide
Web. In this chapter, the general architecture of a Web search system is
presented. An important aspect of a Web search system is the Web crawler
that crawls the World Wide Web by following links and storing this
information for processing. The issues in crawling, and how queries are used
to retrieve documents, are presented. Variants in search such as focussed
crawling, meta-search, and dynamic search are also discussed. An overview of
Web directory systems and different methods of constructing Web directories
are described. Ranking algorithms such as PageRank™ and web 'social
structure' extraction using Hub/Authority analysis are also covered.

Keywords: Web Search, Web Crawler, Ranking, PageRank, Hub, authority, dynamic
search, focussed crawler, topic distillation, web directory, web classification
taxonomy, automatic taxonomy generation, relevance feedback.


1. INTRODUCTION

Some of the earliest applications that fuelled widespread adoption of the
Web include the web directory and search. The Yahoo directory, for
instance, was established in order to classify web sites into specific
categories, making it easier for users to browse through the different
categories in order to access specific information. At the same time,
development of crawlers and agents that automatically explored and
indexed web documents enabled the development of web search engines.

At the onset of the World Wide Web, systems indexed a few hundred
thousand documents, and searching these documents did not pose major
technical hurdles. In fact, one of the earliest Web search engines, called the
World Wide Web Worm (WWWW) [McBryan 1994], indexed just 110,000
pages and accessible documents. However, with the rapid growth in the
number of web pages, and the increased amount of dynamic or changing
content, web search and directory systems are somewhat challenged. Over
the last five years, improvements in the technology have enabled building
directories out of hundreds of millions of documents, periodically crawling
the Web to find new documents and update information indexed from
previously traversed documents, and providing the ability to search through
the vast amount of information from all around the world. Such search and
directory systems have to handle billions of requests at major sites
today. This chapter summarizes the basic architecture of web search
and directory systems, discusses the challenges, and outlines areas of future
improvements.

2. WEB SEARCH

2.1 Web Search System Architecture

The goal of a Web search system is to collect data from the World Wide
Web, index this data, and extract relevant documents from this database in
response to a user's query. The first step is the collection of the pages from
around the Web. With the Web containing hundreds of millions of
documents, this is an expensive task both in terms of computation and
storage. The Web crawler is the component of the Web search system that
performs the crawling. Crawling is essentially exploring the Web in order to
find documents and their contents. The output of crawling is a list of URLs
and the retrieved documents, which are stored in a compressed manner. The
storage of the documents is generally distributed across a farm of storage
servers.

The next component of the search system is the indexer. The purpose of
the indexer is to create indices from the crawled documents, so that in
response to a query, an efficient lookup can be done to retrieve documents.
Thus the indexer performs tasks such as parsing of the documents, extracting
anchor texts (associated with links), generating the lexicon, and other such
related tasks. The steps of crawling and index generation can be viewed as
the "representation phase" of a Web search system.

The second phase of the search system is the retrieval. The core
component of the retrieval phase is the search module. A user's query is
processed (with techniques such as query expansion) and submitted to the
search module. The search module uses the inverted indices created by the
indexer and retrieves the relevant set of documents. These documents are
then ranked using various algorithms. Figure 17 summarizes the overall
architecture of a Web search system.

(Block diagram: the Crawler writes pages to a Storage Server; the Indexer
builds the document database; user queries are answered from the index.)
Figure 17. Overview of Web search system.

2.2 Crawling Phase

One of the important aspects of Internet applications like search engines,
web miners, and agents is the need for accessing the information on the
WWW, and collating the collected information in a useful form. Programs
that serve such a purpose are generally referred to as web crawlers. These
programs "crawl" the web, access information from various web sites,
classify it, represent the information in an efficient manner, and
periodically repeat this process in order to maintain the temporal validity of
the stored information.

While the basic notion of web crawling is fairly straightforward, there are
two important difficulties inherent in the problem of crawling the World
Wide Web. The WWW contains hundreds of millions of pages, and is growing
at a rapid rate. Even with the most powerful multi-processor systems,
prominent web crawlers can cover only around 30-40% of the WWW.
Furthermore, the time and memory requirements to crawl are overwhelming.
The duration of a crawling cycle takes anywhere from a few weeks to
months.

• Crawler Architecture

Figure 18 shows a schematic block diagram of a general-purpose
crawler. The crawling procedure usually starts with a few seed URLs where
the crawling begins. The core of the crawler engine creates crawler threads,
each of which processes a particular URL by accessing the document URL
through a URL server. The crawl graph module maintains the crawl graph
structure and also keeps track of whether a URL has already been processed
or not. After a URL is processed, its unprocessed children are added to the
crawler URL queue for later processing. The document information obtained
by the crawler threads is passed on to other modules for indexing into a
database/representation for fast retrieval and analysis. The URL Queue
maintains a list of unprocessed URL nodes. If the URL Queue is
implemented as a FIFO [4], then the crawling is breadth-first, whereas if it is a
LIFO, then the crawling is done depth-first.

[4] FIFO stands for "First In First Out" and LIFO stands for "Last In First Out". In the context
of crawlers, this applies to the order in which un-traversed URL links are stored and
processed further.
Other metrics such as URL analysis, document content, and last
modification history can be used to order the URLs for different crawling
priorities. [Najork & Wiener 2001] suggest that breadth-first crawling yields
high-quality pages. [Cho & Garcia-Molina 2000a,b] present many
interesting approaches to crawling. Various policies such as a uniform
allocation policy (same rate of crawling for each page), and a proportional
policy (crawling more often for changing pages) are evaluated. They
conclude that if pages change at varying rates, then it is better to perform
uniform crawling. Analysis of a subset of the Web gave some interesting
results [Cho & Garcia-Molina 2000a]: 40% of .com domain pages change
daily, in contrast to .edu/.gov domains where over 50% do not change. The
modules "distiller", "classifier", and "taxonomy" are pertinent to a particular
enhancement to crawling called "focussed crawling" and are discussed later.
For now, these modules can be ignored.

(Block diagram of the crawling system: seed URLs feed the crawler URL
queue; crawler threads fetch documents through the URL server and update
the crawl graph; distiller, classifier, and taxonomy modules support
focussed crawling.)
Figure 18. Web Crawling System



• The Crawling Algorithm


Table 28. Web Crawling Algorithm

1. Initialize the URL Queue with seed URLs and the Crawl Graph.

2. While URL Queue not empty and Crawler Limits not reached:

3. Get next URL document from Queue.

4. Access URL server and retrieve document.

5. Parse URL document to extract the following:
   a) Keywords used for indexing (e.g. from META tags)
   b) URLs within the page and anchor tags
   c) Retrieve and pre-process actual document content

6. Index document into IR database.

7. Mark link as visited.

8. Add/update Crawl graph with outgoing links.

9. If any outgoing link is not marked as "done", then add URL to Crawler Queue and repeat
from step 3.

The steps of the basic crawling algorithm are summarized in Table 28.
The first step of a crawler is the initialization phase, wherein all the relevant
data structures are reset and initialized. The crawling procedure begins
when a list of "seed URLs" is provided. The crawling begins by accessing
these seed URLs, and proceeds with the algorithm. The seed URLs are
usually put in the crawler URL queue. Then the main thread is started. This
master thread checks the URL queue, and retrieves the next URL to be
processed. If the number of threads is within the specified limit, then this
URL is passed to one of the "crawler-slave" threads. Each "crawler-slave"
thread communicates with a web server and accesses the document. Some
form of rudimentary parsing can be done at this stage itself, depending on
the actual needs of the crawler.

The next step of document analysis is parsing the document. At the
simplest level, the meta tags can be used to classify each document.
However, many documents are untagged or improperly tagged, and further
analysis of the document is often needed. Researchers have also proposed
analysis of the actual link, and of the anchor text for a link, in order to get
relevant information about the document itself.

After the processing of the document is completed, this URL node in the
crawl graph is marked as visited. The children/outgoing links from this
document are checked to see if there are any unprocessed links. The
unprocessed links are then put into the crawler URL queue for later
processing.

Commonly specified parameters include the crawl depth, importance
functions, whether the crawl is depth-first or breadth-first, and what kind of
document processing should be done. Other pattern matching specifications,
such as crawling all websites that have a URL of the form *.edu, may be
allowed.

The crawl graph is essentially a graph structure where nodes are
associated with URLs and arcs connect each node to the outgoing links from
that node's document. The crawl graph usually stores additional information
such as:
(a) URL
(b) Whether a node has been traversed or not
(c) Importance/Rank
(d) Classification tokens/keywords/anchors

Since the crawl graph essentially represents the whole structure of the
WWW (or the parts that are covered by the crawler), it is important to
compact the individual structures. For instance, many of the above attributes
can be packed tightly either using manual coding or coding algorithms such
as Huffman coding. Since children URLs share prefixes with the parent,
the memory requirements of the crawl graph structures can be decreased
substantially by having a delta representation of the URLs. Another
important aspect to consider in a crawler is DNS caching. A large portion of
the time is spent in DNS resolution of the URLs, and having a DNS cache
significantly speeds up crawling time.
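
A minimal breadth-first crawler along the lines of Table 28 can be written with the Python standard library alone; robots.txt handling, politeness delays, DNS caching, and the compact crawl graph discussed above are all omitted in this sketch:

from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links.extend(v for k, v in attrs if k == "href" and v)

def crawl(seed_urls, max_pages=100):
    # FIFO queue of unprocessed URLs gives breadth-first order
    queue, visited, documents = deque(seed_urls), set(), {}
    while queue and len(documents) < max_pages:
        url = queue.popleft()
        if url in visited:
            continue
        visited.add(url)
        try:
            html = urlopen(url, timeout=5).read().decode("utf-8", "replace")
        except (OSError, ValueError):
            continue  # dead or malformed link: skip and move on
        documents[url] = html  # handed to the indexer in a real system
        parser = LinkExtractor()
        parser.feed(html)
        queue.extend(urljoin(url, link) for link in parser.links)
    return documents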

2.3 Indexing Phase

The indexing phase processes all the documents retrieved by the
crawlers, and creates "hit lists", or lists of word occurrence counts, and
indices into the documents. The three steps in index creation are:

• Parsing

Before documents are analysed and indexed, they need to be
parsed and "cleaned". A number of HTML pages (especially before the
XML standard was defined) have incorrectly nested tag elements, and even
typographic errors. These need to be cleaned before further processing can
be done using the document data. Parsing the document also enables the
identification of useful indexing words and terms for document analysis.
Examples include the extraction of anchor and meta tag words for indexing,
and document body extraction for document analysis.

• Index creation and storage

The next step of the indexing phase is the creation of forward and
inverted indices. As an example, we will illustrate the data structures
required using the anatomy of the early Google prototype search system
[Brin & Page 1998]. The Google engine maintains the following data: hit
lists, a forward index, and an inverted index. A hit list is a "list of a particular
word in a particular document including position, font, and capitalization
information" [Brin & Page 1998]. They use two notions of hits: fancy and
plain. Fancy hits correspond to matches in the URL, title, anchor text or meta
tags, whereas plain hits represent other matches. The forward indices are
"partially sorted" and stored in distributed databases called "barrels".
Each word in the lexicon is assigned a wordID, and each document/URL is
assigned a docID. Each barrel corresponds to a range of wordIDs, and the
list of docIDs that have words in that range is stored in that barrel as a
sequence of hit lists.

The next step is the creation of the inverted indices. This is achieved by
looking at the hit lists stored in each barrel, and determining, for each wordID,
the list of docIDs and positions of occurrence. This is then stored in the
barrel, and a reference stored in the lexicon for every wordID. The docIDs in
the inverted index list for a corresponding wordID can be ordered in many
ways, for instance sorted by docID, or sorted by importance in the document.
In the former case, it is easy to integrate results from multi-word queries,
whereas the latter enables better retrieval for single word queries.

2.4 Retrieval Phase

The purpose of the retrieval phase is twofold: analysis of the query to
weight or expand the query, and extraction of the documents that best match
the (possibly modified) query. This can be a single phase, or even an iterative
mechanism of refinement. It will soon be clear why the inverted docID index
lists have been maintained.

The retrieval phase takes the set of query words and, using the inverted
docID index lists, retrieves the set of docIDs that contain the queried words.
These are then merged into a final set of docIDs that contain all the query
words, and ranked to generate the result of matching documents. These steps
are summarized in the algorithm below:

• Perform term reweighting and query expansion (as required).
• For each index term in the query, retrieve the set of docIDs using the inverted index list.
• Process constraints on the query (e.g. AND/OR operations) and apply them to the sets of docIDs returned (a merge sketch follows this list).
• Rank the docIDs and return the list of sorted matching documents.
• If relevance feedback is used, track which returned documents are marked as relevant by the user (implicitly or explicitly).
• Refine the query, perform focussed crawling, and return to the first step (if an iterative procedure).
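
For the AND operation in the third step, docID lists kept sorted by docID (as in the barrel layout described earlier) can be intersected with a linear merge; the postings below are hypothetical:

def and_merge(posting_lists):
    # Pairwise linear intersection of sorted docID lists
    result = posting_lists[0]
    for postings in posting_lists[1:]:
        merged, i, j = [], 0, 0
        while i < len(result) and j < len(postings):
            if result[i] == postings[j]:
                merged.append(result[i]); i += 1; j += 1
            elif result[i] < postings[j]:
                i += 1
            else:
                j += 1
        result = merged
    return result

# docIDs containing "web" and "crawler" respectively
print(and_merge([[2, 5, 9, 17], [5, 9, 12]]))  # [5, 9]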

It is unclear whether many of the text retrieval pre-processing steps are
useful in the context of the Web. Stop word elimination is definitely a
useful step. Results on stemming are mixed: often the actual unstemmed
index word is a more accurate indicator of the documents the user is seeking.
Simple document re-weighting based on relevance feedback is more useful
than actual query expansion or term reweighting. One of the major problems
with Web data is the presence of a number of noisy index terms. There
are many methods of ranking, and this topic is discussed in a later section.

2.5 Issues in Web search


While the web search engine is fundamentally not different from the
classic IR text retrieval engine, the Web consists of a very large collection of
documents from diverse sources, in different formats and languages, with
additional constraints such as web linkage and usage patterns. The following
issues add complexity and pose difficulties for web search systems:

(i) Dynamic nature of web documents.

One of the foremost difficulties with searching and indexing the WWW
is the dynamic nature of its content. Some crawlers keep track of the rate at
which the contents of a web document change, and use this information to
alter the frequency of update (i.e. pages that change often are more
frequently crawled). Despite such heuristics, the dynamic nature of web
documents poses a problem for search engines and results in the retrieval
of documents using outdated data. Furthermore, WWW links can also
be "dead", i.e. no longer in existence, and the crawler/search system needs to
cope with such cases efficiently.

(ii) Sheer size of the problem.


As mentioned in earlier chapters, the WWW consists of hundreds of
millions of documents. Indexing and storage of such a huge amount of
information takes a lot of disk space. For instance, the Google search engine
discussed earlier indexed around 24 million pages (total size of 147.8 GB),
and compressed the data to 53.5 GB. However, since then, many
improvements in compression methods, and cheap availability of both
computing and memory resources have alleviated the problem. A large
number of Web documents can now be stored and indexed efficiently using
hundreds of machines in a distributed fashion.

(iii) Difficulty in evaluation.


While precision and recall are useful metrics in standard IR datasets, it is
difficult to apply these evaluation metrics to web search engines. This is
because we need a fixed test bed of documents that are marked as relevant or
irrelevant to a set of queries. Furthermore, it is difficult to grade the quality
of these documents with respect to a query (since one search engine may
want to weigh pages from one company higher due to commercial
considerations). [Hawking et al 1999] suggest computing precision and
recall on the arbitrary number of search results returned by the search
engines. For instance, if the search engine found 8 relevant documents in the
first 10 pages that it returned, then we can say that the engine has a precision
of 0.8 at 10 documents retrieved.

(iv) Diversity of data.


The data on the WWW is extremely diverse and contains a number of
topics. The Yahoo directory alone for instance lists hundreds of categories
and subcategories. Thus, it is extremely difficult to build general-purpose
search engines that cater to every possible query. In a later section, we
discuss emergent ideas such as focused search. Furthermore, web documents
contain a lot of extraneous information absent in "clean textual corpora" like
markup information, hypermedia, and a lot of information unrelated to the
main theme of a web page (if there is any).

(v) Interactive nature of web search.

Web search can be an interactive process. Not much research has been
done in this direction. Grouper [Zamir & Etzioni 1998] clusters the search
results and presents the clustered results for the user to pick from. On
another front, the relevance feedback given by the user can be used to
improve/adapt the document models and/or term weights [Gudivada et al
1997].

(vi) Sparsity of Query.

"... the average query submitted by users to World Wide Web search
engines is only two words long" [Croft et al 1995]. Search systems try to
address this problem by performing different methods of query expansion
using the original query. Nevertheless, this statistic highlights the difficulty
of the web search problem. More recently, Natural Language Processing
technology is used in conjunction with sentential queries in order to attempt
to provide users with pertinent information. Tighter integration of natural
language systems and information retrieval systems may alleviate the
problem of web information retrieval in the future. With the current
technology, it is unclear whether any significant improvement can be
achieved in the short-term using Natural Language Processing techniques.

(vii) Spamming.

Another important aspect that makes the development of useful and
accurate search engines difficult is the notion of "spamming" [SciAmer
1999]. A web page designer may include a list of keywords (such as
"cheapest, best sale, best book" etc.) in order to try to deceive an indexing
system into associating the document with those keywords. Sometimes,
developers write these words repeatedly in colors that are invisible to the
readers. One solution is to discard terms that have either very high or very
low frequencies. Spamdexes will be part of the (unusually) high frequency
terms, and thus be discarded or reweighted.

3. VARIATIONS IN SEARCHING

3.1 Meta-Search

Meta-search can be viewed as a search technique that votes using a set of
experts. The same query is submitted to different search engines, and the
meta-search engine result is generated from this collective information.

The simplest example of a meta-search engine would be majority voting.
The query is submitted to a number of search engines. For each link, the
total count of "votes" received for that link is kept, and the links are ordered
using this count. Of course, it is likely that the search engines produce
different results with little overlap.

(Diagram: the query is sent to the Meta-Search Engine, which submits it to
Search Engines A, B, and C and merges their results.)

Figure 19. Meta-Search Engine.

The more general approach towards meta-search engine development is
based on principles of machine learning. One variant of the "Hebbian
learning" approach would be to assign positive weights to a search engine if
the user picks its result over other search engines. For instance, SavvySearch
[Dreilinger & Howe 1997] implements such an approach, where a matrix of
query terms vs. search engines is maintained. If the user picks a result from a
particular search engine, then the counts for that search engine and those
query words are incremented. If the search engine does not produce any result
at all, then the counts for that search engine and query words are decremented.
The matrix can thus be adaptively updated in response to users' choices.
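
Both the majority-voting scheme and the SavvySearch-style score matrix can be sketched briefly; the link names and the +1/-1 update constants are illustrative assumptions:

from collections import Counter, defaultdict

def majority_vote(results_per_engine):
    # Count how many engines returned each link; order by vote count
    votes = Counter(link for results in results_per_engine for link in results)
    return [link for link, _ in votes.most_common()]

scores = defaultdict(float)  # (query word, engine) -> confidence score

def update_scores(query_words, engine, user_picked, returned_any):
    for word in query_words:
        if user_picked:
            scores[(word, engine)] += 1.0  # user chose this engine's result
        elif not returned_any:
            scores[(word, engine)] -= 1.0  # engine returned nothing

print(majority_vote([["a.com", "b.com"], ["b.com"], ["b.com", "c.com"]]))
# ['b.com', 'a.com', 'c.com'] -- "b.com" received the most votes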

3.2 Relevance Feedback

In the previous section on meta-search, we saw the application of user
feedback for modifying the meta-search engine's confidence in individual
search engines. The same approach can be applied to a single search engine.
Relevance feedback can be used to modify the user's original query, or to
modify the document representations. Three common methods [Gudivada et
al 1997] of performing query modification are query term weight
modification, query expansion, and query splitting.

Query expansion can be performed using query-specific methods or
corpus-specific methods. In the query-specific approach, the top documents
retrieved in response to the user's query are analyzed, and new terms are
extracted from these top-matching documents. In a corpus-specific approach,
the query expansion set is manually constructed or automatically generated
(e.g. see [Gauch et al 1999]). The query is expanded based on the user's
final link selection (i.e. if the selection contains the expanded terms then the
expansion terms are associated with higher term weights). Document vector
modification involves moving the "relevant document vector" closer to the
input query [Bhuyan et al 1991].

3.3 Focussed Crawling

The sheer number of web documents on the WWW is overwhelming.
It is valid to question the feasibility of building a general-purpose crawler
and search engine that crawls and indexes information from over 350 million
pages. [Chakrabarti et al 1999] argue that the problem with this approach is
that such general-purpose systems try to cater to every possible query.

As an alternative, [Chakrabarti et al 1999] propose the notion of focused
crawling. A focused crawler is typically given an input query or sample
document, and crawls using this document/query to decide which pages to
crawl through; hence the term "focused crawling". A focused crawler has
three components (see Figure 18 earlier): a classifier, a distiller, and a
crawler with reconfigurable page crawl priority. The idea is for the classifier
to classify the crawled documents, and for the distiller to re-rank the importance
of the crawled pages. This is used to reorder the crawl priorities of the
unvisited pages, or to reconfigure the revisit priorities of nodes in the crawl
graph. The distiller essentially identifies good hubs for crawling, and thus
improves the rate of harvesting. Focused crawling presents itself as a useful
alternative to post-processing search query results, by searching effectively at
the crawling stage itself.

3.4 Dynamic Search

A dynamic search system is one that actually fetches documents from the
WWW in response to a query for relevance analysis. On the other hand, a
static search system uses a precompiled information repository and finds the
best match to a query. Example systems implementing dynamic search are
FishSearch [De Bra et al 1994] and WebGlimpse [Manber et al 1997].

FishSearch was designed on the hypothesis that relevant documents often
have relevant neighbors, and expanded the search using the retrieved
documents. WebGlimpse defined a 'hop' parameter that allows users to
dynamically search sub-areas within that depth. Fetuccino [Ben-Shaul et al
1999] is a system that augments static search with a dynamic search
component. The idea is to get a set of starting points from a static search
engine and then fetch documents that are connected to these sets of initial
documents. A two-phase system is also described in [Ben-Shaul et al 1999]
wherein the user specifies a domain term and a query term. The domain term
is used to identify the domain, and the query term is searched for within this
domain. The domain search is guided by the notion of central pages on the
web (i.e. hubs/authorities described in the next section). SharkSearch
[Hersovici et al 1998] improves on FishSearch by propagating relevancy
scores from parent to children with a decay factor, in addition to other
improvements such as smooth similarity scores, and utilization of meta
information contained in links.

4. RANKING

4.1 Ranking Links

It is often useful to rank the utility of links that are outgoing from a
particular document. For example, if we want to build an IR system that
retrieves medical web pages, then links to local bakeries from a user's
website are often uninformative. It is important to screen out the
uninformative links from the useful/relevant links.

Importance metrics are metrics computed in order to rank the URLs in


order of relevance/utility for the crawler to use. The basic philosophy is "not
every page is equal". Various techniques [Cho et al 1998][Kleinberg
1998][Najork & Wiener 2001] used in determining importance of links are
discussed in the following sections.

4.2 Textual Similarity

One approach to evaluating a document is to determine the textual


similarity between query and document. In this metric, the document vector
distance between the user query and the document is computed in order to
rank the importance of the document.

4.3 Back link count

This measures the number of documents that link into this document.
It is defined as the ratio of the number of links to this document divided by
the total number of URLs.

4.4 Page Rank

Based on scientific citation analysis, PageRank™ is a variant of the back


link count. In the back link metric, all the back links are treated equally. In
the page rank metric, the back link count of the parent document is used to
compute the page rank metric of the document in question. Thus, a recursive
procedure is used to determine the page rank of the document.

[Brin & Page 1998] describe the PageRank algorithm used in the Google
search engine, which utilizes the link structure of the Web to calculate a
quality ranking. The PageRank algorithm is motivated by the concept of
citation importance and is defined as follows [Brin & Page 1998]:

PR(A) = (1 − d) + d ( PR(T_1)/C(T_1) + ... + PR(T_n)/C(T_n) )

PR(A) refers to the PageRank of web document 'A'. T_1, ..., T_n are the
documents that link to document 'A'. C(A) is defined as the number of links
going out of document 'A'. More recently, [Henzinger et al 1999] have used the
PageRank function in conjunction with random walks on the web to estimate
the quality of web search engines.

What is the intuition behind PageRank? Assume a model where a user


starts clicking from one page to another by choosing one of the links in that
page, and always keeps going in the forward direction. In such a scenario,
the probability that the user visits a site is related to the PageRank function.
The main difference is the addition of the damping factor ('d') that is used to
bias a document or group of documents for added preference. Thus, if many
documents point to a page, then its page rank tends to be higher.
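
A minimal iterative sketch of this computation follows (the damping value,
iteration count, and the assumption that every page has at least one
out-link are ours):

def pagerank(out_links, d=0.85, iterations=50):
    """out_links: dict mapping page -> list of pages it links to.
    Assumes every page has at least one outgoing link."""
    pages = list(out_links)
    pr = {p: 1.0 for p in pages}
    # Precompute the pages linking *into* each page.
    in_links = {p: [] for p in pages}
    for p, targets in out_links.items():
        for t in targets:
            in_links.setdefault(t, []).append(p)
    for _ in range(iterations):
        new_pr = {}
        for p in pages:
            total = sum(pr[q] / len(out_links[q])
                        for q in in_links.get(p, []))
            new_pr[p] = (1 - d) + d * total
        pr = new_pr
    return pr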

4.5 Location metric

The location metric uses the URL itself as an indicator of the importance
of the document, not the contents. Thus a crawler searching for professors'
homepages might just look for "*.edu" URLs and assign them better
values.

4.6 Hubs and Authorities

In the previous section, we have been trying to incorporate the notion of


page importance on the basis of the crawl graph structure. In this section, we
describe more general notions of useful nodes: hubs and authorities.

Authorities refer to web pages that contain useful information relevant to


a particular broad topic. Thus, the web site of "w3.org" is a good authority
page about the WWW. Hubs, on the other hand, are defined as web sites that
link to a lot of useful information, i.e. reference many authorities. Thus, sites
such as resource lists on particular topics are good hubs. The concept of hubs
and authorities may be viewed as a form of "social structure" [Gibson et al
1998] exhibited in the evolution of the WWW.

• The HITS Algorithm

How do we determine whether a site is a hub or an authority or neither?


Clearly, authorities are highly referenced pages about a specific topic; hubs
link to highly referenced pages about a specific topic. Thus, hubs and
authorities exhibit a mutually reinforcing relationship. Many algorithms for
finding "useful" nodes in web graphs use graph theoretic measures, and
incoming and outgoing link counts. The Hyperlink Induced Topic Search
(HITS) [Kleinberg 1998] algorithm for determining hubs and authorities is
briefly summarized below:

Table 29. HITS Algorithm


HITS algorithm
1. Using query, find root set of documents
2. Expand root set with pages pointing to and from the root set pages.
3. Associate with each page, a hub weight h(s), and an authority weight a(s)
4. Repeat steps 5-6 till convergence
5. Update using the equations shown below.
6. Normalize a and h
7. Pick the set of pages with the top h and a scores as hubs and authorities respectively.

h(s) = Σ_{s→q} a(q)    (sum over the pages q that s points to)

a(s) = Σ_{q→s} h(q)    (sum over the pages q that point to s)
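
A minimal sketch of these iterations, normalizing the scores to sum to one
as in the traces below (the graph is assumed closed, i.e. all link targets
are themselves nodes of the graph):

def hits(links, iterations=20):
    """links: dict mapping node -> list of nodes it points to."""
    nodes = list(links)
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Authority of n: sum of hub scores of nodes pointing to n.
        new_auth = {n: sum(hub[q] for q in nodes if n in links[q])
                    for n in nodes}
        # Hub of n: sum of authority scores of nodes n points to.
        new_hub = {n: sum(new_auth[q] for q in links[n]) for n in nodes}
        # Normalize so each set of scores sums to one.
        a_sum = sum(new_auth.values()) or 1.0
        h_sum = sum(new_hub.values()) or 1.0
        auth = {n: v / a_sum for n, v in new_auth.items()}
        hub = {n: v / h_sum for n, v in new_hub.items()}
    return hub, auth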

The HITS algorithm is illustrated on the graph shown in Figure 20.

Figure 20. Graph Structure used to illustrate the HITS algorithm

Table 30 and Table 31 trace the values of the authority and hub scores
respectively, for these nodes at different iterations of the HITS algorithm.
After a few iterations, the system converges to the values shown in the last
row of Table 30 and Table 31. It can be seen that node w2 has the highest
authority score, while node w0 has the highest hub score. This is intuitive
from Figure 20 since node w2 has the maximum incoming arcs. w0 has 2
outgoing links (same as node w2), but the nodes it points to have better
authority than any other node.

Table 30. Authority scores for iterations of the HITS Algorithm

Iteration   w0     w1     w2     w3     w4
1           0.20   0.20   0.20   0.20   0.20
2           0.14   0.29   0.43   0.00   0.14
3           0.09   0.36   0.36   0.00   0.18
4           0.04   0.35   0.48   0.00   0.13
17          0.00   0.37   0.50   0.00   0.13

Table 31. Hub scores for iterations of the HITS Algorithm

Iteration   w0     w1     w2     w3     w4
1           0.20   0.20   0.20   0.20   0.20
2           0.29   0.14   0.29   0.14   0.14
3           0.33   0.07   0.20   0.20   0.20
4           0.35   0.04   0.26   0.17   0.17
17          0.37   0.00   0.21   0.21   0.21

It can be shown that the HITS procedure converges to stable authority
and hub values. [Kleinberg 1998] has shown that these equilibrium weights
correspond to the principal eigenvectors of the matrices A^T A and A A^T
(under certain assumptions). Here, A is the adjacency matrix of the web graph.

5. WEB DIRECTORIES

Web directories such as Yahoo! organize Web sites into hierarchical
subject categories. The basic directory structure is compiled by a group of expert
"surfers" who decide whether a site should be in the directory, and what
categories it should fall into. Thus, manual directory creation and
maintenance requires extensive effort. The URLs to be listed in the directory
are either collected by crawlers/robots or submitted manually for addition.

Another approach is human-assisted directory generation. In such a
case, the taxonomy is pre-determined by an expert taxonomist or the
directory manager. Certain keywords can be associated with each category,
and the system classifies the document as possibly belonging to a number of
categories. The surfer then uses an editing tool, guided by this information,
to add the document to the appropriate location in the category hierarchy.

Yet another variant of semi-supervised directory generation is automatic


classification based on a priori classified documents. In this approach, the
taxonomy is predefined, and a set of sample documents are assigned to each
category, or leaf of the directory structure. Newly submitted documents are

matched using document cluster distance metrics such as the TF-IDF


measure discussed in an earlier chapter in order to potentially assign
documents to one or more of those categories.

Figure 21 shows an example of such a semi-automatic directory
generator. In this case, the human expert predefines the whole taxonomy. At
the root of the taxonomy tree, a "prototype document vector" can be
constructed for matching with the actual web documents. The classifier can
be a soft classifier that allows documents to belong to multiple classes, or a
hard classifier that enforces that every document must belong to only a
single class. A class could be a specific category such as
"travel:USA:Florida:theme parks". Since the number of documents is
generally huge, the vector space representation of the documents is clustered
into a number of classes, and the whole cluster of documents assigned to that
matching leaf in the taxonomy hierarchy tree.

Figure 21. Web Directory - fixed taxonomy, but automatic classification.

An alternate approach is a semi-supervised approach to hierarchical


taxonomy generation. In this case, a human expert specifies broad
categories. The web documents are classified and assigned to each leaf of
this "broad category tree". Next, the unsupervised classifier (see later chapter
on "Web Mining" for more descriptions of classifiers) can automatically
split these sets of web documents into separate clusters, and determine the
terms that allow discrimination between the split classes. This process can be

repeated until a split cannot be achieved with high confidence. For instance,
the broad category at a leaf node of the taxonomy could be "four
wheelers:cars". The system can then cluster the Web pages in this category
and automatically determine, for instance, that the car type (e.g. luxury,
mid-size etc.) differentiates a number of the pages, and thus add the category
"type" to the taxonomy tree.


Figure 22. Example of Semi-Automatic Taxonomy Generation.

The semi-supervised approach where a broad taxonomy is specified and


the system allowed to automatically derive further classes lower down in the
taxonomy hierarchy can be summarized as shown in Figure 22. Further
description of decision trees for classification can be found in the next
chapter on Web mining. Although seemingly an attractive solution,
automatic techniques for taxonomy generation still have a long way to go.
The problem of automatically building a hierarchical taxonomy is not only
applicable to general web directory construction, but also to other
applications such as a directory for listing shopping items, or job/real-estate
listings.

[Chekuri et al 1997] discuss a system where classifiers are trained using


the manual classification provided by Yahoo!. Once the initial document
vectors are constructed, other documents are classified using this classifier.
Finally, a user can specify both the category and the query in order to
retrieve documents from that specific category. [Chakrabarti et al 1998]
present an interesting approach to scalable feature selection during the
process of directory generation. The TAPER system that they developed
separates "feature" and "noise" terms at every node in the taxonomy. They
do this by computing discriminant scores for each feature, and classifying
documents using only the feature terms. The feature terms are found by
using a variation of Fisher's discriminant method. They compute a figure of
merit, called Fisher index, that is the ratio of the between-class scatter to the
within-class scatter:

Fisher(t) = [ Σ_{c1,c2} ( μ(c1,t) − μ(c2,t) )² ] / [ Σ_c (1/|c|) Σ_{d∈c} ( x(d,t) − μ(c,t) )² ]

Here μ(c,t) is the mean value of term 't' over all the documents within a
class 'c'. c1 and c2 are two classes, and x(d,t) is the value of term 't' in a
document 'd' from the training data. The TAPER [Chakrabarti et al 1998] system has
been applied successfully to train a taxonomy using 266,000 web documents
from 2118 Yahoo! classes.
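
A minimal sketch of this figure of merit for a single term (the function
name is illustrative; documents are assumed to be dicts of term values, and
classes lists of such documents):

def fisher_index(term, classes):
    """classes: list of document lists; each document is a dict term->value."""
    means = []
    for docs in classes:
        means.append(sum(d.get(term, 0.0) for d in docs) / len(docs))
    # Between-class scatter: squared differences of class means.
    between = sum((m1 - m2) ** 2
                  for i, m1 in enumerate(means)
                  for m2 in means[i + 1:])
    # Within-class scatter, averaged within each class.
    within = sum(sum((d.get(term, 0.0) - m) ** 2 for d in docs) / len(docs)
                 for docs, m in zip(classes, means))
    return between / within if within else 0.0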

Directories also support a directory search feature. In this case, each
category is represented by an index representation of the documents that
belong to that category. In response to a query, the list of nodes that are
closest to the query is examined to decide which categories to display to
the user.

6. CONCLUSION

In this chapter, we discussed the design of web directories and search
engines. Directories can be generated using an expert defined taxonomy, or
can be semi-automatic. Various types of web search engines have been
summarized, including meta-search and focused search engines. Meta-search
allows combining the information from multiple search engines for a given
query. Dynamic search can allow the search engine to retrieve and analyse

documents relevant to a search query. The notions of relevance feedback and
ranking were also presented. While the basic algorithms are derived from
classic IR techniques, a variety of issues that make the web search and
directory problem difficult were presented.

The overview of a web crawler architecture has been discussed in this


chapter. The various components of a web crawler include URL queue,
crawl graph manager, and node reordering module. One of the primary
difficulties of crawling the WWW is the sheer number of documents (over a
billion), and ranking of these documents once retrieved. The notion of "a
social network" is useful in the analysis of web graph structures to determine
authoritative pages and hubs. The idea of "focused crawling" allows the
crawler to crawl efficiently and retrieve documents relevant to a particular
topic.

FURTHER READING

[Brin & Page 1998] is a good tutorial paper on building the Google Web
search system. [Bharat & Broder 1998] discuss methods of measuring
overlap and sizes of Web search engines. Focussed crawling, topic
distillation, and feature selection are discussed in [Chakrabarti et al
1999], [Bharat & Henzinger 1998], [Chakrabarti et al 1998a], [Chekuri et al
1997] and [Chakrabarti et al 1998b]. [Cho et al 1998], [Miller & Bharat
1998], [Aggarwal et al 2001], [Edwards et al 2001] and [Najork & Wiener
2001] discuss methods of URL crawling. Recent variants of dynamic search
are discussed in [Ben-Shaul et al 1999], and meta-search in [Dreilinger &
Howe 1997], [Wu et al 2001]. Some challenges in web search evaluation are
presented in [Hawking et al 1999]. Web graph analysis is covered in [Bharat
et al 1998], [Gibson et al 1998], [Henzinger et al 1999], [SciAmer 1999],
[Kleinberg 1998], and [Lawrence & Giles 1999]. [WWW
SearchEngineWatch] is a good online site that is updated with useful
information on Web search and directories.

EXERCISES

1) Design and construct a Web crawler that starts off a set of seed URLs
and iteratively retrieves documents.

2) Study different techniques for choosing URLs to crawl such as


breadth-first, depth-first, and pattern matching on URL (e.g. same domain
first). Evaluate which method seems to be more efficient and retrieves "more
pertinent" documents.

3) Design a pre-processing phase for the indexer which parses and


"cleans up" the HTML pages that are retrieved by the crawler.

4) Index the documents retrieved in (3) using forward and backward
indices. What are the issues in index creation?

5) Search component design: In response to a query, retrieve the closest


matching URLs. Rank these with different ranking schemes and contrast
results.

6) Identify a set of categories and build a directory to index the first 20


documents retrieved by the crawler.

7) For the following graph connectivity, apply the Hub/Authority and


PageRank algorithms. Contrast results of the two approaches.

Figure 23. Web Graph for exercise (7).

8) Contrast the Web Search and Directory approaches to Web access.

9) Implement a dynamic search system that augments the URLs
retrieved in (5) by doing a limited crawl from the result URLs retrieved in
(5) and re-ranking. Discuss the advantages and disadvantages of such an
approach. Contrast the approach with a meta-search system.

10) How can relevance feedback be used to improve the quality of
search?
Chapter 6
Web Mining

In the mountain, stillness surges up to explore its own height; in the lake, movement stands still
to contemplate its own depth

Abstract: One of the tremendous advantages of the Web is the vast amount of
information logged about user accesses. This information can be mined to
gain deep insights into the Web site, its usage, and visitation trends and
other patterns. The basic techniques in data mining such as association
mining, classification, clustering and sequence analysis are first presented.
Applications of Web mining including server log mining, link analysis,
user trend analysis, collaborative filtering for recommendation and
adaptive web site organization are covered in this chapter.

Keywords: Association rules, classification, regression, clustering, data mining, Web


log analysis, click-stream analysis, user data analysis, recommendation
systems, collaborative filtering


1. INTRODUCTION

Millions use the Web to search for information, communicate with each
other, purchase goods or access personalized data. This results in billions of
page views - a valuable source of information: information about users'
interests, about what users want, and about what users search for. Such
information can provide deep insights on how users' interests match with
each other, what they find appealing in a site, or even trends and personalized
preferences. The task of Web mining is to use this huge amount of
information in a useful manner to extract trends, and predict user preferences.

It is important to note that a crucial aspect in the discussion of Web
mining is privacy. Users enter their personal information such as address,
and preferences with implicit and explicit expectations. Any use of this
information is subject to the user's approval, such as a privacy agreement
that the user accepts when entering the site (see P3P discussed earlier). Even
implicitly tracked information such as cookies, and web sites that the user
goes to, are subject to such restrictions. Furthermore, it is important to protect
such data from falling into the wrong hands or misuse by malicious
people.

2. DATA MINING

The collection and analysis of data for the purpose of determining trends
and customer preferences is not new. This problem arises in many
retail business and finance stock market forecasting applications. More
recently, the area of data mining has focussed on mining large amounts of
data stored in huge databases, a field also termed "Knowledge Discovery
in Databases" (KDD). The focus of data mining is to extract patterns from
data. The seven steps involved in KDD are:

a) Data selection/Sampling
b) Preprocessing of Data
c) Transformation/Reduction of data
d) Data Mining

e) Generation of Patterns and Models
f) Evaluation
g) Visualization

The first step is to collect data and store it in a database. The next step is
selection of relevant data to do the analysis on. Often it may be useful to
sample a representative subset of data in order to quickly estimate trends.
The raw data generally contains errors and invalid field entries. A cleanup
process called "pre-processing" needs to be done. Data mining tasks
generally involve high-dimensional data: it is possible to reduce the
dimensionality by considering subsets of features relevant to the mining task
at hand. The next step is the actual processing and mining of data in order to
generate patterns and models. Theoretically, it is possible to have an infinite
number of models with the same data, so these patterns and models are
evaluated using certain criteria to determine the utility of the derived models.
The final step is the visualization of the data using these models to see if they
provide further insights into the nature or patterns in the data. In traditional
Online Analytical Processing systems, data was stored in databases, and
queries in languages such as SQL are used to generate reports and extract
patterns in the data. Data mining, on the other hand, often uses automatic
algorithmic approaches in order to extract patterns in the data.

3. ASSOCIATION MINING

3.1 Goals of Association Mining

One of the classic problems in data mining is association mining.
Association mining is the determination of associations or relationships
between items. The classic problem is the "market-basket" analysis. In the
market basket analysis, the set of items that are purchased together in a single
market basket are collected. Using this collection of data, data mining can
provide marketers methods of improving cross-sell of items based on
relations between items.

The following are the common goals for "market-basket" mining:

1. Association Rules

Association rules are of the form:

{A, B, C} → Y

What this means is that if we find items A, B, and C in a market
basket, then we have a good chance of finding item Y. An
example of such a rule may be:

{ bread, banana } → milk

Meaning of rule: "People who buy bread and bananas also


buy milk". One can imagine a number of such rules that can be
extracted from a huge collection of data. For instance, if a
number of baskets contain milk, then this could result in a
number of such rules, which do not necessarily hold any
significance. In order to quantify the "goodness" of a rule, the
probability of finding the predicted item (Y in our example) is
used as a measure of "confidence" of the rule.

Confidence C for the rule X → Y is defined as:

C = Count( X ∪ Y ) / Count( X )

Confidence of the rule is defined as the ratio of the number of
basket occurrences of X and Y together over the number of occurrences of
X alone. It can be seen that C is the conditional probability of Y
given X, denoted as C = P( Y | X ).

2. Causality
Causality refers to the notion that the presence of items X causes item
Y to be present. This notion is best illustrated using the "diaper-
beer" example commonly cited in data mining. Let us
assume the causality "diaper causes beer", i.e. diaper buyers
are likely to pick up beer. Promoting a sale on diapers will
increase diaper buyers visiting the store, and in turn buying beer.
Increasing the price of beer will then lead to improved profits.

3. Frequent Itemsets

It is not hard to see that the problem of counting itemsets grows
exponentially with the number of items. If there are n baskets and
m items, the number of possible association rules is:

O( m · 2^(m−1) )

since each of the m items can appear as the consequent of a rule
whose antecedent is a subset of the remaining items. The computational
complexity of naively counting all such candidates over the n baskets
is therefore exponential in m.

We mentioned that each rule is associated with a "confidence".
Another metric is the "support" for an association rule. Support is
the joint probability of all the items in the association rule. For
the rule X → Y, the support is defined as follows:

Support S = Count( X ∪ Y ) / T = P( X, Y )

where T = total number of transactions. A short sketch computing
both measures is shown after this list.
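
A minimal sketch computing support and confidence for a candidate rule over
a list of baskets (the function name and sample data are illustrative):

def support_and_confidence(baskets, antecedent, consequent):
    """baskets: list of sets of items; rule: antecedent -> consequent."""
    x = set(antecedent)
    xy = x | set(consequent)
    count_x = sum(1 for b in baskets if x <= b)
    count_xy = sum(1 for b in baskets if xy <= b)
    support = count_xy / len(baskets)
    confidence = count_xy / count_x if count_x else 0.0
    return support, confidence

baskets = [{"bread", "banana", "milk"}, {"bread", "milk"}, {"banana"}]
print(support_and_confidence(baskets, {"bread"}, {"milk"}))  # (0.67, 1.0)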

3.2 Algorithms for Association Mining

Association mining proceeds in two steps:


a) Extract itemsets with a support that exceeds a minimum support
threshold.
b) Use the itemsets to generate association rules.

A Priori Algorithm

The two key observations used for extracting itemsets efficiently are:
• If {A, B} has support 's', then both A and B must each have at least
support 's'.
• If A or B has support less than 's', then {A, B} must have support
less than 's'.

The first observation states that if the combined set {A, B} has a support
of at least 's', then the subsets {A} and {B} must also have a support of at
least 's'. The second observation states that if itemsets {A} or {B} have a
support of less than 's', then the combined set {A, B} must have support less
than 's'. These rules are very useful in reducing the number of itemsets
considered, in addition to providing an iterative scheme for generating the
itemsets. The A Priori algorithm [Srikant & Agrawal 1995] proceeds in the
following manner:

a) Find all itemsets of size one that have a minimum support 's'. (Set L{1})
b) Pairs of items from set L{1} become candidates for set L{2}. Again
compute and threshold to retain itemsets that have a minimum support
's' to generate L{2}.
c) Iterate over itemsets of length k, by combining itemsets from iteration
k−1, and retain those with minimum support 's' as L{k}.
d) Proceed up to the k needed or till the set L{k} is empty.

One of the problems is that the number of baskets is usually very large,
and the number of items is also large. This results in a large number of
itemsets of size 2, making it difficult to retain the pairs of items in main
memory. [Park et al 1995] propose an extension whereby pairs of items are
hashed into a table in the first pass. The idea behind the PCY algorithm is as
follows:
a) In the first pass, compute the frequent size-1 itemsets. Furthermore,
hash the pairs of items found in each basket into a hash table and
increment counts. Convert this hash table into a bitmap indicating a
'1' for a frequent item pair, and '0' otherwise.
b) In the second pass, the hash table bitmap is loaded into memory. For
all the pairs of items from size-1 frequent itemsets, check to see if the
hash table bitmap indicates '1', and if so create an entry and count
occurrences.

3.3 Example

Let us now illustrate the A Priori algorithm with some examples. Consider
the set of baskets shown in Table 32.

Table 32. Sample data to illustrate association mining.

Basket    Items
1         {A, C}
2         {C, D}
3         {A, B, C}
4         {B, E}
5         {B, C, D}
6         {A, B, C, E}

Let us assume that the support threshold is 50%.

The first step is to extract itemsets of size 1 with a minimum support of 50%
(i.e. a count threshold of 3):
{A} : support = 3/6 = 0.5
{B} : support = 4/6 = 0.67
{C} : support = 5/6 = 0.83
{D} : support = 2/6 = 0.33
{E} : support = 2/6 = 0.33

Thus, itemsets {D} and {E} are discarded since they did not pass the support
threshold. This leaves us with the following set of itemsets of size 1:
{A}, {B}, {C}
Next, we combine these itemsets to form possible itemsets of size 2:
{A, B} : support = 2/6 = 0.33
{A, C} : support = 3/6 = 0.50
{B, C} : support = 3/6 = 0.50

Thus, the sets {A, C}, and {B, C} are frequent itemsets of size 2 and so on.
It is clear that using the minimum support item subsets in order to determine
itemsets of a higher order enables the elimination of a number of low support
candidates.
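
A minimal sketch of the A Priori iteration; applied to the baskets of Table
32 with a 50% support threshold, it reproduces the itemsets derived above
(candidate generation here is simplified to unions of frequent sets):

from itertools import combinations

def apriori(baskets, min_support):
    """baskets: list of sets of items. Returns {frozenset: support}."""
    n = len(baskets)
    items = {i for b in baskets for i in b}
    current = {frozenset([i]) for i in items}  # size-1 candidates
    frequent = {}
    k = 1
    while current:
        # Count the support of each candidate itemset.
        counts = {c: sum(1 for b in baskets if c <= b) for c in current}
        level = {c: cnt / n for c, cnt in counts.items()
                 if cnt / n >= min_support}
        frequent.update(level)
        # Candidates of size k+1 from unions of frequent size-k sets.
        keys = list(level)
        current = {a | b for a, b in combinations(keys, 2)
                   if len(a | b) == k + 1}
        k += 1
    return frequent

baskets = [{'A','C'}, {'C','D'}, {'A','B','C'}, {'B','E'},
           {'B','C','D'}, {'A','B','C','E'}]
print(apriori(baskets, 0.5))
# {A}: 0.5, {B}: 0.67, {C}: 0.83, {A,C}: 0.5, {B,C}: 0.5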

4. PREDICTIVE MODELLING

4.1 Goal of Predictive Modelling

The goal of predictive modelling is the estimation of some fields in a


database using other fields. Predictive modelling can be viewed as two types
of problems: regression and classification. In the regression problem, the
goal is to estimate the missing values using the remaining fields. In the

classification problem, the objective is to categorize the data into a specific


category.

The general problem of prediction can be stated as follows:

Given a set of features, each data point can be viewed as a point in this
feature space. This is generally called the input space (which we denote as
X). The training data consists of a corresponding set of output values or
categorical classifications (which we denote as V). The goal of predictive
modelling is to estimate a function / that maps the input space vectors to the
output space vectors Y. This estimation is done using a finite sample of
training data, and a key issue is that of generalization. The function/may fit
the sampled data points very well, but may not be reflective of the general
classification patterns exhibited by the data samples. This is referred to as
"overfitting"s. There are numerous techniques that try to overcome this
problem such as validating classification accuracy on a separate set of data
not included in the training called "cross-validation". For the classification
problem, the function / assigns the input vector into one of a number of
disjoint sets or classes. In the case of classification, the sampled data points
are assigned one of two classes A or B. The classification function has the
following form:

f(x) = 1 if x ∈ A
f(x) = 0 if x ∈ B

4.2 Techniques for Predictive Modelling

Linear Classifiers

The goal of a linear classifier is to estimate a hyper-plane that divides the
set of sampled data points belonging to one class from another class. The
linear classifier is defined as a linear combination of the input vector
elements. Thus, if an input data point x has the features c_1, c_2, ..., c_n,
then the classifier has the form:

f(x) = a_0 + a_1 c_1 + a_2 c_2 + ... + a_n c_n

5 Overfitting indicates that the model fits the data samples too well!

With a set of data points x_1, x_2, and so on, this results in a set of linear
equations that can be solved to estimate the values of the weighting
coefficients.

Linear classifiers are capable of solving only linearly separable
classification problems. In the remaining part of this section, we will discuss
non-linear classifiers that enable separation and classification of data that are
non-linearly mapped to classes.

In classical pattern recognition, linear discriminant functions have been
extensively researched and applied for many decades. Fisher's discriminant
[Fisher 1936] is a classic work on the origin of discriminant functions.
Fisher's discriminant is defined to be the ratio of the between-class scatter
matrix [Duda & Hart 1973] and the within-class scatter matrix.
Optimization of this metric can be formulated as a generalized eigenvalue
problem, and the resulting plane is a linear separator between the classes. An
important aspect of Fisher's discriminant function is that the classification
problem is viewed as a dimensionality reduction problem.

Regression

Regression is the technique used to estimate the function f in order to
predict a continuous valued function. Regression estimates the function f as a
combination of a set of predefined functions g, called the basis functions:

f̂(x) = w_1 g_1(x) + w_2 g_2(x) + ... + w_n g_n(x)

The sampled data points x_1, x_2, x_3, ..., x_M are used to estimate the
coefficients w, and the problem can be posed as the linear system:

A w = b

where A_ij = g_j(x_i) and b_i = f(x_i).

The coefficients w need to be estimated based on the "distance" between the
estimated function f̂(x) and f(x) on the sampled data. One distance
measure used commonly is the 2-norm. In this case, the problem of
regression becomes:

Minimize ||w||_2 subject to A w = b

In practice, the possibility that f cannot be sampled precisely is handled
by adding measurement-error terms.

Neural Networks

Neural networks are mathematical formulations of computational learning


motivated by findings in neuroscience. The brain is comprised of billions of
nerve cells or neurons, which are connected to each other. These neurons
transmit signals from one cell to the next selectively. It is such a complex
interaction between each individual neuron that enables the brain to learn and
perform complex tasks.

In 1943, McCulloch and Pitts proposed a simple model of the nerve cell
as a threshold unit, which forms the basis of many neural network models
today. Each neuron has a set of inputs, which are individually weighted,
combined linearly, and thresholded in order to decide whether to "fire" or
not. A "firing" neuron essentially sends a signal of "1" to the neurons it is
connected to, and a "non-firing" neuron sends a signal of "0" to the neurons
its output is connected to. This process is repeated over all the neurons in
the neural network, and a final classification result is determined at the
"output" neurons. Thus the input neurons take input, the neural network
transforms this input, and the output neurons determine what the resulting
output from the network is. Each single neuron is similar to the linear

classifier with the threshold. Figure 24 shows an example of a single layer


neural network.


Figure 24. Single layer neural network.

Combinations of layers of neurons, whereby the input is processed by
neurons at one layer and the output of the neurons at one layer is passed
on to the next layer, are called "multi-layer" neural networks. A multi-layer
neural network consists of three types of neurons: "input neurons" where
input values are fed in, "hidden layers" which are only for internal
processing, and an "output layer" which determines what the output generated
by the multi-layered neural network is.

Many algorithms for training the parameters of a neural network have
been proposed. Many algorithms use the "least squares" method in
order to estimate the parameters. The first step is to define an error function:
a quantitative measure of the error in classification of the network based on
the input-output pairs of the training data. The next step is to modify the
parameters of the neural network in order to reduce the "error" of the neural
network. A common solution is to use a gradient descent technique, which uses
the derivative of the error function to determine by how much to adjust the
individual weights/parameters of the neural network. This process of
measurement of error, and adjustment of the parameters of the network in
order to reduce the error, is called the training algorithm.

The error is estimated by using the output that is expected (based on the
training data), and the output that is generated by the network. Thus, in a
single layer network, the error of the output from corresponding inputs is
used to modify the weights of the linear network. In the case of a multi-layer
network, there are internal "hidden layers" that do not have any output
specified in the training data. How do we then modify the weights in the
hidden layers based on the final output? One popular algorithm for
determining the weights in a multi-layered neural network is the "back
propagation algorithm" [Rumelhart et al 1987]. The key concept is the
propagation of the error in the output layer back to the internal layers,
where it serves as an "error function" used to adapt the hidden-layer
parameters.
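
As a drastically simplified stand-in for full back propagation, the following
sketch shows one gradient-descent update for a single sigmoid neuron under a
squared-error function (all names and the learning rate are illustrative):

import math

def sgd_step(weights, bias, x, target, lr=0.1):
    """One gradient-descent update for a single sigmoid neuron."""
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    out = 1.0 / (1.0 + math.exp(-z))  # sigmoid activation
    # Derivative of the squared error 0.5*(out-target)^2 w.r.t. z.
    delta = (out - target) * out * (1.0 - out)
    new_weights = [w - lr * delta * xi for w, xi in zip(weights, x)]
    new_bias = bias - lr * delta
    return new_weights, new_bias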

Support Vector Machines

In the earlier discussion, we have referred to parameter estimation by the
optimization of a cost or error function. It is often possible to fit the training
data perfectly, but not "generalize" on the test data. For any set of training
data, there are theoretically an infinite number of models that can "fit" the
training data. However, only a subset of the models "generalize" on the test
data. Support Vector Machines (SVMs) incorporate a "structural risk
minimization" term into the optimization criteria, which minimizes an upper
bound on the generalization error. In the case of linearly separable sets, SVMs
choose the linear plane such that the margin separating the two classes is
maximized. Margin is defined as the distance between the two points from
either class that are closest to the dividing plane.

Decision Trees

Decision trees partition the training data recursively until each partition
entirely (or within a threshold) belongs to a particular class. Thus decision
trees are tools for classification and prediction. The decision tree consists of a
tree model where each node is either an internal "decision node" (split-point)
or a leaf "classification node". Classification proceeds by walking down the
decision tree using the unclassified data till a leaf node is reached, at which
point the predicted class of the data sample is known. Decisions are often
rule based and can be arbitrary functions of the data sample. Figure 25 shows
an example of a decision tree.

Figure 25. Example of a decision tree for classification.

Decision tree classifiers (e.g. CART [Breiman et al 1984]) are generally
built in two phases:
• Growth Phase
• Pruning Phase

Most approaches to decision tree construction work top-down, applying
greedy search through the possible space of decision trees. One of the earliest
decision tree algorithms is ID3 [Quinlan 1976]. The ID3 algorithm selects
attributes that are most discriminant of the different classes, and iteratively
adds these attributes as nodes in the tree. This procedure is repeated until the
classification is accurate on all the training samples, or specified limits are
reached (e.g. on the depth of the tree). Thus, the growth phase consists of
recursively partitioning the data till the partition of the training data in each
of the leaf nodes belongs entirely (or within a required fraction) to a single
class. The procedure starts with all the data points in the root node. This data
needs to be partitioned based on a rule on a particular attribute (or set of

attributes). The two central issues in decision tree construction are the
sequence of attribute selections, and the decision rule threshold constraints
on the selected attributes. A common approach is to define an optimisation
metric that drives the choice of decision attributes, such as entropy. Another
metric for the determination of the split points is based on the 'gini' index.
The gini index for a dataset S with 'n' classes is defined as follows:

gini(S) = 1 − Σ_j p_j²

where p_j is the relative frequency of class 'j' in S.

Now if a split divides S into two subsets S1 and S2, then the gini index of
the divided data can be computed using:

gini_split(S) = (|S1|/|S|) gini(S1) + (|S2|/|S|) gini(S2)

The attribute that minimizes the gini index is chosen as the split point.
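
A minimal sketch of choosing between splits by this criterion (the boolean
test, echoing the "A > 1?" node of Figure 25, is illustrative):

def gini(labels):
    """gini(S) = 1 - sum of squared class frequencies."""
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def gini_split(rows, labels, test):
    """Weighted gini of the two subsets produced by a boolean test."""
    left = [l for r, l in zip(rows, labels) if test(r)]
    right = [l for r, l in zip(rows, labels) if not test(r)]
    n = len(labels)
    return (len(left) / n) * gini(left) + (len(right) / n) * gini(right)

rows = [{'A': 0.5}, {'A': 2.0}, {'A': 3.0}, {'A': 0.1}]
labels = ['no', 'yes', 'yes', 'no']
print(gini_split(rows, labels, lambda r: r['A'] > 1))  # 0.0: a perfect split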

Since the tree is constructed in a manner to classify the training data
correctly, the risk that the decision tree model might over-fit the training
data, and not generalize properly on the test data, is an important concern. A
common approach to this problem is to terminate early (before all the
training data is perfectly classified), or to prune the tree. This is the second
phase, called the "pruning phase". The goal of pruning is to remove any
random variation or noise exhibited by the training data which is not
meaningful to the classification function. One criterion that has been applied
to pruning is the Minimum Description Length (MDL) [Rissanen 1989],
used to minimize the decision tree description length.

Statistical measures of the significance of splitting a node into a decision
node have been fairly successful in deciding whether to continue growing the
tree at a particular node or not. Cross-validation data can also be used to stop
the decision tree construction. How do we handle data with missing
variables? Since the missing variable may be at a decision node, it is needed
in order to traverse the tree and classify the data with the missing values. A
common solution is to predict the missing value using the other features in
the data vector. Decision trees are fairly fast for classification, since the class
is determined by a sequence of decisions made during the traversal of the tree
from the root node to a leaf node. However, for large datasets, construction
of the tree should be efficiently achieved since multiple passes over the
dataset are needed.

4.3 Collaborative Filtering

In the classic collaborative filtering problem, we have a set of items, and
a group of experts or "judges". Each judge has rated a subset of the items with
some value; thus, some of the items are not rated by some of the judges. The
goal of collaborative systems is to use this information to estimate a judge's
rating of an unrated item, based on the similarity of this judge's ratings to
other judges' ratings.

Formally, if we have 'J' judges and 'I' items, then the ratings can be
viewed as a J x I matrix of rating values, where some of the ratings are
missing. An example of an algorithm to solve this problem is the SM
algorithm [Shardanand & Maes 1995]. The SM algorithm uses a linear
combination of observed ratings, weighing similar judges higher. An
example of such a rating matrix is shown in Table 33. The '.' in the table
indicates that the item has not been rated by the user. The ratings in the table
range from 0 to 5, 5 being the highest rating, and 0 the poorest rating.

Table 33. Sample Web site ratings table

User/Item   1   2   3   4   5
A           4   .   .   5   0
B           4   3   3   4   1
C           0   0   .   5   0
D           .   5   .   .   5
E           0   .   .   5   5

The goal of collaborative filtering is to use this combined information to
estimate the missing ratings. For instance, on articles 1, 4, and 5, users A and
B seem to have similar ratings. Thus, on the hypothesis that users who have
similar ratings on a subset of items will also have similar ratings on the
whole population of items, we can estimate the rating of article 2 by A to be
3. More formally, there are many techniques that can be applied to do the
estimation. The SM algorithm predicts missing ratings as a linear

combination of observed ratings, weighing similar judges more than
dissimilar ones. The SM predictor is as follows:

R̂_ij = Σ_{j' ≠ j} w_jj' R_ij'

where R_ij is the rating of item i by judge j, and w_jj' is a measure of
similarity in preferences between judges j and j'. The similarity can be a
mean-constrained correlation function:

w_jj' = Σ_i (R_ij − N)(R_ij' − N) / √( Σ_i (R_ij − N)² · Σ_i (R_ij' − N)² )

N is the neutral rating value, usually the midpoint of the rating scale.
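
A minimal sketch of this style of predictor (None marks a missing rating;
unlike the unnormalized form above, the sketch divides by the total similarity
weight so that predictions stay on the rating scale):

import math

def similarity(ra, rb, neutral=2.5):
    """Mean-constrained correlation over items rated by both judges."""
    common = [i for i in ra if ra[i] is not None and rb.get(i) is not None]
    num = sum((ra[i] - neutral) * (rb[i] - neutral) for i in common)
    da = sum((ra[i] - neutral) ** 2 for i in common)
    db = sum((rb[i] - neutral) ** 2 for i in common)
    return num / math.sqrt(da * db) if da and db else 0.0

def predict(ratings, judge, item, neutral=2.5):
    """Weighted combination of the other judges' ratings of the item.
    ratings: dict of judge -> dict of item -> rating or None."""
    total, weight = 0.0, 0.0
    for other, r in ratings.items():
        if other == judge or r.get(item) is None:
            continue
        w = similarity(ratings[judge], r, neutral)
        total += w * r[item]
        weight += abs(w)
    return total / weight if weight else neutral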

Many other techniques exist for estimation of missing ratings such as a


Bayesian model [Chen & George 2000], and dependency networks
[Heckerman et al 2000].

4.4 Example of linear classifier

Let us consider a two dimensional example of season prediction. We have
two classes, "Summer" and "Winter". The input variables are "sunny" and
"heat": sunny is set to a value of 1 if fully sunny, and a value of 0 if fully
cloudy, and heat is set to 1 if over 70F, and 0 if less than 0F. Both
variables are graded in the interval 0 to 1. A set of sample data points is
shown in Table 34.

Table 34. Sample data to illustrate classification.

Data point   S (Sunny)   H (Heat)   Class    f(x)
x1           0           0          Winter   0
x2           0.2         0.3        Winter   0
x3           1           1          Summer   1

It can be seen that winter is mapped to the classification value of 0 and
summer is mapped to the value 1. The problem is now to build a classifier
that is able to predict the season based on the variables "sunny" and "heat".

The linear classifier is defined as follows:

f(x) = a0 + a1*S + a2*H

Thus, for the first data point x1, the equation would be:
f(x1) = a0 + 0*a1 + 0*a2 = 0
or
a0 = 0
From the second data point x2, we have:
f(x2) = a0 + 0.2*a1 + 0.3*a2 = 0
or
0.2*a1 + 0.3*a2 = 0
From the third data point x3, we have:
f(x3) = a0 + 1*a1 + 1*a2 = 1
or
a1 + a2 = 1
Solving the equations
0.2*a1 + 0.3*a2 = 0
and
a1 + a2 = 1,
we have
a0 = 0; a1 = 3; a2 = -2
Thus, the linear classifier determined using the data samples has the
following form:
f(x) = 3S - 2H
If we had another sample data point (S=0.6, H=0.7), then f(x) = 0.4
For (S=0.9, H=0.7), f(x) = 1.3

In practice, the actual value is "thresholded" to determine the class.
For instance, if f(x) > 0.5 then the class is "Summer", and if f(x) <= 0.5, then
it is "Winter". Such a function is called the "step-function" and is non-linear.

Let's visualize the linear classifier that we modelled using the sample
data points. The linear classifier in our example is a plane that slices the
three-dimensional plot of "Sunny (S)" versus "Heat (H)" versus "f(x)" on the
third axis. Figure 26 shows a graph of "Sunny" versus "Heat"; the line
shown is the intersection of the linear classifier at f(x) equal to 0.5. It can be
seen that the top left side of the line represents values where f(x) is greater
than 0.5, and the lower right part represents values of f(x) less than 0.5. This
classifier maps the top left part to the class "Summer" and the bottom right
part to "Winter". The three data points are represented as stars and belong to
the appropriately labelled classes. An important aspect to note in this
example is that even though the sample data points map to the right classes,
the actual linear classifier is not necessarily "general". It may be noted that
the classifier essentially uses "Sunny" in a positive sense to weight
towards class "Summer", while the contribution of "Heat" is weighted
negatively. Intuitively, both should be weighted positively, such that lower
values of "Sunny" and "Heat" map to class "Winter" and vice-versa.


Figure 26. Example of a Linear Classifier.



5. CLUSTERING

5.1 What is Clustering?

In many cases, "classified" data is not available. For instance, in some


cases, data from a group of consumers is available without any knowledge or
categorization of the user groups. In such a case, the system should be able to
automatically "group" together similar data and this model is used to make
decisions about other similar candidates.

This approach is also called "unsupervised" learning since the training


data is not "classified" or "labelled" with the right answers/classes. A
common technique used for unsupervised learning is clustering. Clustering
algorithms essentially group the sampled training data points into groups or
clusters. An example of clustering is shown in Figure 27. Each star in the
graph represents a data point. As an illustration, each circle represents a
cluster, and the points within a cluster are labelled as belonging to that
cluster. In general, any new data point (such as the "+" shown in the figure)
is classified based on which cluster it belongs to.

Figure 27. Illustration of clustering



5.2 Techniques

The goal of clustering is to identify a number of cluster models that
represent the sample training data effectively. In many cases, each cluster is
represented by a centroid vector that is representative of that cluster. Data
samples are assigned to the cluster whose centroid is closest to that data
sample. The "distance" measure has a number of variants ranging from the
L1-norm and L2-norm to the maximum distance over all dimensions, and so on.
Probabilistic models use a distance function that is representative of the
probability of that data point belonging to that cluster.

Thus, if X = {x1, x2, x3, ..., xn} is a data sample, and
Y = {y1, y2, y3, ..., yn} is a prototype vector for a cluster, then the
following are some examples of distance measures:

L1-norm = Σ_{i=1..n} | x_i − y_i |

L2-norm = √( Σ_{i=1..n} ( x_i − y_i )² )

Although distance measures usually satisfy certain constraints such as the
symmetric property and the triangle inequality, any distance measure may be
defined for a specific problem. In many cases, practical considerations
dictate the best distance measure to use. For instance, for very high
dimension problems, it may be computationally expensive to compute the
common Euclidean distance (i.e. L2-norm).

5.3 Partition Methods


Partition methods [Han et al 2001] cluster the training data samples into a
required number of clusters (say 'k'). Each cluster is represented by a
"prototype vector". Initially, the prototype vectors may be assigned randomly
or using a subset of the data samples. The clustering procedure then assigns
the set of training data points to the clusters based on the closest
prototype. This is followed by a re-estimation of the prototype vectors based
on the new cluster assignments. These steps are iterated until the cluster
assignments do not change, or a minimum error measure is reached. The
variations in this approach include using different distance measures,
assignment of initial prototypes, and adjustment of the prototypes.

• K-Means Method

The K-means [MacQueen 1967] is a classic approach to clustering. The


error function minimized by the k-means algorithm is the squared error
function over all the data points. The squared error is the sum of the squared
Euclidean distance from each data sample to its assigned nearest cluster
centroid.

Steps in a k-means algorithm are as follows:


a) Arbitrarily choose k cluster centers as initial values.
b) While total error criteria not met
c) Recompute assignments of data points to the cluster centers
d) Update the cluster centers based on the new cluster assignments.
e) End of While loop

A problem with the k-means algorithm is that it often finds a local
optimum of the cumulative error. Furthermore, k-means is sensitive
to noise and to data points that lie at the boundary of clusters (called outliers).
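
A minimal sketch of these steps with squared Euclidean distance (the naive
initialization from the first k points is illustrative; the sample data are
those of the example in Section 5.6):

def kmeans(points, k, iterations=20):
    """points: list of numeric tuples. Returns (centroids, assignments)."""
    centroids = list(points[:k])  # naive initialization from the data
    for _ in range(iterations):
        # Step (c): assign each point to its nearest centroid.
        assign = [min(range(k),
                      key=lambda c: sum((p - q) ** 2
                                        for p, q in zip(pt, centroids[c])))
                  for pt in points]
        # Step (d): recompute each centroid as the mean of its members.
        for c in range(k):
            members = [pt for pt, a in zip(points, assign) if a == c]
            if members:
                centroids[c] = tuple(sum(dim) / len(members)
                                     for dim in zip(*members))
    return centroids, assign

pts = [(0, 0), (1, 1), (0, 1), (7, 6), (8, 9)]
print(kmeans(pts, 2))  # centroids converge near (0.33, 0.67) and (7.5, 7.5)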

• k-medoids

In the k-means algorithm, each cluster is represented by the mean of the
data samples assigned to that cluster (based on the distance measure). There
are many variations possible in how the cluster is represented. In the
k-medoids approach, the most centrally located data sample which belongs to
that cluster is itself used as a representative of the cluster. This approach
alleviates the problems due to noise or outliers that are common in the
k-means approach. Variations include updating only one cluster medoid per
iteration, and subsampling the data to determine if changing the medoid
helps (see [Han et al 2001] for examples).

• BFR

The BFR algorithm is based on k-means and tries to estimate the mean and
deviation along each dimension of a normally distributed cluster model. In
the BFR model, a cluster consists of a Discard Set, which is the core set of
data points that belong to that cluster and are used to determine the centroid
and standard deviation model for that cluster. The next part is the
"Compression set", which consists of the data points that are close to each
other, but far from any cluster's centroid. These Compression sets are also
modelled by a centroid and a variance. Finally, data points that do not belong
to either the Discard Set or a Compression set are called the Retained Set and
kept in main memory.

The BFR algorithm proceeds to iteratively estimate these clusters. The
key idea is to take all points within a threshold (such as four times the standard
deviation of the cluster), and use the number of points in the different
clusters to estimate if the data point can "move" to another cluster. The
statistics of all the subclusters (such as the sum of the squares of coordinates of
points in each dimension) are stored so that computation of the centroid and
variance can be done quickly and efficiently without much computation.

Another approach, the Expectation-Maximization approach to clustering,
uses the EM algorithm [Dempster et al 1977] in order to estimate Gaussian
representations of each cluster. Thus, the assignment of a data point to a cluster
is defined [Han et al 2001] as a conditional probability of the data point given
the cluster it belongs to.

5.4 Hierarchical Methods

Rather than start with a fixed set of cluster centroids, hierarchical


methods of clustering create a hierarchical decomposition of the data set.
There are two ways to do this: top-down or bottom-up. In the top-down
approach, all data points start off in one cluster. Each cluster is then
recursively split conditionally until a "distance function" that measures the
sparsity of the cluster reaches a required limit, or the desired number of
clusters has been extracted. In the bottom-up (agglomerative) approach,
each data point is a cluster at the beginning. The procedure then recursively
merges clusters that are "close together" into one cluster till a criterion is met.

• GRGPF

The GRGPF algorithm is based on an R-tree model to store clusters, and
assumes a non-Euclidean space. Each cluster is stored in a leaf block of the
R-tree, and contains features such as the number of points in the cluster, the
centroid of the cluster (as determined by minimum rowsum), and sets of the
closest and farthest points to the centroid. The interior nodes keep samples of
clusters that represent the descendants of that node. The clustering algorithm
then defines splitting and merging criteria based on functions of the rowsums,
which allow the R-tree to be updated.

• CURE: Clustering Using Representatives

CURE is an agglomerative clustering algorithm that uses enhancements
to represent clusters better. The first enhancement is the use of multiple
centroids per cluster, and the second enhancement is the shrinking of the
cluster centers by a specified fraction. In essence, CURE is midway
between representing the cluster by all the data points and representing the
cluster by a single centroid. Thus, CURE is more robust to outliers, and to
clusters that are intrinsically non-spherical.

• CHAMELEON

CHAMELEON uses graph methods in order to perform agglomerative
clustering. Initially, a graph is constructed with the data points, and the edge
between two data points is weighted using the "distance" between them.
Graph partitioning techniques (such as min-cut) are used recursively to
partition the graph into sub-graphs which represent clusters. CHAMELEON
improves over CURE in the metrics used to evaluate whether two clusters
should be merged together. In particular, CHAMELEON measures the inter-
connectivity and closeness of the clusters before and after a merge, and uses
these criteria to determine whether to merge the two clusters into one cluster
or not.

5.5 Other Approaches

• Density Based Approaches

The clustering algorithms that we have discussed so far use some form of
distance metric in order to partition into clusters. While this is powerful,
other interesting approaches to clustering are based on "density". The set of
data points can be viewed as a space of "dense regions" and "sparse regions".
Algorithms such as DBSCAN [Ester et al 1996] count the number of data
points that reside within a particular region, and the region is classified as
"dense" or "sparse" based on this count. OPTICS [Ankerst et al 1999]
introduces the notion of cluster ordering for automatic cluster analysis. Both
methods require a spatial index structure like the R*-tree [Beckmann et al 1990].
Density based approaches are more suitable for low dimensional data
clustering.

• Grid-Based Methods

In grid based approaches, the data vector space is quantized into a finite
number of cells which form a grid data structure. This reduces the
complexity of clustering over density based methods. Some examples of
grid-based approaches are STING [Wang et al 1997], WaveCluster
[Sheikholeslami et al 1998], and CLIQUE [Agrawal et al 1998]. STING is a
hierarchical grid-based approach which computes statistical information
within each grid cell, that is collectively used to determine good clusters.
WaveCluster uses a wavelet transformation to identify dense regions in the
transformed space. CLIQUE integrates density-based and grid-based
clustering, and applies the A priori property from association mining: if a region
is dense in k dimensions, then it must be dense in its (k−1)-dimension
projections.

In some problems, it may be necessary to impose constraints on clusters.
The typical approach used in constraint-based clustering is to formulate the
constraints as a part of the error/distance function that is being optimised.
Enforcement of rigid conditions (such as operational constraints) leads to
computationally expensive steps in clustering.

• FastMap

Distance computation is one of the core functions performed in clustering
algorithms. If there are n points, the complexity of computing all pairwise
distances is O(n*n). The idea behind FastMap is to treat 'k' pairs of points
as pseudo-axes, and project the remaining points onto these 'k' axes in
O(nk). The distance between any two points can then be approximated
using the projected distances along these axes. The FastMap algorithm
proceeds by computing the 'k' projections for each point by using 'k' pairs
of points as axes. A "current distance" is used to determine which points to
project on, and this current distance is adjusted after each projection. One
application of FastMap to clustering is the projection of the data to 'k'
dimensions, followed by clustering in this reduced space.
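
As a concrete illustration, here is a minimal sketch of the FastMap
projection just described; the pivot-selection heuristic and all names are
our own illustrative choices, and a production implementation would cache
distances rather than recompute them:

import math

def fastmap(n, dist, k):
    """Project n points into k dimensions using the FastMap heuristic.
    dist(i, j) gives the original distance; the "current distance" is
    adjusted by subtracting the components already projected."""
    coords = [[0.0] * k for _ in range(n)]

    def current_dist2(i, j, dim):
        d2 = dist(i, j) ** 2
        for m in range(dim):
            d2 -= (coords[i][m] - coords[j][m]) ** 2
        return max(d2, 0.0)

    for dim in range(k):
        # Pivot heuristic: start anywhere, twice take the farthest point.
        a = 0
        b = max(range(n), key=lambda j: current_dist2(a, j, dim))
        a = max(range(n), key=lambda j: current_dist2(b, j, dim))
        dab2 = current_dist2(a, b, dim)
        if dab2 == 0.0:
            break                             # remaining distances are zero
        dab = math.sqrt(dab2)
        for i in range(n):                    # project onto the (a, b) axis
            coords[i][dim] = (current_dist2(a, i, dim) + dab2
                              - current_dist2(b, i, dim)) / (2.0 * dab)
    return coords

pts = [(0, 0), (1, 1), (0, 1), (7, 6), (8, 9)]
print(fastmap(len(pts), lambda i, j: math.dist(pts[i], pts[j]), k=2))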

5.6 Example

Let's illustrate the k-means clustering procedure using a simple example.
The data set consists of 5 points as shown in Table 35. The data used in
this example are two-dimensional, and we want to cluster the data into two
clusters. The centroids of the two clusters are represented by m1 and m2,
and are initialised to the values shown in Table 35.

Table 35. Data Samples to illustrate clustering

Data Point    x-dimension    y-dimension
1             0              0
2             1              1
3             0              1
4             7              6
5             8              9
m1            5              2
m2            7              7

The first step is to assign the data points as belonging to cluster 1 (mean
m1) or cluster 2 (mean m2). For instance, the squared Euclidean distance of
data point 1 to centroid m1 = 5*5 + 2*2 = 29. The squared Euclidean
distance from data point 1 to centroid m2 = 7*7 + 7*7 = 98. Since 29 < 98,
data point 1 is assigned to cluster 1. After the cluster assignments are made,
the means of the data points in each cluster become the new centroids. In our

example, after the first iteration, the assignments of the data are shown in
Table 36.

Table 36. Assignment of data samples to clusters after first iteration.

Data Sample    Closest Cluster
1              1
2              1
3              1
4              2
5              2

Using the information in Table 36, the new centroid values are computed
to be:

m1* = (0.33, 0.67)
m2* = (7.5, 7.5)

This is shown in Figure 28. It can be seen that the centroids have
"moved" towards the two groups of points in the data, separating the two
clusters. The "+" symbols in the figure are data samples, and the stars are
the cluster centroids.

[Figure: data samples plotted as "+", with centroids *m1 and *m2 separating Cluster 1 and Cluster 2]
Figure 28. Clustered data samples and the two centroids.
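
The iteration above can be reproduced with a short k-means sketch in
Python; the function and variable names are ours, and the data and initial
centroids are those of Table 35:

def kmeans(points, centroids, iterations):
    """Assign each point to the nearest centroid, then recompute each
    centroid as the mean of its assigned points."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            d2 = [sum((a - b) ** 2 for a, b in zip(p, m)) for m in centroids]
            clusters[d2.index(min(d2))].append(p)
        centroids = [tuple(sum(c) / len(c) for c in zip(*members))
                     if members else m
                     for members, m in zip(clusters, centroids)]
    return centroids

data = [(0, 0), (1, 1), (0, 1), (7, 6), (8, 9)]       # Table 35
m1, m2 = kmeans(data, [(5, 2), (7, 7)], iterations=1)
print(m1, m2)   # approximately (0.33, 0.67) and (7.5, 7.5), as in the text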



6. OTHER DATA MINING PROBLEMS

Other interesting problems in data mining include sequence matching and
episode mining. In sequence matching, given a set of data sequences S and
a query sequence X, we need to determine the set of sequences from S that
are "close" to X. Applications of sequence matching vary from prediction
to gene matching. One approach to the sequence matching problem is
dynamic search techniques such as Dynamic Time Warping (DTW)
algorithms. In Dynamic Time Warping, a cost function guides the dynamic
alignment search between two sequences. Another approach is to use the
Fourier transform of the sequences, and match the first few coefficients of
the Fourier expansion in order to determine "close" sequences. In the case
of sub-sequence search, a sliding window is used on the sequence data, and
the first-order Fourier coefficients are stored for each sub-sequence. An
improvement over this is to store Fourier coefficients of rectangular
windows (called 'trails') over which the high-order Fourier coefficients do
not vary much.
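
A minimal sketch of Dynamic Time Warping makes the dynamic
alignment search concrete; the cost function and the sample sequences are
illustrative:

def dtw(x, y, cost=lambda a, b: abs(a - b)):
    """D[i][j] is the cheapest alignment cost of x[:i] against y[:j];
    each cell extends the best of three neighbouring alignments."""
    inf = float("inf")
    n, m = len(x), len(y)
    D = [[inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i][j] = cost(x[i - 1], y[j - 1]) + min(
                D[i - 1][j],        # x advances, y stalls
                D[i][j - 1],        # y advances, x stalls
                D[i - 1][j - 1])    # both advance
    return D[n][m]

print(dtw([1, 2, 3, 4], [1, 2, 2, 3, 4]))   # 0.0: sequences align perfectly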
Another problem of interest in data mining is "episode mining". In the
episode model, data samples are in the form of a temporal sequence with
the type of each event specified. The goal of episode mining is to identify
patterns in sequences of events. The problem can be viewed as association
mining in the temporal domain: episodes are defined as frequent based on a
support threshold over the number of occurrences. Similar to the Apriori
property, if an episode E is frequent, then any sub-episode formed by
deleting events from E is also frequent. Episodes can be parallel, serial, or
a combination of both. Parallel episodes indicate that the events occur at
roughly the same time, with no relevant temporal ordering. Serial episodes
imply that one event occurs after the other, and combinations can model
complex event sequences.

7. EXAMPLES OF WEB MINING

There are numerous applications of the data mining techniques discussed
in the earlier sections to Web data. Web data is anything that is stored or
collected from an operation on the Web. Examples of Web data include
data entries users submit through the Web, log files containing URLs
requested by users from a Web server, or even the personalized shopping
history of users.

Again, it is important to note that privacy and security must be respected,
and the mining of any personalized data collected is subject to the user's
approval (e.g. as agreed upon in the terms and conditions for accessing a
Web site).

7.1 Server Log Analysis

The most basic form of Web analysis is examining server logs in order to
determine site visitation statistics. There are numerous statistics that can be
extracted from Web server logs, including how many visitors visited the
site, the temporal pattern of requests, and so on. This information can be
further analysed on the basis of client IP address, cookie information, and
field value submissions (such as search keywords). Effectively, Web log
analysis provides information to understand what is happening on a Web
site, such as which pages are most popular. There are three terms
commonly used in Web mining: hits, pageviews and visits. Hits refer to the
actual number of file requests to the Web server, pageviews correspond to
the number of (HTML) page requests, and visits refer to the number of
unique IP addresses that visited a group of Web sites within a window of
time. These are related, but have important differences. A single page
request can result in multiple hits since the page could contain other files
in it.
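
As a sketch of these definitions, the following Python fragment tallies
hits, pageviews, and unique client IPs from an access log in the common
log format; the regular expression and the page-request heuristic (HTML
files and directory requests) are simplifying assumptions:

import re
from collections import Counter

LOG_LINE = re.compile(r'^(\S+) \S+ \S+ \[[^\]]+\] "(?:GET|POST) (\S+)')

def analyse(log_lines):
    hits, pageviews, ips = 0, 0, Counter()
    for line in log_lines:
        m = LOG_LINE.match(line)
        if not m:
            continue
        ip, path = m.groups()
        hits += 1                                   # every file request is a hit
        if path.endswith((".html", ".htm", "/")):   # page requests only
            pageviews += 1
        ips[ip] += 1
    return hits, pageviews, ips

sample = [
    '10.0.0.1 - - [01/Jan/2002:10:00:00 +0000] "GET /index.html HTTP/1.0" 200 1043',
    '10.0.0.1 - - [01/Jan/2002:10:00:01 +0000] "GET /logo.gif HTTP/1.0" 200 512',
    '10.0.0.2 - - [01/Jan/2002:10:00:05 +0000] "GET /news.html HTTP/1.0" 200 2048',
]
hits, pageviews, ips = analyse(sample)
print(hits, pageviews, len(ips))   # 3 hits, 2 pageviews, 2 unique IPs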

While Web server log analysis provides a lot of useful information, Web
mining techniques combine Web server logs with other information to
extract further insights. One important example is "click-through", which
tells us how often visitors have clicked on a particular online link. An
example is an image advertisement posted on the site: when a user clicks
on that image, the request is logged on the server, and the page
corresponding to that advertisement is served. Such requests can be
analysed to determine the actual "click-through" that each site generates,
which is very useful to advertisers and possibly correlates with the Return
on Investment (ROI) for the Web site. Another example is geographic
analysis, such as determining what fraction of users' requests came from
different demographics (such as countries). An important practical issue is
the generation of statistical reports based on a collection of logs from
multiple servers supporting a site.

7.2 Click-stream Analysis

Another common use of data mining techniques is "click-stream analysis".
Click-streams refer to the sequence of Web sites that the user visited.
These may be stored on the client side, or a subset may be stored on the
server side for a set of sites. A user may be identified using the client IP
address, or a cookie generated for that session with the client. There are
many uses of click-stream data, such as analysis of the sites the user has
visited to generate a "dynamic interest profile". This interest profile can be
used to recommend other sites to the user. Another use of click-stream
analysis is the study of user navigation and the redesign of Web sites to
improve user experience. While it is difficult to quantify "user experience",
measures such as the number of clicks that the user made to reach a highly
visited page can be used to redesign the Web site.

7.3 Link Prediction

Another application of click-stream analysis is "link prediction". The
click-stream logs (stored on servers serving a Web site) can be used
collectively to generate models of sequences of user navigation behavior.
Using these models built with the collected data, the navigation history of
a particular user can be used to "predict" which URL the user may want to
go to next. In the case of proxy servers, this can be useful in prefetching
Web documents ahead of the actual request. It can also be used to
recommend links to the user, making it easy for the user to access sites that
are commonly interesting to users with similar link sequences. Numerous
techniques can be used for link prediction, ranging from sequence analysis
using prefix trees to probabilistic predictors (such as Markov chains
[Sarukkai 2000]).
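
For instance, a first-order Markov chain predictor over click-streams can
be sketched in a few lines; the class name and the sample sessions are
illustrative, and a real system would estimate smoothed probabilities from
far larger logs:

from collections import Counter, defaultdict

class MarkovLinkPredictor:
    """Counts URL-to-URL transitions in observed sessions and predicts
    the most likely next URL given the current one."""

    def __init__(self):
        self.transitions = defaultdict(Counter)

    def train(self, sessions):
        for session in sessions:                 # one click-stream per session
            for cur, nxt in zip(session, session[1:]):
                self.transitions[cur][nxt] += 1

    def predict(self, current_url):
        counts = self.transitions.get(current_url)
        if not counts:
            return None
        return counts.most_common(1)[0][0]       # most frequent successor

sessions = [["/home", "/sports", "/scores"],
            ["/home", "/sports", "/news"],
            ["/home", "/finance"]]
model = MarkovLinkPredictor()
model.train(sessions)
print(model.predict("/home"))   # "/sports" (chosen in 2 of 3 sessions)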

Furthermore, such techniques can be used for agent-assisted browsing (e.g.
[Cheung et al 1997], [Shahabi et al 1997]), where the system suggests links
that the user can follow during the process of browsing. A second approach
is that of tour generation (e.g. [Joachims et al 1997]), wherein the system
generates a tour which takes the user from one link to another. [Wexelblat
& Maes 1999] describe the Footprints system, which provides a metaphor
of travellers creating footpaths which other travellers can use.

7.4 Adaptive Web Sites

Server logs and click-trails of user accesses enable us to study the
effectiveness and the User Interface (UI) design of Web sites. This
information can be used to improve a Web site's organization in a number
of ways, such as reorganizing the layout of the site, adding new pages
dynamically, changing the formatting and presentation, and even
customizing the Web site for each user. Web sites [Garofalakis et al 1999]
[Perkowitz et al 1999] can be modified based either on the content of the
pages or on the pattern of users' accesses. [Srikant & Yang 2001] discuss
an approach for automatically discovering pages in a Web site whose
location is different from users' expectations.

7.5 User Analysis

Another important dimension in Web mining is user-based mining. In
many personalized systems, we have information about the user's age,
gender, demographics, interests, and so on. In the analysis of Web logs and
click-streams, it is possible to integrate such user information to mine
useful patterns from the data. For example, it may be useful to know what
proportion of users in the age range 20-30 years visit a particular site, or
what the gender distribution of users visiting the site is. Is there an
association between the categories of items users purchase and their
interests stored in the user database? Mining trends by combining user
information with access patterns is a very powerful mechanism for
improving user experience, doing targeted advertising, and adding ROI to
sites.

7.6 Recommendation Systems

Recommendation systems play an important role in Web commerce,
navigation, and other areas on the Web. With a vast amount of information
accessible through the Web, it is difficult for a user to spend time and find
exactly what they want. In a manner similar to the dynamic Web page
adaptation discussed above, it is important to identify potentially
interesting items or information for the user. This can translate either into
continued stickiness (or loyalty of the user to the brand), or even into
direct revenue (e.g. when a user buys more goods).

Recommendation systems serve the purpose of suggesting items of
interest to the user. The goal of such a recommendation is to increase the
sale of items in addition to the ones the user selects, much in the manner
that a salesman might sell lamps and decorative items to customers
purchasing furniture. Such recommendation can be either personalized or
product-centric. In the case of a personalized recommendation system, the
user's shopping basket or previous purchase/browsing history is analysed,
combined with a large amount of user data, and used to generate other
items that this specific user or group of users may be interested in. This
information can be presented back to the user in order to increase the
revenue due to additional purchases. In the case of product-centric
approaches, analysis is performed over products that are "associated" with
each other in some way. "Association" could represent the frequency with
which customers purchase both items together. Such analysis can also be
combined with personalized recommendation systems in order to present
items that the user is likely to purchase. The end goal of recommendation
systems is three-fold: conversion of browsers into buyers, increased
cross-sell, and the continued building of a loyal user base with such
value-added features.

Some of the key difficulties of applying collaborative filtering and
association mining related approaches to building Web recommendation
systems are:
a) Users are not willing to rate a set of items unless they are convinced of
its utility to them. Thus, it is difficult to "bootstrap" the recommendation
process automatically.
b) The quality of recommendation varies greatly depending on the area of
focus. For instance, there has been success in limited areas such as book
recommendation systems: the purchase or browsing history of books that
users bought is used to form association rules between clusters of books,
and this information is used to recommend books to the user based on the
association mining. Another example is collaborative filtering applied to a
limited set of items that a significant number of users have rated, such as
currently playing movies. In such cases, recommendation systems can
potentially add value to the user. On the other hand, if a general shopping
recommendation system needs to be built, the need for a large amount of
"useful association data" or "ratings", the large number and variety of
items, and the difficulty in ranking/presentation of the recommended items
pose challenges.

c) High dimensionality of data sets: another important issue in the
application of the techniques discussed to Web recommendation is the
sheer amount of data and the large number of dimensions. This also results
in sparsity of the training data.

The most common approach to reducing dimensionality is the application
of the Karhunen-Loeve (KL) transformation or singular value
decomposition to project the high-dimensional vector space to a
lower-dimensional space. This is often more suitable since noisy features
are filtered out, and the lower-dimensional space is better suited for
clustering procedures. [Goldberg et al 2000] discuss Eigentaste, which
applies principal component analysis (PCA) on the ratings matrix to
facilitate dimensionality reduction for offline clustering.
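
A minimal sketch of such a projection, using NumPy's singular value
decomposition on a small (users x items) ratings matrix; the data and the
column-centering step are illustrative choices:

import numpy as np

def project(ratings, k):
    """Project each user onto the top-k singular directions of the
    (users x items) ratings matrix, filtering out noisy dimensions."""
    centered = ratings - ratings.mean(axis=0)
    U, s, Vt = np.linalg.svd(centered, full_matrices=False)
    return U[:, :k] * s[:k]        # k-dimensional coordinates per user

ratings = np.array([[5., 4., 0., 1.],
                    [4., 5., 1., 0.],
                    [0., 1., 5., 4.],
                    [1., 0., 4., 5.]])
print(project(ratings, k=2))       # two clear user groups emerge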

Recommendation systems can be built using a variety of techniques.
These techniques are not exclusive, and are often combined with each other
to provide a successful solution.
• Association mining approaches to determine item relationships.
• The SM [Shardanand and Maes 1995] collaborative filtering
  algorithm, based on linear combinations of observed ratings that put
  more weight on similar judges, discussed earlier (a sketch appears
  after this list). There can be two modes of ratings: explicit, where the
  judge actually rates the items, and implicit, where we can assume an
  endorsement under certain circumstances (e.g. click-through
  indicates approval).
• A Bayesian approach [Chen & George 2001] that partitions the set of
  judges into disjoint groups, where each judge within a group shares
  the same probability distribution.
• Dependency networks [Heckerman et al 2000].
• Clustering approaches to determine "neighborhood formation", with
  recommendation generation using the neighborhood partitions.
  Proximity measures vary, but commonly include correlation or
  cosine distance.
• Combined approaches using decision trees seem to be more effective
  than other methods [Breese et al 1998].
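
As a sketch of the similarity-weighted idea referred to above, the
following simplified variant predicts a missing rating from the judges
positively correlated with the target user (the full SM algorithm also
exploits anti-correlated judges); all names and ratings are illustrative:

import math

def pearson(a, b):
    """Correlation over the items both users rated (None = not rated)."""
    common = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if len(common) < 2:
        return 0.0
    xs, ys = zip(*common)
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    num = sum((x - mx) * (y - my) for x, y in common)
    den = math.sqrt(sum((x - mx) ** 2 for x in xs) *
                    sum((y - my) ** 2 for y in ys))
    return num / den if den else 0.0

def predict(target, judges, item):
    """Average the judges' ratings of `item`, weighted by similarity."""
    num = den = 0.0
    for judge in judges:
        if judge[item] is None:
            continue
        w = max(pearson(target, judge), 0.0)   # trust only similar judges
        num += w * judge[item]
        den += w
    return num / den if den else None

target = [5, 1, None, 5]                     # rating for item 2 is missing
judges = [[4, 2, 5, 4], [1, 5, 4, 1], [5, 1, 4, 5]]
print(predict(target, judges, item=2))       # 4.5: follows the similar judges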
Another approach to improving recommendation systems is to integrate
content filtering with collaborative filtering. Content filtering refers to the
filtering of data presented to the user based on the content of the
documents. For instance, a user's interest profile can be used to identify
topics of interest, and only documents that match that topic profile are
presented (or highlighted) to the user. [Soboroff & Nicholas 1999] discuss
such an approach, combining LSI with collaborative filtering, with limited
success. WebWatcher [Joachims et al 1997] uses words from the document
the user is browsing to detect the topics of interest to the user, and
estimates link probabilities using a TF-IDF heuristic over the extracted
keywords. A second approach used by WebWatcher is based on
reinforcement learning, where each link is represented as a state in the
reinforcement learning state space, and the rewards correspond to the
TF-IDF measures. This illustrates a combination of filtering and
navigation.

7.7 Web Graph Mining

Analysis of Web sites, how important they are, how relevant they are, and
how they influence each other is an important area of study. We have
discussed methods such as PageRank™ and the Hub-Authority analysis
algorithm in an earlier chapter. An important notion is that the Web can be
viewed as a huge interconnected graph, with various patterns that can be
extracted and mined from this graph. Hub/Authority analysis is just one
such example.

7.8 Practical Issues

There are many practical considerations in the design and development of
Web mining systems. Firstly, the system must be designed to handle a very
large amount of data, often hundreds of gigabytes. Secondly, the Web
mining analysis may be required to produce results in real time, such as
live visitation statistics for a Web site. Thirdly, the dimensionality of the
problem domain is often very high. Techniques such as Singular Value
Decomposition or eigenvector projection methods are often used to project
the data to lower dimensions, with the mining algorithms applied in the
reduced-dimension space. Feature selection is also an important aspect of
Web mining, as is the proper correlation and use of Web log information
with personalized user information to infer user trends and behavior (e.g.
buying) patterns. Another important aspect is the storage and traversal of
the data. Since the amount of data that needs to be processed is huge, it is
important to develop algorithms that do not require the entire data set in
main memory. Some of the algorithms for association mining are devised
with such requirements in mind. Another aspect is the need for multiple
traversals of the data. Often, the data is distributed and the algorithms are
applied on subsets of the data. However, sub-division of the data often
results in loss of information, which needs to be corrected using heuristics.

8. CONCLUSION

Analysis of Web users and usage patterns is one of the most important
advantages of the Web framework. An inherent aspect of Web technology
is the ability to capture information about user accesses, and to mine this
information to derive a number of useful statistics, trends, and user models.
These models can be applied to various applications such as shopping
recommendation, popular-sites lists, click-through analysis, and detailed
information on the user population (ranging from demographics and
gender to interests and preferences). While the potential benefit of Web
mining is apparent in the sales cycle, Return on Investment improvement
for sites, cross-sell in commerce, and improved navigational interfaces,
privacy and protection of any sensitive data collected are of paramount
importance.

In this chapter, the various techniques of data mining were first presented.
Association mining techniques provide the ability to perform "market
basket" analysis and generate association rules with support and
confidence; various algorithms for association mining were discussed.
Classification and regression are also an important part of data mining and
have numerous Web mining applications. Methods of classification such
as linear classifiers, neural networks, and classification trees were covered.
Clustering is an important area of data mining applied to unlabelled data,
and techniques for clustering were presented with illustrations. Lastly,
sequence and episode mining were briefly discussed, followed by a
discussion of some Web mining applications.

FURTHER READING

[Fayyad et al 1996] and [Bradley et al 1998] are good reviews of data
mining techniques. Many of the techniques used for classification are
derived from machine learning algorithms. [Ballard 1999] is an excellent
textbook on "natural computation". Association mining techniques are
covered in [Srikant & Agarwal 1995] and [Srikant et al 1997]. Other
examples of work on Web mining are [Brin 1998] and [Srikant & Yang
2001]. [WWW Ullman] is a good online reference for data mining.
Recommendation systems and collaborative mining are extensively
researched and can be sampled in [Chen & George 2001], [Goldberg et al
2000], [Han et al 2001], [Heckerman et al 2000], [Karypis 2000], [Sarwar
et al 2000] and [Schafer et al 2001].

EXERCISES

1) An online bookseller has recorded the following set of transactions of
book purchases. Each alphanumeric code refers to a book code. Find
the association rules in these transactions, along with their confidences.

Table 37. List of Transactions

Transaction #    Items purchased
1                AX112, AB322, AC567
2                AB322, AC567, XX772
3                XX772, AX112, MG754
4                AX112, AC322, XX772, MG754

2) In problem (1), find all association rules with a support of at least
25%. Illustrate how the number of rules examined is reduced by the
application of the Apriori principle.

3) Table 38 below shows the data to be used for exercises (3) and (4).

Table 38. Classification Training data.

Data Vector        Class
[Age, Gender]      [Movie Chosen]
[18, M]            M1
[20, F]            M2
[35, M]            M2
[45, F]            M2
[60, F]            M2
[24, M]            M1
[55, F]            M2

Determine a linear classifier that fits the data shown as far as possible.

4) Generate a decision tree classifier that fits the data shown in Table 38.

5) Sample visitation data for a site is shown in Table 39 below.

Table 39. Sample clustering data for exercise 5.

[Age, GeoCode, Visited in last 3 months?]
[22, A, Y]
[18, A, Y]
[25, C, N]
[32, D, N]
[55, A, Y]
[45, A, N]
[33, D, N]
[22, D, N]
[46, D, N]

Cluster the data using k-means into two clusters. Map non-numeric values
to numeric values for the purpose of clustering (e.g. A=1, B=5, C=10,
etc.). Using the cluster centroids, describe the trends in the type and
location of users who visit the site, and whether they visited in the last
three months or not.

6) Mine a Web server log from one day to extract the following:
• Number of hits and pageviews
• Number of unique client IPs
• Distribution of requests across the different client IPs

7) Using the Web server log, sort the sequence of URLs requested by
client IP address and time. Extract three-URL sequences and threshold
them with a certain minimum count. See if these sequential patterns of
URL traversal hold for new requests.

8) The matrix below shows a set of items that are rated by a set of
experts.

Table 40. Collaborative Filtering data for exercise 8.

          Web Page 1    Web Page 2    Web Page 3    Web Page 4
User 1    Good          Bad           Not ranked    Good
User 2    Bad           Good          Good          Good
User 3    Good          Not ranked    Bad           Bad
User 4    Good          Good          Good          Good

Use the SM algorithm to estimate the missing ratings for the user data
shown below:
User 5 likes Web pages 1, 2, and 4. Missing ranking for page 3.
User 6 likes pages 3 and 4, but dislikes 1. Missing ranking for page 2.
Chapter 7

Messaging and Commerce

The best does not come alone. It comes with the company of the all.

Abstract: Communication and commerce applications are an important part of Web
applications. Mail has long been supported on the Internet, and standards such
as the Simple Mail Transfer Protocol (SMTP) were defined almost two
decades ago. However, with the advent of the Web, the ability to register for
and use these services has generated explosive adoption of such messaging
applications. Instant Messaging (IM) is another important application that is
discussed in this chapter. The second part of this chapter summarizes some of
the e-commerce issues and technologies available. A prototype e-commerce
architecture is presented, and issues such as billing, payment technologies, and
standardization efforts are explained.

Keywords: Electronic Mail, e-mail, instant messaging, IM, Simple Mail Transfer Protocol
(SMTP), Post Office Protocol (POP), Internet Message Access Protocol
(IMAP), e-commerce systems, billing, payment technologies.


1. INTRODUCTION

In the previous chapters, we discussed applications that portray the Web
as a medium for information access. The Internet and the Web also play an
important role as a medium of communication and commerce that brings
people together, and enables them to buy and sell using the Web.

E-mail and instant messaging (IM) are two compelling communication
applications used by millions around the world to communicate with each
other. The electronic mail protocols SMTP, POP, and IMAP are defined
and discussed. Next, instant messaging is chosen as an illustrative
application, primarily because it requires a scalable, distributed
architecture for real-time communication over the Internet. Finally, an
overview of the components of an electronic commerce platform
architecture is presented, and standard e-commerce frameworks are
summarized. Issues in the design and development of commerce
applications are covered towards the end of this chapter.

2. MESSAGING APPLICATIONS

Communication has been an integral aspect of the Internet since its
inception. Even in the early days of ARPAnet, email existed. The
rudimentary "talk" program on UNIX allowed users to talk to each other in
real time. With the popularity of the Web grew the widespread adoption of
communication applications such as e-mail, instant messaging, chat,
message boards, and so on. While the applications themselves can exist
without the Web, the Web provides an enhancement and better
accessibility to these services in a unified manner.

Standardization efforts have come a long way in some of these
applications. Mail protocols such as SMTP, IMAP4, and POP3 are
well-supported and are covered in the upcoming sections. Communication
is not limited to text, and other forms of communication such as voice chat
can be achieved using voice-over-IP technology. Emergent applications
such as video conferencing over the Web are compelling, although
bandwidth limitations hinder the ability to provide a scalable, high-quality,
high-resolution real-time solution.

3. ELECTRONIC MAIL PROTOCOLS

3.1 E-mail overview

Electronic mail existed before the Web. The ability to transfer electronic
messages from one computer in a network to another has existed for a few
decades now. What the Web did was make electronic mail (or e-mail)
accessible to a large number of users. E-mail Web applications allow users
to go to a Web site and register for their personalized e-mail address.
These sites also provide management tools to read, delete, and manage the
list of messages. With such simple and flexible access to registering and
using e-mail, the Web fuelled the large-scale adoption of e-mail.

Figure 29 shows an overview of an e-mail system. The user's computer
(client) connects to the e-mail Web server. A new user can register with the
e-mail system through the Web server, and this information is stored in the
user information database. The user can then "login" to their e-mail
account after being verified by the Web server. The Web server then
accesses the e-mail data from the message server(s) that have access to a
distributed mail storage database. The message server can forward
messages to another e-mail server (for outgoing e-mails) using protocols
such as the Simple Mail Transfer Protocol (SMTP). Other mail servers or
clients that want to directly access the mail for a particular account can do
so using protocols such as the Post Office Protocol (POP) or the Internet
Message Access Protocol (IMAP). Other components of a Web-based
e-mail system include an anti-spam module (which identifies and filters
unwanted e-mail from the incoming mail folder) and an address blocking
list.

[Figure: client connecting to an Email Web Server, backed by a Message Server and a Message Store]

Figure 29. Overview of an e-mail system.

3.2 SMTP

The Simple Mail Transfer Protocol (SMTP) is a mail transport and
delivery protocol. It is sometimes used as a "mail submission" protocol,
and serves that purpose frequently. The objective of SMTP is to transfer
mail reliably and efficiently. SMTP can work over any transmission
subsystem that provides a reliable ordered data stream channel, although
TCP is the commonly used transport layer. SMTP can also perform mail
relaying, where mail is transported across networks.

Whenever a client wants to send mail, it establishes a two-way
transmission channel to an SMTP server. The destination e-mail address is
used to determine which SMTP server to connect to: the final target host
or an intermediary mail exchanger host. The basic block of SMTP is a mail
object that consists of an envelope and content. The SMTP envelope
consists of an originator address, one or more recipient addresses, and
other extension information. The SMTP content is sent in two parts:
header and body.

An SMTP session is initiated when the client opens a connection to the
SMTP server, and the SMTP server responds with an opening message.
Upon receiving the welcoming message from the server, the client sends
the extended Hello (EHLO) command. The client's identity is sent in the
EHLO command. The next steps are the actual mail transactions. A mail
transaction starts with the MAIL command, which identifies the sender.
The RCPT command identifies the recipients, and multiple RCPT
commands can be specified in the case of multiple recipients. Lastly, the
DATA command initiates the transfer of mail data. The data is usually
terminated by an "end of mail" indicator, which also confirms the
transaction. The QUIT command may be sent by the client to complete the
session and close the connection. An example of a sequence of SMTP
commands is shown in Table 41.

Table 41. Example of an SMTP session.

Client=C / Server=S    SMTP command
S                      220 example.com Simple Mail Transfer Service Ready
C                      EHLO test.com
S                      250-example.com greets test.com
S                      250-8BITMIME
S                      250-SIZE
S                      250-DSN
S                      250 HELP
C                      MAIL FROM:<author@test.com>
S                      250 OK
C                      RCPT TO:<user@example.com>
S                      250 OK
C                      DATA
S                      354 Start mail input; end with <CRLF>.<CRLF>
C                      {actual message ...}
C                      .
S                      250 OK
C                      QUIT
S                      221 example.com Service closing
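
The same dialogue can be driven from Python's standard smtplib module,
which issues the MAIL FROM, RCPT TO, and DATA commands on the
client's behalf; the host and addresses below are the illustrative ones from
Table 41:

import smtplib
from email.message import EmailMessage

msg = EmailMessage()
msg["From"] = "author@test.com"
msg["To"] = "user@example.com"
msg["Subject"] = "Hello"
msg.set_content("This message follows the SMTP dialogue shown above.")

with smtplib.SMTP("example.com", 25) as server:
    server.ehlo("test.com")       # extended Hello identifying the client
    server.send_message(msg)      # MAIL FROM, RCPT TO, DATA
# QUIT is sent automatically when the "with" block exits.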

SMTP servers can also operate in a gateway or forwarding mode. In
order to deliver an e-mail message reliably, SMTP clients must provide
timeout mechanisms, and SMTP servers should have retry strategies. As
such, security is weak in SMTP; real security can be achieved only using
end-to-end methods such as digital signatures.

3.3 POP3

The Post Office Protocol (POP) is a simple protocol that allows a system
to retrieve mail stored on mail servers (such as SMTP servers that support
POP). POP limits its functionality to the retrieval and deletion of
messages, and does not support extensive manipulation operations. We
will discuss POP version 3 (POP3) in the following paragraphs.

The POP server listens for requests on port 110. A client that wishes to
make use of the service makes a TCP connection to the server. A greeting
message is sent by the POP server to the client to start the session. The
next step in the session is authorization, wherein the client sends
identification information to the server. Once the client is successfully
identified, the server acquires the resources associated with the client's
mailbox. The client then proceeds by requesting actions, and the server
performs the actions and returns responses. Once the required series of
actions is completed, the client can send a "quit" command and the
connection is closed. In some cases, such as an auto-logout timeout, the
connection can be aborted.

• Authorization
Different mechanisms of authorization are allowed in POP3, including
the USER/PASS commands and the APOP command. In the USER/PASS
scenario, the client must first issue the USER command. If the mailbox
exists and allows plain-text password authentication, then the server
returns a positive response. The PASS command is next issued by the
client to complete the authorization process. In the APOP command, the
argument includes a string identifying the mailbox and an MD5 digest
string. The digest string is computed using a timestamp issued to the client
by the server and a secret string shared by the client and the server. The
reason for the APOP method is to avoid sending the password in the clear
on the network.

• Transactions
Let us now discuss the transactions allowed in POP3. The basic POP3
response is either "+OK" or "-ERR". The commands that the client can
send include QUIT, STAT, LIST, RETR, DELE, NOOP, and RSET.

The STAT command is used to retrieve information about the maildrop,
including the number of messages and the size of the maildrop. The LIST
command takes a message number as an option, and the server retrieves
the information for that message; if no message number is specified,
information about all the messages in the maildrop is returned. The RETR
command takes a message number as an argument, and the server returns
the message corresponding to that message number. The DELE command
marks a message as deleted. The NOOP command is a no-operation
command. The RSET command resets the session by unmarking all
messages marked as deleted. The QUIT command is used by the client to
indicate that it has completed the session; the server then removes all
messages marked as deleted from the maildrop, unlocks the resources
associated with the maildrop, and closes the TCP connection.
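
These transactions map directly onto Python's standard poplib module;
the host and credentials below are placeholders:

import poplib

mailbox = poplib.POP3("mail.example.com", 110)
mailbox.user("user")                          # USER command
mailbox.pass_("secret")                       # PASS command
num_messages, total_bytes = mailbox.stat()    # STAT
print(num_messages, "messages,", total_bytes, "bytes")
if num_messages:
    response, lines, octets = mailbox.retr(1) # RETR 1
    print(b"\n".join(lines).decode("utf-8", "replace"))
mailbox.quit()    # QUIT: deletions are committed and the connection closed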

3.4 IMAP4

The Internet Message Access Protocol (IMAP) allows the client to access
and manipulate electronic mail messages on a server. In this section, we
discuss IMAP version 4, rev. 1 [IMAP4v1]. IMAP4 allows clients to
access remote mail folders (called mailboxes) and make them seem
functionally equivalent to a local mailbox. IMAP allows the creation,
deletion, and renaming of mailboxes, as well as MIME parsing, searching,
selective fetching of message attributes, and so on. Messages are accessed
in IMAP using unique identifiers or message sequence numbers. Unique
message identifiers persist across sessions, permitting a client to
resynchronize its state from a previous connection with the server.

The IMAP server listens on port 143 when TCP is used for transport. The
first step is the establishment of a connection between client and server.
Next, the server greeting is sent, followed by client-server interactions
consisting of a client command, server responses, and a server completion
result response. The client command that starts an operation is prefixed
with a short alphanumeric string called an identifier tag. This identifier is
unique for each client command. Data transmitted by the server to the
client is prefixed with "*" or "+"; "*" indicates that the response does not
indicate command completion (such responses are called untagged
responses). The three possible completion results are OK (success), NO
(failure), and BAD (protocol error). Server responses are generally of three
types: status responses, server data, and command continuation requests.

Once a connection is established, the client is in the non-authenticated
state. In this state, the client can only supply authentication commands
(e.g. LOGIN or AUTHENTICATE). Once authenticated, the client next
selects the mailbox to access before the commands that affect that mailbox
are allowed. This is achieved using the SELECT or EXAMINE command.
At this point, the interaction is in the selected state, wherein the client can
issue various commands. Finally, the client can initiate a logout using the
LOGOUT command, after which the connection is closed, or the server
can unilaterally terminate the session.

IMAP4 supports a number of commands, and we highlight some of the
commonly used ones:
SELECT: selects a particular mailbox
EXAMINE: similar to SELECT, but read-only
CREATE: creates a mailbox
DELETE: deletes a mailbox
RENAME: renames a mailbox
LIST: retrieves a subset of names from the complete set of names
available to the client
STATUS: gives the status of a mailbox
EXPUNGE: deletes all messages marked for deletion in the selected
mailbox
SEARCH: searches for messages that match given search criteria
FETCH: retrieves message data
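
A session of this kind can be sketched with Python's standard imaplib
module; the host, credentials, and search criterion are placeholders:

import imaplib

conn = imaplib.IMAP4("mail.example.com", 143)
conn.login("user", "secret")              # leave the non-authenticated state
conn.select("INBOX")                      # SELECT a mailbox
typ, data = conn.search(None, "UNSEEN")   # SEARCH with a criterion
for num in data[0].split():
    typ, msg_data = conn.fetch(num, "(RFC822)")   # FETCH message data
    print(num, len(msg_data[0][1]), "bytes")
conn.logout()                             # LOGOUT ends the session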

Unlike POP, IMAP allows simultaneous access to a single mailbox by
multiple clients. Note that IMAP does not specify any implementation
guarantees, so clients must be able to gracefully handle the failure of
multiple SELECTs on the same mailbox. IMAP search functions can also
be slow on mailboxes with a large number of messages, and must be used
with care.

4. IM ARCHITECTURE

In this section, we discuss the architecture of a prototype instant
messaging system. An instant messaging (IM) system enables
point-to-point delivery of messages within a short period of time (hence
"instant"). The messaging is generally initiated using client applications
such as Yahoo! Messenger. IM systems typically provide enhanced
services such as the maintenance of a list of friends or buddy list, and the
online status of friends. If the other party is online (i.e. currently connected
to the Internet and accessible via the IM client application), then any
messages directed to that party are delivered within a short period of time.
If the party is not online, then the messages may be stored and delivered
when they come online. Variations on such IM features and new
functionality are being added constantly.

At the lowest level, an IM system is an enhanced "talk" program. In the
early UNIX days, "talk" allowed communication between two parties
connected via a network. IM technology has come a long way since then,
with the ability to scale and handle millions of users and billions of
messages sent on a daily basis across the Internet. Furthermore, IM
systems have enhanced communication capabilities such as a list of
buddies or friends, their online/offline status, audio features, and
conferencing features.

[Figure: IM clients A and B connecting to an IM connection server, with friends storage, socket descriptor information, user authentication, online/offline status, and offline message stores]

Figure 30. Overview of a prototype IM system.

Before we go into a general IM architecture, let us illustrate the sequence
of steps involved:
- Client initiates connection to server.
- Client is authenticated by the server and logged on as user "A".
- Status of "A" on other clients is updated.
- Online status of friends of "A" is retrieved and displayed.
- "A" selects another user "B" to IM with.
- "A" sends a message to "B".
- Server receives the message and checks the status of "B".
- If "B" is online, the server sends the message to "B"; otherwise it stores
the message.

On the flip side, the sequence of steps needed when an IM client logs out
or disconnects is:
- Client "A" disconnects from the server.
- Server detects this, and releases the appropriate connection after
informing the connection manager.
- Status of user "A" is set to offline by sending a message to the status
server.
- All online friends of "A" are notified of the new status.
- Future messages to "A" are stored in the offline message store.

How do we build a system to achieve this form of messaging? In the
following, we present a prototype architecture of an IM system. The
different components of an IM system are illustrated in Figure 30. The
"User Authentication" module validates whether the user is a real user or
not. The status of a user, i.e. whether the user is "online" or "offline", is
managed by the "Status" module. The "IM connection" server is used as
the primary point of connection with the clients. Clients and IM servers
generally communicate with each other using a protocol, generally called
the IM protocol. There are efforts underway to define a common universal
instant messaging protocol format that allows interoperability of different
IM clients.

What happens during an IM session? The first step is to establish a
connection between the IM client and the IM server. There are many
possibilities: UDP, TCP, or a combination. UDP can be used for data
transmission, but delivery is not guaranteed. TCP involves maintaining the
connection descriptors for each open connection. The next issue is whether
the two parties communicate directly with each other or only through the
IM server. The simplest approach is to have one party send the IM
messages to the server; the IM server looks up the socket descriptor of the
receiving party, and writes (forwards) the message to that socket. If the IM
packet is a command to the server, then it is not forwarded to the other
party; rather, the server performs the appropriate action based on the IM
command and sends an acknowledgement back to the IM client. An
alternative is for the two IM parties to connect directly with each other
after the IM server authenticates both parties. That way the messages can
go directly between the two parties rather than incurring the overhead of
being forwarded by the IM server.

Now that we have a general picture, let us reiterate the design choices in
an IM system:
- Connection type: UDP, TCP, or a combination?
- Do clients connect directly?
- How to ensure privacy and security?

Security and privacy are achieved by encryption of the transmitted
messages. In order to handle many clients, the IM architecture is scaled by
replicating the IM connection server shown in Figure 30. The IM
connection server is now replaced by a pool of servers, and a central
"connection manager" holds information on which client/IP address is
connected to which specific IM connection server. When a client sends a
message to another party, the IM connection server of that client looks up
the IM connection server of the destination party by querying the
"connection manager". The "connection manager" returns the IP address
of the "IM connection server" that handles the destination party, and the
IM message is forwarded to that IM connection server. The IM connection
server then forwards the message to the destination party.
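
The routing step can be modelled in a few lines. In this toy, in-process
sketch, plain dictionaries stand in for the connection manager, the
per-server socket tables, and the offline message store; all names are
illustrative:

connection_manager = {"alice": "server-1", "bob": "server-2"}
server_sockets = {"server-1": {"alice": "socket-A"},
                  "server-2": {"bob": "socket-B"}}
offline_store = {}

def deliver(sender, recipient, text):
    server = connection_manager.get(recipient)
    if server is None:                    # recipient offline: store message
        offline_store.setdefault(recipient, []).append((sender, text))
        return "stored"
    sock = server_sockets[server][recipient]
    # A real server would write the message to the socket; we report the route.
    return "forwarded via %s to %s" % (server, sock)

print(deliver("alice", "bob", "hi"))      # forwarded via server-2 to socket-B
print(deliver("bob", "carol", "hello"))   # stored (carol is offline)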

Two important aspects dictate the performance and scalability
characteristics of IM systems:
(i) the management (open, close, read, write) of a large volume of socket
connections;
(ii) the ability of the various servers to communicate with each other (e.g.
one "IM connection server" with another, or with the "connection
manager", the "friends database server", the "status server", etc.).

How does the architecture shown scale to high volumes? Firstly, a
divide-and-conquer approach is used to achieve a distributed and scalable
design. Next, specialized servers perform specialized tasks. For instance,
the connection manager servers optimize the maintenance of, and the
reading/writing into, socket connections. As far as possible, state
information is not maintained in the servers. Carrying state information
can increase the complexity of the system, and result in centralized or
shared databases that can become serious bottlenecks.

5. COMMERCE APPLICATIONS

Commerce on the Web provides a large opportunity for businesses to
enable sales using the Internet. The last few years have seen an enormous
interest in the area of Web commerce, but the high short-term expectations
did not become a reality, resulting in the collapse of many dot-com
companies. Nevertheless, with the right business model, e-commerce is
re-emerging as a realistic monetary opportunity for companies, and a
compelling mechanism of purchase for consumers.

Web commerce includes a variety of applications, from subscription
services, shopping, and auctions, to banking and other transactional
services. In other words, any commerce activity that can leverage the Web
can be broadly categorized as Web commerce. The main difficulties facing
the large-scale adoption of Web commerce are:
a) Consumer confidence and trust
b) Security concerns
c) User privacy

d) Brick-and-mortar store mentality

Before a consumer purchases on the Web, the consumer must trust both
the medium and the Web site in order to disclose private information such
as credit card information. The consumer must be convinced that this
information will not be misused by the company, and that no other entity
will get unauthorized access to it. Furthermore, the commerce activity of
the user must be strictly confidential, and the privacy terms agreed to by
the user must not be violated in any manner. Consumers need a
trustworthy brand that provides goods of value and quality. In certain
applications such as airline and hotel reservations, the Web provides an
easy and convenient mechanism of shopping. For instance, the consumer
can search through different itineraries and prices before choosing a ticket
purchase.

Security is a very important concern of many users. When users give
personal information such as a credit card number, checking account
number, or address information, it is imperative that this information not
fall into the wrong hands. Privacy is another important area of concern to
the end user. With current technology, it is possible to keep a history of
what purchases a user has made and what Web sites he has visited. On the
one hand, such information can be used to make relevant offers to the
user, or advertise items pertinent to the user's interests. On the other hand,
the user is entitled to his privacy, and this user information should be used
in a manner consistent with the user's expectations. The final impediment
towards the adoption of e-commerce is the brick-and-mortar store
mentality. Traditionally, the purchase of goods has involved a physical
component, whereby the consumer examines the object that he wants to
buy. However, over the last few years, this mentality has shifted
enormously, and online shopping is becoming more popular, at least for
certain types of goods. Furthermore, enhanced Web interfaces enable the
shopper to examine audio, images, or video of items of interest.

6. OVERVIEW OF E-COMMERCE FRAMEWORKS

In this section, we present a summary of e-commerce frameworks
[CEN/ISSS 2001]. Next, we discuss how trading can be achieved in the
context of trading models such as RosettaNet and IOTP. Payment models
also play an important role in the development of e-commerce. Electronic
payment technologies are first presented, and then payment models such
as SET and the Interactive Financial Exchange (IFX) standard are covered.

Many of these technologies are built on top of XML. The primary reason
for the widespread adoption of XML is the ability to describe and
exchange information between collaborating applications or business
partners in a platform-independent manner.

6.1 E-commerce Frameworks

• BizTalk

The BizTalk framework enables communication between businesses and
organizations, and is built on standards like XML, HTTP, MIME
extensions, and SOAP⁷. Microsoft introduced BizTalk as a technology that
addresses interoperability issues, and enables the development of
XML-based messaging solutions between applications or businesses. The
BizTalk layer allows different transport protocols such as HTTP, SMTP,
and message-oriented middleware.

There are three concepts used in the specification of BizTalk:
a) BizTalk Framework Compliant (BFC) server: the messaging server that
processes BizTalk documents.
b) Application: the actual business logic and business data store, which
consumes business documents and communicates with a BFC server.
c) Business document: the business data, expressed in XML.

The BizTalk document is a SOAP message that contains the business
document in its body section. The header includes BizTalk-specific entries
including lifetime (duration of validity of the document), identity (to
identify the document), acceptance by the receiver, idempotence (the
ability to transmit and accept more than once without any difference), and
delivery or commitment receipts. Document handling and interchange are
controlled by the BizTags, which are a set of XML tags. A BizTalk
message is used to send BizTalk documents between BFC servers.
Single-hop transport can be secured using SSL, but multi-hop transport
requires mechanisms such as S/MIME.

7 SOAP is discussed in the chapter on "Web Services".



• ebXML

The electronic business XML (ebXML) framework uses the Unified
Modelling Language (UML) for modelling aspects and XML for syntactic
aspects. The various modules supported by ebXML systems include
business processes, core components, registry and repository, trading
partner information, and transport, routing and packaging. The ebXML
architecture is defined by two views: the business operational view and
the functional service view. The business operational view defines the
business processes using UML, and consists of a lexicon, design use-case
diagrams and descriptions, a business process creation phase in the form
of class diagrams, and a design phase which applies standardization.
Interoperability in ebXML is achieved using the Business Objects and the
Business Object Library created by analysis of the Business Objects.

The second view is the Functional Service View (FSV), which is the
framework to discover and convey the Business Object information. At
the core of the FSV are "distributed repositories" which hold the Business
Process and Information Models, the ebXML MetaModel, Trading Partner
Profiles, and the ebXML specification. ebXML registries serve the
purpose of retrieval from the repository. The Trading Partner Profiles
(TPP) define a partner's capabilities and the security mechanisms
supported, while the Trading Partner Agreements (TPA) define the
agreements between the trading partners.

• eCo Framework (CommerceNet)

The eCo framework was introduced by CommerceNet and focusses on
the notion of service discovery. eCo works on the premise that it is not
possible to build an e-commerce framework where there is one set of
documents, one interface model, and one transaction model. Rather,
depending on the needs of each business, there will be many
specifications. eCo provides the ability to discover these different
protocols, enabling negotiation on which protocols to use, and to process
transactions. All transactions are in the form of exchanged messages or
documents. The eCo framework is divided into seven layers, each of
which has an associated registry: Networks, Markets, Businesses,
Services, Interactions, Documents, and Information Items. The first three
layers (Networks, Markets and Businesses) are focussed on discovering
places providing a particular service. The remaining layers specify how a
transaction can be achieved on the products or services discovered. The
eCo framework also provides a mechanism for querying these layers of
information. This is achieved by the presence of a registry in each of the
layers. Furthermore, the definitions of queries and the response documents
are defined in each registry's public interface.

• Java E-Commerce Framework

Sun Microsystems developed a Java™ based infrastructure for
e-commerce. This suite includes support for a client-side wallet and a set
of commerce JavaBeans. The goal of the Java E-Commerce framework is
to support all standards and payment protocols. The client-side wallet is
the Java Wallet™, which contains "cassettes": a cassette is a digitally
signed container (a Java archive file) for Commerce JavaBean
components. The Java Wallet consists of the Java Commerce Client (JCC)
containing interfaces to the rest of the Java commerce platform, database
infrastructure for storing user information and transaction logging,
operations on financial aspects stored in the wallet, the user interface,
security models, and Java Commerce Messages to interact with commerce
servers.

• .NET

Microsoft has developed the .NET platform for the rapid adoption and
development of Web services⁸. Web services enable service registration,
lookup, and easy integration. An important aspect is that the .NET
platform is bundled with a base set of services, including personalized
user database management (called Passport), which includes e-commerce
related information like the user's electronic wallet and can be integrated
by vendors to perform transactions.

• OMG Commerce

OMG architectures use objects to enable electronic commerce functions.
An example of an OMG-compliant interface is the CORBA Electronic
Domain Architecture, which is comprised of document, community,
collaboration, and DOM mapping modules. Each module is detailed with
its interfaces, and each interface is described with the relevant attributes,
events, and operations.

8 Web services are covered separately in a later chapter.



6.2 Trading Models

• Ad Hoc Trading Models

One approach is to have ad hoc trading models where different
functional blocks are defined and implemented. For instance, a four-block
model with sourcing, order processing, supply chain management, and
settlement is a general model for business-to-consumer or
business-to-business applications.

• IOTP

The Internet Open Trading Protocol (IOTP) is maintained by the IETF
and aims to provide a unified framework for trading on the Internet. IOTP
defines the content, format, and sequence of messages that flow between
the trading parties, and is also based on XML. The goal of IOTP is to
provide a trading environment similar to the brick-and-mortar experience,
including the generation of invoices and receipts, consistent interfaces for
all trading steps independent of the parties, and the encapsulation of any
payment method. IOTP defines four types of trading exchange: offer,
payment, delivery, and physical goods. The IOTP specification defines the
roles of the consumer and merchant in XML, including the content,
format, and sequence of messages that are exchanged by such parties.

• OAG Framework

The ability for heterogeneous business applications to communicate with
each other is provided by the Open Applications Group (OAG) XML
framework, ranging from traditional ERP integration and supply chain
management to electronic commerce. Messaging again is XML-based, and
base processes such as purchases and invoices are defined. The different
components of an OAG framework include Inventory, Order Management,
Receivables, Cost, Pricing, and Ledger. OAG includes XML schemas for
businesses to share information. In particular, OAG caters to traditional
supply chain management and ERP integration needs. The Business
Object and its structure are defined in the OAG Integration Specification
(OAGIS).

• OBI

The Open Buying on the Internet (OBI) protocol provides a secure and
interoperable framework for business-to-business commerce development.
HTTP is used at the transport layer, and security is achieved by using SSL.
The OBI architecture consists of the requisitioner, the buying party, the
selling party, and the payment authority. The sequence of interactions is
product selection, order placement, approval, order fulfilment, and
payment.

• RossettaNet

A non-profit organization that supports e-commerce standards, RosettaNet
provides three elements to enable business-to-business interoperation:
Partner Interface Processes (PIP™), the RosettaNet Implementation
Framework (RNIF™), and business and technical dictionaries. In order to
trade with each other, partners agree on the PIPs to use, define a trading
partner agreement, and enable the e-business. At the lowest level, PIPs
specify actions, and are described using UML specifications. The actual
content of the business documents exchanged is in XML, along with
message guidelines. RosettaNet also specifies a message envelope to
package the messages. This envelope includes a MIME-encapsulated
preamble, delivery and service headers, and the actual payload data.
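
As a rough sketch, such an envelope is a MIME multipart message along
the following lines. The boundary string, part names and business document
shown here are illustrative only; this simplification does not reproduce the
normative RosettaNet packaging rules:

MIME-Version: 1.0
Content-Type: multipart/related; boundary="rn-part"

--rn-part
Content-Type: application/xml

<Preamble> ... message standard and version ... </Preamble>

--rn-part
Content-Type: application/xml

<ServiceHeader> ... trading partner identities, PIP code ... </ServiceHeader>

--rn-part
Content-Type: application/xml

<PurchaseOrderRequest> ... the PIP business document ... </PurchaseOrderRequest>

--rn-part--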

6.3 Payment Models

Electronic payment technologies

There are several technologies for electronic payment. Some of the early
technology embedded microchips in credit-card-like plastic cards. The
embedded microchip permits flexible options such as choice of payment
mechanisms (such as Visa or MasterCard), and the ability to store a large
amount of information. Various flavours of such "smart cards" include the
ability to recharge, transfer payments, and use in conjunction with credit
cards. Some examples of efforts on the development of smart cards include
the Integrated Circuit Card (ICC) Specification for payments, and VisaPay.
Another form of electronic payment is the "token". Tokens may not
necessarily be associated with real money, although they frequently are.
Tokens can be associated with a certain value and be used for electronic
transactions between parties. While electronic token systems are very
promising, there are many difficult issues, such as ensuring atomicity of
transactions involving tokens, and network issues when transporting tokens.

Micropayment technologies allow "low-value" transactions. The idea is
to allow cumulative settlement of very small charges. In some cases, the
payment is too small for electronic payment methods such as credit cards.
Micropayment technologies are also useful in pay-per-view type services
where the cost of an individual transaction could be very low, but the
cumulative transactions could add up to a significant amount. Some of the
efforts in the arena of micropayments include Common Markup for
Micropayment Per-Fee-Links (a W3C specification), IETF's Micropayment
Transfer Protocol (MPTP), MilliCent (Compaq), and IBM's Mpay
specification. Another category comprises the "home banking" standards
that allow integration with banks, credit card companies, (recurring)
business payments, and consumer payments. Examples of banking standards
for financial information exchange are the Homebanking Computer
Interface (HBCI), Open Financial Exchange (OFX) and Interactive
Financial Exchange (IFX), discussed later.

• SET

MasterCard and Visa jointly developed the Secure Electronic Transaction
(SET) specification in order to enable secure transfer of credit card
transactions over the Internet. The SET protocol enables the cardholder to
securely transmit payment instructions. To the merchant, the protocol
provides the ability to obtain authorization and receive payment for an
order. Public Key Infrastructure (PKI) and digital certificates provide the
ability to make secure transactions.

6.4 Example of a Financial Exchange standard: IFX

As an illustration of an XML payment specification, we will discuss the
Interactive Financial Exchange (IFX). IFX combines features from the
Open Financial Exchange (OFX) and GOLD standards. IFX eliminated the
in-band email present in OFX, enhanced bill payment and presentment,
added support for B2B transactions, and was designed for greater
scalability. The IFX framework aims at consistent naming standards,
consistent structures, and easy extensibility. At the core of IFX is a business
message specification that is described in XML. The lower level transport
used is generally HTTPS.
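
For instance, an IFX request might be carried over HTTP(S) roughly as
follows. This is a schematic sketch; the host name, path, and content type
shown here are illustrative assumptions, not values prescribed by the
standard:

POST /ifx HTTP/1.1
Host: ifxgateway.example.com
Content-Type: text/xml
Content-Length: 512

<?xml version="1.0" encoding="UTF-8"?>
<IFX>
  <SignonRq> ... </SignonRq>
  <PaySvcRq> ... </PaySvcRq>
</IFX>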

An IFX system can be an IFX service provider, an IFX gateway, or both.
When a client wants to initiate a transaction, it connects to an IFX
gateway server or an IFX server, and issues commands to perform services
or operations. These operations can then be transmitted to another IFX
gateway server or IFX service provider, or integrated into a legacy system.
The IFX server will have a functional component stack as shown in Table
42. Since the IFX server needs to communicate using HTTP(S) with the
IFX client, there is an HTTP server component. Since the document format
is specified by a DTD or XML schema, IFX systems need an XML parser
to parse the XML documents, and an XML generator to generate XML
messages. Similarly, on the other side of the stack are an XML parser and
an HTTP client to communicate with other service providers. The IFX
message handler interprets and processes the IFX messages.

Table 42. IFX gateway/service provider functional component stack.

HTTP Client
XML Parser/Builder
IFX Message Handler
XML Parser/Builder
HTTP Server

Every IFX request document must contain a SignonRq, called the signon
message. The services are then listed, and the actual message/command is
embedded with each service. Some of the services supported by the IFX
standard include base services such as enrolment and maintenance, banking
services, payment services, and billing and notification services. Messages
can be in the form of a request message or a response message. Several
types of common messages used in IFX include:

• Add request <xxxAddRq> and response <xxxAddRs>
• Modify request <xxxModRq> and response <xxxModRs>
• Delete request <xxxDelRq> and response <xxxDelRs>
• Cancel request <xxxCanRq> and response <xxxCanRs>
• Inquiry request <xxxInqRq> and response <xxxInqRs>
• Audit request <xxxAudRq> and response <xxxAudRs>

"xxx" can be replaced by the various services supported by IFX. For
instance, with the Payment service, the <PmtAddRq> creates a new
payment. Except for SignOn and SignOff, all IFX messages have service
wrappers associated with them. During the SignonRq process, the client
may request a session ID with a specific expiration time from the server.
This session ID may be used by the client on subsequent SignOn requests.
The use of a session ID also allows better session management.

Table 43. Example of an IFX document.

<?xml version="1.0" encoding="UTF-8"?>
<?ifx version="1.0.1"?>
<!DOCTYPE IFX PUBLIC "-//IFX Forum//DTD IFX 1.0.2.1//EN"
"http://www.ifxforum.org/dtd/ifx1.0.2.1.dtd">
<IFX>
<SignonRq>
...
</SignonRq>
<PaySvcRq>
<RqUID>some UUID</RqUID>
<SPName> example.com </SPName>
<PmtAddRq>
<RqUID>some UUID</RqUID>
...
</PmtAddRq>
</PaySvcRq>
</IFX>

An example of an IFX document is shown in Table 43. The first step is
the Signon request. Next, the Pay service block is initiated. The Request
Universally Unique Identifier (UUID) sent by the client is passed. The
Payment Add request is specified with the Request UUID sent by the
client. The SPName specifies the service provider name.
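
The corresponding response document mirrors this structure. The sketch
below is illustrative only; element contents are elided, and the status
aggregate shown is an assumption about the response layout rather than a
verbatim excerpt from the IFX specification:

<IFX>
<SignonRs> ... </SignonRs>
<PaySvcRs>
<RqUID>same UUID as in the request</RqUID>
<SPName> example.com </SPName>
<PmtAddRs>
<Status> ... status code indicating success or failure ... </Status>
...
</PmtAddRs>
</PaySvcRs>
</IFX>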

7. EXAMPLE ARCHITECTURE

The major components of an example e-commerce platform are shown in
Figure 31. The first component is the monetary electronic payment
information, also called the electronic wallet module. The goal of any e-
commerce system is to enable a commerce activity on the Web. For any
commerce activity, a method of payment is needed from the consumer. On
the Web, only electronic payments are possible. Thus, the "electronic wallet
or electronic payment information" module is a key component of any e-
commerce infrastructure. The second component is the "billing system",
whose functions range from periodic billing, verification and blocking of
funds, to generation of billing reports to the consumer, and so on. The billing
system typically interacts with a third party authority that performs the
actual verification and charging of funds from specified electronic payment
accounts. Another important component of an e-commerce infrastructure is
the "Fraud Controller". The purpose of the "Fraud Controller" is to minimize
the effect of fraud on the e-commerce system by blocking the accounts of
malicious users or hackers. The last component is the actual commerce
module that determines what goods are being sold to the consumer, and
often comprises its own set of components (e.g. servers and databases).
These application servers may also generate front-end Web pages for the
services or goods being sold, and may interact with a wide variety of other
external modules in order to accomplish the end goal: enabling the
consumer to purchase a service or goods through the system.

Figure 31. Prototype E-commerce architecture. (Block diagram: the
Commerce Application Server at the center, connected to the Inventory
Manager, Electronic Wallet, Billing Module, Shipping Module, Fraud
Controller, and Transaction Manager.)

7.1 Electronic Wallet


Every e-commerce system needs an electronic method of payment from
the user in order to complete a transaction. This electronic payment
information is in the form of credit card numbers or checking bank
accounts. Additional information, such as the address of the consumer, may
be needed for verification purposes. Such information can be provided by
the user for every purchase, but that tends to be tedious. An alternative is to
allow the creation and storage of an "electronic wallet" that is associated
with a specific user, much the way in which a user carries his own wallet.
This wallet storage system should be scalable, secure and fast.

7.2 Billing System

The heart of an e-commerce system lies in the billing components. The
functionality of the billing module includes the following:
a) Management of subscriptions, sales, and promotions.
b) Charging of billed amounts, including account verification, charging
and other integration with third party finance billing vendors.
c) Ability to manage, back up, and roll back transactions.
d) Generation of accounting reports.
e) Mechanisms to enable fraud protection in conjunction with the "Fraud
Controller" module.

The primary objective of the billing system is the support of a variety of
subscription and sales services. The design of the billing system should be
flexible enough to cater to a wide variety of product or service plans. For
some services, a monthly charge may be the appropriate form of charging
users. For other services, a pay-per-use model may be more meaningful. The
complexity of the product billing model increases significantly with special
offers, limited-time sales, promotions such as a discounted first month,
discounted pricing for certain sets of users, and so on. Furthermore, the
pricing model may have to change when the user has subscribed to multiple
services or products. The billing system must be flexible enough to cater to
such a vast variety of pricing models.

Figure 32. Pricing and Packaging. (Diagram: pricing and packaging flows,
including upgrades, downgrades, and cancellations.)



Scenario 1: Pay-per-Use
In this first scenario, the consumer selects an item or a service to
purchase. For example, the service may be access to a live broadcast event
between certain times. The user must be charged after they are successfully
able to watch the broadcast program. In such a scenario, it is meaningful to
charge the user for this single event, and deduct that amount after the user
has had access to that event. Generally, pay-per-use is a simple form of
purchase, since the user can be charged immediately after approving the
purchase. In some cases, complications may arise. For instance, what
happens if the user is charged for the broadcast service, starts watching the
event, but is disconnected by the server during the event? In such cases, the
payment account information of the user is verified at the time of purchase,
and the actual charge to the account is made after the user has successfully
watched the event.

Scenario 2: Subscription Model

Subscriptions generally have a periodic billing nature associated with
them. The user agrees to have a charge periodically deducted from his
payment account in order to get a continued service. There are many
variables in subscription billing, including the cycle of billing, the initial
subscription cost, and support of multiple pricing offers for the same
subscription service.

Subscriptions also require a module (see Figure 33) that monitors the
billing cycles of all the subscribers, and charges the appropriate users at the
appropriate dates for the respective amounts. Thus, unlike the pay-per-use
model, subscriptions require recurring charges to the user on a periodic
basis.

Figure 33. Subscription module for billing. (Diagram: the Services/Products
Manager and Pricing Models feed the Subscription Process, which
maintains a Subscription Database and drives the Billing Module.)

Scenario 3: Promotions/Discounts
The other variable in pricing models is the notion of promotions and
discounts. Different pricing models for the same service or product can be
established based on factors such as date of purchase, or discounts during
certain holidays. Management of such promotions makes the billing task
more complicated.

Scenario 4: Packaging
Let us assume that service "A" has a monthly subscription rate of "S1".
Service "B" has a monthly subscription rate of "S2". Services "A" and "B"
may be compelling as a packaged single product, perhaps sold at a
discounted price "S3". Such flexibility will allow a wide variety of
combinations in packaging products and selling bundled pricing to users.

7.3 Product Pricing

One approach to this pricing problem is to maintain different pricing
identifiers for each product with a unique pricing model. Thus, service "A"
at a subscription rate of "P1" would be assigned a pricing ID "I1", whereas
the same service "A" discounted at a rate "P2" would be assigned a new
pricing ID "I2". The pricing database maintains the list of pricing IDs and
the rules for computing the amounts to be billed to the customer at different
billing dates. At the implementation level, each pricing ID is associated with
a list of products or services. Each product or service may be identifiable by
a product identifier. This enables the representation of a wide variety of
pricing models for a variety of bundling options. The main disadvantage of
this approach is the rapid growth of the number of pricing IDs with just a
limited number of services.
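
As a concrete illustration, a pricing database entry might be represented
along the following lines. This is a hypothetical record layout invented here
for illustration, and is not part of any standard discussed above:

<PricingID id="I1">
<Product ref="service-A"/>
<Rule type="subscription" cycle="monthly" amount="P1"/>
</PricingID>

<PricingID id="I2">
<Product ref="service-A"/>
<Rule type="subscription" cycle="monthly" amount="P2"
promotion="first-month-discount"/>
</PricingID>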

The subscription database contains a list of pricing IDs for each user.
Periodically, the subscription monitor queries the subscription database for
the list of charges to be made at a particular date, and the users to be
charged. This list is translated into a list of charges and the users' electronic
payment account information. These charge sheets can either be routinely
pushed to a third party financial billing vendor, or pulled by the vendor,
and the accounts charged accordingly.

7.4 Delayed versus immediate charge

Let us consider the sequence of steps that actually occur during a charge
cycle. The simplest case is when a user purchases a single item and is
charged immediately. The sequence of steps is:

a) User selects an item to purchase.
b) User agrees to pay the amount for this item.
c) System validates the user's electronic payment account (with the billing
partner or financial institution). [Authorization step]
d) System requests a charge on the account. [Settlement step]
e) System acknowledges success of the purchase to the user.

There are charges incurred by the billing party during these steps. Step
(c), which is the authorization step, requires that the system verify the
validity of the user's electronic payment account. Such verification may
include validation of a credit card, verification of the current billing
address, and other security checks. Such validation is generally done with a
billing partner or a financial institution such as a credit card company, and
generally incurs some cost to the billing agency. In step (d), which is the
"settlement step", the account is charged for the appropriate amount, and
this action also generally incurs some cost. Clearly, the profit on the sale of
the item should take such validation and settlement costs into consideration.

What happens when a user buys multiple items? Would the billing
system repeatedly incur charges during the operations in steps (c) and (d)?
An alternative is the "fund reservation" method, which is best illustrated
with an example. Assume that a user wants to purchase 5 items "A", "B",
"C", "D", and "E". Each item costs a dollar. The default approach is
validation and charging for each purchase. The "fund reservation" approach
is as follows:

(a) Verify validity of the user's electronic payment account. [Authorization]
(b) Reserve "X" dollars from the user's account (e.g. X = 5). [Reserve Fund]
(c) Allow the user to select the set of items.
(d) User selects "A", "B", "C", "D", "E".
(e) Verify that the total cost of the purchase is less than or equal to the
reserved amount.
(f) User acknowledges the amount charged for this purchase.
(g) System charges the actual amount to the user's account with the billing
vendor/financial institution. [Settlement]
(h) System acknowledges successful completion of the purchase.

Since the verification and the charging of the account are done only once
in the above scenario, this approach saves cost. It requires the additional
step of "fund reservation". Reserved funds are normally released if the
charge is not made within a specified duration of time. We can see this is
similar to the concept of micropayments that we discussed earlier.
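
The same flow can be sketched as a request/response exchange between
the commerce system and the billing vendor. The message names below are
hypothetical, invented purely for illustration; a real deployment would use
the billing vendor's own interface or a standard such as IFX:

<AuthRq account="user-account"/>
<AuthRs status="valid"/>
<ReserveRq account="user-account" amount="5.00"/>
<ReserveRs reservationId="R42"/>
(the user selects items "A" through "E", totalling 5.00)
<SettleRq reservationId="R42" amount="5.00"/>
<SettleRs status="charged"/>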

7.5 Transaction Management


Each transaction is generally logged in a replicated database. The
database is used to generate account summaries and transactional reports,
and should provide a means of rolling back erroneous or fraudulent
transactions. Often the transaction manager is so tightly integrated with the
commerce application module that it is meaningful to consider it a part of
the commerce application module itself. Some of the tasks of the
transaction manager are to ensure the proper processing and completion of
orders, proper computation of financial accounting information (such as
taxation and accounting details), and backup of information.

7.6 Fraud Controller

Where there is money, fraud follows. Fraud surfaces in many forms,
ranging from abuse of promotions to misuse of stolen account information.
A simple illustration of abuse is as follows: if product "A" is promoted as
being free for the first month (and a charge after the first month), then
fraudulent users can sign up with different new accounts and get the service
for free by cancelling after the free month. This kind of fraud can be traced
in certain cases, and abuse prevented by blocking such malicious account
users. Fraud is not special to e-commerce, but rather is a side-effect of any
commerce activity. A credit card may be stolen, and used for purchases. In
such cases, accounts with the stolen credit cards need to be identified, and
access to the service blocked or terminated. Another issue is one of bad
credit. What if a user signs up with a valid credit card, but the card then
expires or is cancelled? The billing system in conjunction with the fraud
controller should be able to address such cases effectively in order to ensure
that the service is being delivered to the right customers, and not to
fraudulent users. Fraud control is an essential part of any profitable e-
commerce business.

7.7 Commerce Application module

The components that have been discussed so far are common to any
commerce activity. The core module that is unique to the actual commerce
application is the "Commerce Application" module. A paid email service or
a paid broadcast service would use the same common components, such as
the wallet and the billing infrastructure. But each of these services will have
its own application infrastructure, such as management of the email store
and authentication of user accounts. For instance, the shopping module
would maintain a database of items available for sale, the vendors the items
are available from, list prices and so on. A store hosting application will
need to have components that allow store owners to edit and manage their
store online, in addition to being able to upload inventories and fulfil orders.
The commerce application module may have another block that allows the
user to search through a list of items. Such e-commerce application specific
functionality is included in the commerce application module.

Examples of commerce applications include aggregation of shopping
content from different sites or databases, auctions, and online stores. There
are many common issues in developing commerce application modules:

• Distributed
• Redundant resources
• Fault tolerant design
• Diverse integration mechanisms
• Tight coupling with components such as transaction manager,
billing.
• Real-time processing
• Pipelined framework
• Atomicity and Recovery
• Inherent Security and Privacy features
• Application specific features

Many of the above concepts apply universally to designing scalable web
applications, as we have seen in earlier discussions. A distributed design
enables functional clarity as well as ease of recovery, replication, monitoring
and error isolation. In the case of commerce applications such as shopping,
different functional blocks can be distributed on separate units. For
instance, shopping cart management can be isolated to one unit, order
processing to another, search servers to another and so on. Furthermore,
storage and processing within each component can also be distributed across
different servers to improve performance. The second aspect is redundancy,
which includes redundancy of resources such as servers, and storage of
critical information (e.g. on multiple disks/RAID). It is important to analyse
distributed systems for points of failure, such as centralized data storage. If
this centralized storage goes down, the system should be able to continue to
operate seamlessly. What this entails is optimised replication and data
consistency procedures between different modules.

Commerce applications (especially aggregation applications) often need
to handle a diverse collection of data descriptions from varied sources.
Although XML has alleviated the problem to a large extent, data providers
may be reluctant to switch to application specific or non-standard formats.
For instance, in applications such as classifieds, different agents may submit
different formats of data with different fields. All this information must be
properly digested and incorporated together. As mentioned earlier, certain
commerce applications may need tight coupling with other modules such as
billing or the transaction manager. These components are isolated here to
indicate that they are functionally separate. An example may be in the way
orders are processed. As far as possible, different modules need to support
real-time processing, such as real-time transaction completion or inventory
checks. In cases where a real-time solution is not viable, the architecture
should enable pipelined processing. Thus, a user may request a purchase,
which is logged for further processing, and an acknowledgement sent to the
user. Internally, another module monitors this request queue and applies the
required processing, which is in turn passed to the next module (e.g. the
shipping module). At every stage, it is important to have a notion of
atomicity. For instance, each "significant operation" (such as marking an
order as completed, or charging a user's credit card) should be atomic in the
following sense: the operation should be logged appropriately, and in case
of failure, the operation should be repeatable without any side-effects or
errors. Another important aspect of commerce application servers is the
need to have another layer of security and privacy protection. Since critical
information about users is manipulated or used by the applications, it is
important to ensure privacy and security, independent of other
security/privacy features provided by other modules (such as a user
authentication module). Examples include not storing a user's credit card
information in persistent storage in the application server (note that this
may be stored in modules that handle such storage, such as wallet servers),
not storing user-sensitive information such as gender or age in cookies, and
even ensuring that such information is not passed from one state to the next
(in case a malicious person eavesdrops on the communication). Lastly, there
are many application specific components: in shopping, for instance,
specialized mechanisms of handling and classifying different types of items
into specific categories may be required. Another example is a
recommendation engine9 that analyses data and makes recommendations
that could be of interest to the user.

9 Recommendation engines are discussed in the chapter on "Web Mining".

8. CONCLUSION

Communication and commerce are the two most important application
areas of the World Wide Web. With the growing audience for messaging
applications such as IM, the Web has become an integral means of
communication, bringing together millions of users around the world. Mail
has been popular in academic circles since the invention of the Internet, but
gained mainstream adoption with its integration into the Web. Mail
protocols such as SMTP, POP and IMAP were discussed in this chapter.
The issues in Instant Messaging systems were examined, and a prototype
architecture was presented. Next, an overview of commerce frameworks
was covered, and XML standards for payment and interoperability between
vendors were discussed. An overview of an e-commerce platform was
summarized, and issues related to product pricing, billing, and components
such as fraud control were described.

FURTHER READING

Protocols such as SMTP, POP, and IMAP are covered in RFCs [RFC
2060], [RFC 1939], [RFC 2683] and [RFC 2821]. A good review of e-
commerce platforms is [CEN/ISSS 2001]. Other financial interoperability
standards are covered in [IETF ECML 2001], [IETF IOTP Req 2001], and
[IETF Pay IOTP 2001].

EXERCISES

1) Discuss the SMTP, POP, and IMAP protocols.

2) Design and implement a prototype distributed e-mail server. Discuss
how a proxy can be used to distribute the e-mail storage across multiple
servers (e.g. mapping different subsets of users to different machines).

3) Implement a prototype IM system. Discuss the advantages of using
UDP and TCP in this architecture.

4) What are the standardization and architectural issues in building e-
commerce platforms?

5) Build a shopping site that can list up to 50 items. Any user visiting the
site can add items to his/her shopping cart. Implement a simulated wallet
and inventory that will allow users to "purchase" those items.

6) What are the issues in payment systems, authorization and
settlements? Propose an architecture that enables flexible pricing models,
multiple payments, and extensive billing and reporting capabilities.
Chapter 8
Mobile Access

The sky remains infinitely vacant for earth there to build its heaven with dreams

Abstract The rapid development of mobile technology has fuelled growth in the area of
the wireless Web. Wireless Web technology enables access to information
from the World Wide Web on any mobile device: anywhere, anytime. In this
chapter, the technologies that make the "Wireless Web" a reality are
discussed, ranging from mobile communication infrastructure to delivery of
wireless markup data. The Wireless Application Protocol (WAP) and the
corresponding Wireless Markup Language (WML) are presented with
examples. Methods of generating wireless markup, including HTML
transcoding and XSLT, are summarized. The important area of the Short
Messaging Service (SMS) for mobile originated/terminated message delivery
and emergent trends in mobile access to information from the Internet are
covered.

Keywords: Global System for Mobile Communications (GSM), Wireless Application


Protocol (WAP), Wireless Markup Language (WML), Handheld Device
Markup Language (HDML), Compact HTML (cHTML), Short Messaging
Services (SMS), Transcoding Proxy, XSL approach to generation of wireless
content.


1. INTRODUCTION

In the last decade, the adoption of wireless devices has grown
immensely. Cell phones, pagers, and personal digital assistants (PDAs) are
popular devices for mobile access to information and communication. In the
last few years, with the rapid growth of the World Wide Web, the need for
mobile access to information from the Web has become compelling. This
ability to access information from the World Wide Web anywhere, anytime,
on a wireless device is called the "Wireless Web". Web applications that are
useful to users on a personal computer can now be accessed on a mobile
device. Thus, users can browse their email, check a stock quote or read the
news using Wireless Web technology. Technologies such as the Short
Messaging Service (SMS) provide the ability to "push" information to and
from devices. For example, the user can be notified about an event when it
occurs (such as the receipt of an email). With advances in technology, the
location of a particular device can be estimated to within a few feet, and
location-specific information can be rendered to the user. Thus, the mobile
Web provides a powerful opportunity to deliver a variety of services from
the World Wide Web to users on the move.

The technologies behind the Wireless Web are diverse, ranging from cell
towers to XML-generating web servers. In this chapter, the general
architecture and description of the Global System for Mobile
Communication is presented first. Next, the architecture of a popular
protocol suite used for implementing wireless applications, the Wireless
Application Protocol (WAP), is discussed. This leads to a description of the
markup languages commonly used in the Wireless Web, including the
Wireless Markup Language (WML), HDML, and cHTML. Limitations of
the current technology, and a discussion of emergent standards such as the
Third Generation (3G) initiative, are summarized.

2. MOBILE COMMUNICATION SYSTEMS

Analog cellular telephony was the first-generation technology of the
wireless industry. Advanced Mobile Phone Service (AMPS) was an early
analog technology introduced by AT&T. Frequency Division Multiple
Access (FDMA) was developed in conjunction with AMPS to increase
capacity by breaking up the frequency band into 30 channels.

Second generation (2G) [Clark 1999] wireless provides digital services
using technology with transmission speeds of up to 14.4 Kbps. The key
advantage of digital transmission methods is the increased bandwidth
achieved by the digitisation and compression of the data. Three common 2G
standards are Time Division Multiple Access (TDMA), Global System for
Mobile Communication (GSM), and Code Division Multiple Access
(CDMA), each of which operates in a different broadcast spectrum. Data
communication occurs using wireless modems, data adapters, or newer
connective technologies such as Bluetooth. Extensions of GSM such as
High-Speed Circuit Switched Data (HSCSD) and General Packet Radio
Service (GPRS) enable data transmission rates of up to 57.6 Kbps and 100
Kbps respectively.

Although digital cellular networks are rapidly gaining popularity around
the world, a vast portion of the United States still uses analog systems. We
will focus on the Global System for Mobile Communication (GSM) in the
following section to illustrate the general architecture. GSM is a digital
system based on narrowband Time Division Multiple Access (TDMA)
technology that provides a global wireless network, primarily using the
frequencies of GSM 900, GSM 1900, or GSM 1800. The three components
of GSM [GSM-IEC] are the switching system, the base station system, and
the operation and support system.

An overview of a GSM system is shown in Figure 34. The base station
performs all functions related to transmitting and receiving radio signals.
Base stations comprise Base Station Controllers (BSCs) and Base
Transceiver Stations (BTSs). Base Station Controllers consist of high-
capacity switches that enable control functions and the connection of the
BTSs to the switching system. The BTS consists of the radio equipment,
such as transceivers and antennas, that facilitates each cell of the mobile
network.

Figure 34. Overview of Global System for Mobile Communication.
(Diagram: Base Transceiver Stations connect to a Base Station Controller,
and the switching system links the mobile network to external networks
such as the PSTN, PLMN, and PSPDN through a GIWU.)

All handling of device information, subscriber information, and call
processing is done by the switching system. The switching system also
connects the mobile devices to other networks such as the PSTN (Public
Switched Telephone Network), PLMNs (Public Land Mobile Networks),
and PSPDNs (Packet Switched Public Data Networks). The central
co-ordinator of the switching system is the Mobile Services Switching
Center (MSC). The MSC does all the telephony control, transfer, channel
signalling and so on. The switching system contains a number of databases
for authentication, equipment and user subscription information. The
Equipment Identity Register (EIR) contains information about the identity
of each mobile device, and can be used to prevent access from stolen or
unauthorized devices. The information about each user is stored in the
Home Location Register (HLR), which includes the subscriber profile,
plan, location information and other subscriber information. When a user
moves out of the home area and enters a "roaming" area, the user's
temporary information is stored and processed in a Visitor Location
Register (VLR). The MSC that the VLR is connected to gets the
information from the appropriate HLR.

3. WIRELESS APPLICATION PROTOCOL


The Wireless Application Protocol (WAP) [WAP] is an application and
communications protocol suite that enables the development of wireless
applications that are connected to the Internet. While WAP uses Internet
standards such as XML, the underlying transmission is based on proprietary
binary protocols, with additional session information absent in traditional
HTTP. The binary transmission protocol is optimised for long latency and
low bandwidth, both of which are characteristic of mobile communication.
Thus the transmitted data is encoded during transport, and decoded upon
arrival.

A problem with mobile communication is the loss of connection, which
is alleviated by the storage of session state. Conceptually, WAP acts as an
intermediary between the Internet and the mobile devices, as shown in
Figure 35. As part of WAP, the Wireless Markup Language (WML) and
WMLScript have been defined. WML and WMLScript provide the syntax
for passing information onto wireless devices. WML is an XML language.
Devices that support WAP have a WML browser that interprets and renders
the XML on the display of the device. In this architecture, the backend web
servers feed XML information that is processed by the WAP gateway and
transmitted to the devices, which then render this information to the user.

Figure 35. WAP Architecture Overview. (Diagram: wireless devices
communicate with a WAP gateway, which in turn communicates with web
servers on the Internet.)



At the heart of the WAP protocol is the WAP stack. The WAP protocol
stack is built using various protocols at different layers of the stack. This is
shown in Table 44.

Table 44. WAP Protocol Stack

Layer Protocol
Application Wireless Application Environment (WAE)
Session Wireless Session Protocol (WSP)
Transaction Wireless Transaction Protocol (WTP)
Security Wireless Transport Layer Security (WTLS)
Transport Wireless Datagram Protocol (WDP), UDP/IP
Network SMS, CDMA, TDMA, PCS and other wireless bearers ...

The different protocols include:

• WDP - Wireless Datagram Protocol

The data is transported across the wireless bearers using the Wireless
Datagram Protocol.

• WTLS - Wireless Transport Layer Security

The data that is transmitted from the gateway to the handset travels
across the air interface. Transmitting confidential or commerce related
information requires that the transport be secure from eavesdroppers.
WTLS is a security protocol built on top of WDP that includes
encryption of the data transmitted to the handset or device. WTLS is
based on the Transport Layer Security protocol (TLS), which is derived
from the Secure Sockets Layer (SSL). Thus, integrity of data,
authentication, and privacy are ensured.

• WTP - Wireless Transaction Protocol

The Wireless Transaction Protocol provides a reliable mechanism of
transport over WDP, in addition to various transaction related
functionality.

• WSP - Wireless Session Protocol

When mobile users move from one zone to another, connections are
sometimes dropped. In order to alleviate this problem, the Wireless
Session Protocol allows the maintenance of light-weight session
information.

• HTTP Interface

An HTTP interface is required to fetch the WAP content from the
Internet as requested by the mobile device.

So why has a new suite of protocols been defined in WAP? Why were
Internet protocols such as TCP not simply used? The overhead of Internet-
oriented protocols such as TCP can be avoided using wireless-specific
protocols. For instance, reliable transport means different things in TCP/IP
and in wireless datagrams. TCP requires overhead to handle out-of-sequence
packets. However, this is redundant in wireless datagram transmission,
since there is only one possible route between the gateway and the handset.
In practice, the WAP gateway maps incoming requests to a number of
WAP proxies. Most of the Internet related work is done by the gateway,
including tasks such as DNS lookup. Another advantage is the eschewing of
a TCP stack at the handset/wireless device, which enables the WAP client
to be very "thin" (i.e. to have low processing power and memory
requirements). Furthermore, since a lot of the work is done at the gateway,
the amount of information transmitted is smaller than for a corresponding
HTTP request, thus leading to savings in wireless bandwidth. In addition,
the Wireless Telephony Application (WTA) allows call control
functionality. WAP also provides means for pushing information from a
web server to a specific device.

A final note on alternatives to WAP. It is important to understand that
WAP is just one way of providing mobile access to the Internet. WAP was
built to provide light-weight clients access to information from the Web.
Mobile devices such as Personal Digital Assistants (PDAs) often have
extensive computing power and support operating systems such as
Microsoft's Windows CE .NET. These devices are similar to the personal
computer, and can access the Internet through wireless modems or other
connectivity hardware and software. In such cases, client applications are
built to directly fetch and render XML documents, eliminating any need for
WAP. Other approaches include the Binary Runtime Environment for
Wireless (BREW) from Qualcomm, and Java 2 Micro Edition (J2ME) from
Sun Microsystems. With BREW, developers can create applications that
run on CDMA chipsets, and that can be directly downloaded onto a
BREW-enabled device.

4. WIRELESS MARKUP LANGUAGES

4.1 WML

The Wireless Markup Language was defined by the WAP Forum as part
of the WAP protocol to enable WAP-compliant devices to access
information from the Internet. WML is based on XML and has a DTD
specifying the structure of a valid WML document. WML pages are parsed
and rendered on the mobile device using a microbrowser that resides on the
phone or wireless device. A brief overview of the elements and attributes in
WML 1.1 is covered in this section.

4.1.1 Cards

The first statement of a WML document is the xml version declaration
followed by the DTD specification.

<?xml version='1.0'?>
<!DOCTYPE wml PUBLIC "-//WAPFORUM//DTD WML 1.1//EN"
"http://www.wapforum.org/DTD/wml_1.1.xml">

All WML content is enclosed in the root element 'wml' by using
<wml> and </wml>. WML documents are organized into decks. Each deck
consists of multiple cards. The basic block of a WML document is a "card".
The card element has two attributes: id and title. The title attribute is used to
provide additional information about a particular card, and the rendering of
the title varies depending on the phone vendor. Some phones display the
title at the top of the screen. The 'id' attribute identifies the particular card
in the deck. Thus, the 'id' attribute should be unique for every card in a
deck.

All text is enclosed in the <p> tag. The <p> tag represents a paragraph,
and all WML documents should have at least one block of text enclosed in
the <p> tags. A simple "hello world" example is shown below in Table 45.

Table 45. "Hello world" WML example.

<?xml version='1.0'?>
<!DOCTYPE wml PUBLIC "-//WAPFORUM//DTD WML 1.1//EN"
"http://www.wapforum.org/DTD/wml_1.1.xml">
<wml>
<card id="firstcard" title="Welcome">
<p>
Hello world
</p>
</card>
</wml>

4.1.2 Transitions

Jumping from one card to the next is done using the 'do' tag. The 'do'
tag has the attributes 'type' and 'label'. The 'type' attribute allows mapping
of keys or actions on the device to certain actions, while the 'label' attribute
supplies the text displayed for the action. Upon receiving a key action
specified in the type, the browser performs the task specified by the element
embedded in the 'do' tag, such as going to another card. The different
values for the 'type' attribute are accept, prev, help, and options. It should
be noted that the implementation of these actions can vary depending on the
device. Table 46 illustrates a transition from one card to the next when the
user clicks on the accept key on the device.

Table 46. WML Example illustrating transitions from one card to the next.

<?xml version='1.0'?>
<!DOCTYPE wml PUBLIC "-//WAPFORUM//DTD WML 1.1//EN"
"http://www.wapforum.org/DTD/wml_1.1.xml">
<wml>
<card id="firstcard" title="Welcome">
<do type="accept" label="Ok">
<go href="#nextcard"/>
</do>
<p>
Hello world
</p>
</card>

<card id="nextcard">
<do type="prev" label="Prev">
<go href="#firstcard"/>
</do>
<p>
In the next card.
</p>
</card>
</wml>

The example shown in Table 46 also illustrates the use of the 'go' tag.
The 'do' tag specifies the action, and the embedded 'go' element indicates
which card the control should transition to. Thus, the first card is displayed
on the device, and the text "Hello world" appears on the screen. When the
user presses an accept key (for instance the "ok" button on a Sprint phone),
the next card is loaded and the text "In the next card." is displayed. The card
"nextcard" has a 'go' tag within a 'do' tag that allows transition to the
previous card. If the user presses the 'previous' key (or equivalent), then the
card "firstcard" is loaded and the text "Hello world" displayed.

4.1.3 Anchors

The anchor tag plays the same role as the anchor tag in HTML. The
anchor tag is a child of the paragraph (<p>) element. An example of the
anchor tag is shown below in Table 47:

Table 47. Example of anchored text.

<card id="excard">
<p>
<anchor>
Click this
<go href="nextpage.wml"/>
</anchor>
</p>
</card>

When the user clicks on the text "Click this", then the next page specified
by the href attribute of the go tag is fetched and displayed to the user.

4.1.4 Input

Input from the user can be collected using the 'input' tag in WML. The
'input' tag has attributes 'type' and 'name'. The 'type' attribute indicates the
type of input collected (for instance 'text'). The 'name' attribute associates a
variable name with the input collected, and can be referenced later in another
card as shown below in Table 48. An input variable called 'secret' is
collected in card "firstcard". When the user presses the accept key, the card
"nextcard" is accessed and displayed to the user. The value of the variable is
referenced using "$(secret)" as shown in the card "nextcard". The 'do' tag in
this card contains a 'go' element which specifies the method of transition
(i.e. POST versus GET), and the URL to transition to. The 'postfield'
element specifies the name and value of the variables that will be posted to
the specified URI. It can be noted that the value of the parameter called
'mysecret' is set to the value of the input collected in the card "firstcard",
namely "$(secret)".

Table 48. Example of input collection and submission to backend server.

<?xml version='1.0'?>
<!DOCTYPE wml PUBLIC "-//WAPFORUM//DTD WML 1.1//EN"
"http://www.wapforum.org/DTD/wml_1.1.xml">
<wml>
<card id="firstcard">
<do type="accept">
<go href="#nextcard"/>
</do>
<p>
Enter your secret:
<input type="text" name="secret"/>
</p>
</card>

<card id="nextcard">
<do type="accept">
<go method="post" href="http://www.wapserver.org/process.cgi">
<postfield name="mysecret" value="$(secret)"/>
</go>
</do>
<p>
Your secret is: $(secret)
Click ok if you want to continue.
</p>
</card>
</wml>

Another way of getting input from the user is selection from a list of
values. This is achieved using the 'select' and 'option' elements as shown
below:

<p>
Age:
<select name="age">
<option value="child">0-13</option>
<option value="teen">14-19</option>
<option value="adult">20-60</option>
<option value="senior">61-100</option>
</select>
</p>

4.1.5 Images

WAP supports the Wireless Bit Map Picture format (WBMP). The 'img'
element allows the inclusion of images in this format in a paragraph
element:

<p>
<img src="home.wbmp"/>
</p>

The value of the 'src' attribute indicates the image location. Similar to
the HTML 'img' tag, other attributes include alt, width and height.

4.1.6 Events and timers

WAP supports a no-action timer that allows automatic transitioning from
one WML page to another. This is achieved by specifying the 'ontimer'
attribute of a 'card' element. The duration of the timer can be set by using
the 'timer' tag, whose value attribute specifies the actual duration.

<card id="testtimer" ontimer="#NextCard">
<timer value="100"/>
<p>
You'll be automatically taken to the next card.
Wait...
</p>
</card>

4.2 WML2.0

WML version 2.0 was released by the WAP Forum, and provides
extensions to the WML 1.1 discussed in the preceding sections. The
motivation of WML2.0 is to adopt the syntax and semantics of other
standards, namely XHTML Basic and the CSS Mobile Profile, while
retaining backward compatibility with WML1. In that regard, the WML2
specification includes all the elements, attributes, and attribute values of
XHTML and CSS, while retaining only those elements from WML1 that
cannot be expressed with XHTML and CSS. These WML1-compatible
elements are prefixed with "wml:", and include the following:

• wml:access
• wml:anchor
• wml:card
• wml:do
• wml:getvar
• wml:go
• wml:noop
• wml:onevent
• wml:postfield
• wml:prev
• wml:refresh
• wml:setvar
• wml:timer

Furthermore, the XHTML elements body, html, img, input, meta, option,
p, select, and textarea have additional WML1-related attributes. The reader
can find detailed descriptions of the WML2 standard at the WAP Forum
(http://www.wapforum.com).
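
A WML2 document therefore looks essentially like an XHTML document
with occasional wml:-prefixed elements. The fragment below is a schematic
illustration only; in particular, the wml namespace declaration is abbreviated
and should not be taken as the normative URI:

<html xmlns="http://www.w3.org/1999/xhtml" xmlns:wml="...">
<body>
<wml:card id="main">
<p>
Hello world
<wml:do type="accept" label="Ok">
<wml:go href="#next"/>
</wml:do>
</p>
</wml:card>
</body>
</html>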

4.3 HDML, cHTML


WML is the standard markup language supported by WAP. The
Handheld Device Markup Language (HDML) [HDML] was introduced in
1997 by Unwired Planet, and later submitted to the World Wide Web
Consortium for consideration. This language is a precursor to the WAP
standardization effort. Unlike WML, HDML is not based on XML, and
therefore does not have a DTD or a schema associated with it. HDML also
does not support scripting.

Another standard competitive with WML is compact HTML (cHTML)
[cHTML]. Rather than create a new set of elements and attributes, cHTML
attempts to identify a subset of an existing standard that is suitable for the
presentation of information on devices with limited display capabilities,
such as mobile phones and PDAs. An additional motivation for using a
subset of HTML is to eschew the need for generating WML pages or
converting existing HTML pages into WML/HDML. Furthermore, compact
HTML is tailored towards devices with limited display, memory and
computation capabilities. Thus, tables and frames that require large amounts
of memory are discarded. Additionally, image maps, background colors and
images, multiple character fonts, and stylesheets are not supported in
cHTML. cHTML also restricts input buffers to 512 bytes, and SELECT
buffers to 4 Kbytes.
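
Since cHTML is simply a restricted subset of HTML, a valid cHTML page
is just a small, plain HTML page. The example below is an illustrative
sketch (the accesskey attribute, used for number-key navigation on phones,
is an assumption about the supported subset):

<html>
<head><title>Quotes</title></head>
<body>
<h1>Stock Quotes</h1>
<p>ACME: 42.10</p>
<a href="detail.html" accesskey="1">1. Details</a>
</body>
</html>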

5. GENERATING WIRELESS CONTENT

5.1 Introduction

We have seen how information is made accessible on wireless devices by
providing XML data in an appropriate wireless markup format. How do web
servers generate such data? The Web is driven by other markup standards
like HTML, and an important issue is how systems maintain multiple XML
language formats. There are three primary approaches to generating
wireless markup content: transcoding from HTML, using style sheets, or
directly generating wireless markup content. In the first approach, existing
HTML pages are parsed and patterns used to extract data components. These
components are then presented in the appropriate wireless data markup
formats. The second approach is the XSL approach, which was discussed in
an earlier chapter on data markup. In the XSL approach, each wireless
markup language is represented by style sheets, and the same XML data is
rendered using different style sheets on different devices. In practice, the
ideal solution lies in between the two approaches, although methods that are
closer to the XSL approach are efficient, scalable and easier to maintain. The
two approaches are illustrated in Figure 36.

Figure 36. Two approaches to generating wireless markup. (Diagram: in the
transcoding approach, HTML from a web server passes through a
transcoding proxy to the mobile device; in the XSL approach, XML data
and style sheets are combined by an XSLT processor.)

5.2 Transcoding HTML


Early approaches to the generation of wireless markup primarily focussed
on conversion of existing HTML pages to a wireless markup language, such
as WML. There are two primary advantages with this approach:
a) HTML is a universal standard with hundreds of millions of pages
already available in this markup.
b) Instead of needing multiple systems to separately generate the same
content in different wireless markup languages, it is easier to
transform the same HTML page into the different markup languages.

Transforming HTML documents[Hori et al 2000,Kaasinen et al 2000] to


a wireless markup format is achieved by "transcoding proxies". The term
"transcoding" is used to highlight the purpose of such servers; namely to
transform from one (markup) format to another. These servers are "proxies"
since they do not contain any original information; rather they act as an
intermediary to convert, cache and pass on web content in the form of
appropriately formatted wireless markup. In reality, transcoding proxies
handle many more features such as maintaining state information, efficient
parsing of the document, and extraction of relevant parts of the original
document.

While the generalized approach of transforming HTML to WML seems
elegant, the utility of this approach is limited due to a variety of difficulties.
The HTML page is designed for display on a large screen with browsers
running on high-memory, high-computing-power computers. It is difficult
to transform information presented on a large display onto a small screen.
For instance, breaking an HTML document into different "frames" might
make sense on a desktop, but would be meaningless for a display that is
5x40 characters. Furthermore, it has been difficult to standardize and ensure
that HTML pages are valid. The sources of HTML pages are varied, ranging
from human-authored HTML pages to software-generated HTML. While
standards such as XHTML force pages to be compliant with the XHTML
document structure, many existing HTML pages seldom follow any specific
standard of authoring. In reality, it is difficult to specify a standard that
allows markup of HTML tags so that the document structure is meaningful
for other devices, such as mobile devices with limited displays. For
instance, some authors may highlight important headers by embedding
<H1> tags, while others may not. How does one break up the content of a
large HTML page into a meaningful sequence of cards that are displayed to
the user? How are fields in an HTML form mapped to their counterparts in
a wireless markup format?

The general steps involved in a transcoding proxy are listed below:

i) Pre-processing of the HTML document.
ii) Extraction of document structure.
iii) Application of HTML to WML transformation rules.
iv) Segmentation of data into cards.
v) Generation and caching of WML documents.

Figure 37 shows the building blocks of the transcoding approach. The first
step is the pre-processing of the HTML page. HTML pages contain a
number of tags that are pertinent to devices with large-resolution,
high-graphics-capability displays. In many cases, tags are used
inappropriately for visual rendering effects rather than specification of
document structure. For instance, <H1> tags may be used instead of
<FONT> to highlight text. Although the adoption of XHTML and the
separation of content and style using stylesheets alleviate this issue, it is
still difficult to standardize the utilization of HTML tags for text markup.
Cleanup includes removal of frames and framesets, removal of meta tags,
and elimination of image and audio. The next step is the parsing of the
pre-processed HTML page in order to identify the document structure. The
result of parsing is a document tree, with each internal node in the tree
representing a tag and each leaf corresponding to text data. Transcoding
rules determine how this tree structure extracted from the HTML page maps
to a similar WML document tree. For instance, the transcoding proxy may
generate a WML page listing all the header (H1) level text as a menu. When
the user chooses an item from this menu, the page transitions to another
WML card with the text that follows this header line in the original HTML
page. The transformation process may be enhanced by adding meta tag
information to the HTML page in order to aid the transcoding process [Hori
et al 2000]. The next issue is the division of the original document into
cards. Since the display size and device memory of many mobile devices
are smaller than those of desktop monitors, the text that is displayed needs
to be reduced to smaller chunks. Furthermore, the sequence between these
chunks needs to be maintained, whether between cards or WML pages. The
last step of the transcoding proxy is the caching of the WML documents
returned to the clients.
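
As a simple illustration of such transformation rules, a transcoding proxy
might map header-level text to a menu card roughly as follows (an
illustrative sketch; real transcoding rule sets are far more elaborate). Given
the HTML fragment:

<h1>News</h1>
<p>Market rallies ...</p>
<h1>Sports</h1>
<p>Local team wins ...</p>

the proxy could emit the WML deck:

<wml>
<card id="menu" title="Index">
<p>
<a href="#c1">News</a><br/>
<a href="#c2">Sports</a>
</p>
</card>
<card id="c1"><p>Market rallies ...</p></card>
<card id="c2"><p>Local team wins ...</p></card>
</wml>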

Figure 37. Transcoding Proxy Architecture. (Diagram: an HTML document
is pre-processed, its document structure is extracted, HTML-to-WML
transformation rules are applied, and the resulting WML cards are generated
and cached.)



5.3 XSL Approach


The second approach is the utilization of XML and XSL technology to
generate pages in a variety of wireless markup languages. In chapter 2, we
discussed how extensible style sheets can be used in conjunction with the
extensible markup language in order to separate content from presentation
style. The combination of XSLT and XML provides a general framework
for the generation of a variety of wireless markup documents from a
common set of data representations. Figure 38 illustrates how a single XML
document can be used in conjunction with different style sheets to generate
WML, HDML, and other wireless markup formats.

Figure 38. XSLT Approach to Wireless Markup Document Generation.
(Diagram: common XML data is transformed by WML, HDML, and
cHTML style sheets for delivery to WAP devices, HDML devices, and
PDAs respectively.)

Let us consider the XML document shown in Table 49. The root element
is "EmployeeRecord", which contains the element "Name". Table 50 shows
a style sheet (wml.xsl) that is used to transform the XML document in Table
49 to the following WML:

<wml>
<card>
<p>
Albert Tan
</p>
</card>
</wml>

Table 49. Example XML Document

<?xml version="1.0"?>
<!DOCTYPE EmployeeRecord SYSTEM "ER.dtd">

<!-- XSL stylesheet specified below -->
<?xml-stylesheet type="text/xsl" href="wml.xsl"?>

<EmployeeRecord>
<Name> Albert Tan </Name>
</EmployeeRecord>

Table 50. Example XSL stylesheet for generating WML.

<?xml version="1.0"?>

<!-- Define this document to be a stylesheet and declare the namespace -->
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

<!-- Apply stylesheet to the whole document -->
<xsl:template match="/">

<wml>
<card>
<p>

<!-- Loop over each EmployeeRecord element -->
<xsl:for-each select="EmployeeRecord">

<!-- Insert the data of the Name element -->
<xsl:value-of select="Name"/>

</xsl:for-each>

</p>
</card>
</wml>

<!-- End of template tag -->
</xsl:template>

<!-- Ending stylesheet tag -->
</xsl:stylesheet>

The same XML document is used to generate a wide variety of wireless
markup using XSL transformations. Unfortunately, it is rarely as simple as
developing a single stylesheet per markup language. In practice, the
transformations required are quite complex, and are often simplified by
using task information. For instance, if the developer knows that the XML
data is for stock quotes, then a specific XSL stylesheet can be developed,
which is different from another stylesheet used for rendering sports data in
WML. In the end, the challenge lies in developing and managing these
stylesheets in a manner that allows efficient sharing of stylesheets across
tasks while limiting complexity.
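To make the pipeline concrete, a stylesheet such as wml.xsl from Table 50 can be applied programmatically. The following minimal sketch uses the standard Java TrAX API; the file name employee.xml is an assumed copy of the document in Table 49:

import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class WmlGenerator {
    public static void main(String[] args) throws Exception {
        // Compile the stylesheet once; it can be reused for every request.
        TransformerFactory factory = TransformerFactory.newInstance();
        Transformer wml = factory.newTransformer(new StreamSource("wml.xsl"));
        // Apply it to the employee record and write the WML deck out.
        wml.transform(new StreamSource("employee.xml"),
                      new StreamResult(System.out));
    }
}

Generating HDML or cHTML for the same data then amounts to selecting a different stylesheet (e.g. an hdml.xsl) for the same XML document.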

6. SHORT MESSAGING SERVICE


Short Messaging Service (SMS) [SMS-IETF] is a global service for
transmitting alphanumeric messages between mobile subscribers and other
messaging systems such as pagers, e-mail and voice-mail systems. GSM
included SMS at the onset in 1991. SMS allows guaranteed delivery of
messages such as notifications and alerts. SMS provides a reliable and low-
cost communication mechanism that allows seamless integration of data and
messaging services, and for transmission of short messages to/from wireless
devices. SMS also supports point-to-point services: mobile-originated short
message (MO-SM), and mobile-terminated short message (MT-SM). MO-
SM capable devices are able to send short messages to other mobile
subscribers, or to subscribers of pagers or on other networks (including the
Internet). MT-SM enables delivery of messages to the device from the
SMSC, and these messages can be sent to the SMSC by another mobile
device, or by other short messaging entities such as paging systems, voicemail
systems, or messaging servers on the Web. The service is called "short
messaging" service since the message length is limited to a few hundred
bytes (e.g. 160 characters in a GSM network). In addition to integration with
voicemail, e-mail, and paging systems, SMS integrates seamlessly with
information services and also supports WAP messages. SMS supports push
and pull approaches, which enable selective delivery based on an incoming
request.

The GSM architecture shown in figure 1 can be extended to include the
following components in order to support SMS. The main component of an
SMS system is a Short Message Service Center (SMSC) that integrates
with the mobile network through a Signal Transfer Point (STP). An STP is a
networking element that allows connections over an SS7 based signalling
network.

Two important aspects of SMS are:

a) Low-bandwidth message transmission
b) Out-of-band packet delivery

The out-of-band delivery allows the wireless device to send or receive a
short message even if a voice or data call is in progress. The SMSC comprises
hardware and software that enable the storage and transmission of
messages between short messaging entities (such as email, voicemail, and
other entities) and the mobile devices. In the case of GSM, the HLR and
VLR databases are used for accessing subscriber information. Figure 39
shows an overview of the SMS architecture.


Figure 39. Overview of SMS Architecture.

SMS messages typically have the following three elements: message
expiration information, message priority, and message escalation
information. The message expiration information indicates the duration of
validity of a message. This is useful when a message delivery fails, and the
message needs to be held for retransmission. The second feature, message
priority, allows the distinction between urgent and normal messages.
During transmission, the SMSC gives higher priority to processing the
urgent messages when compared with normal messages. What happens
when the expiration time of the message is reached, and the message has not
been transmitted? The message escalation element allows the SMSC to have
alternate escalation procedures, such as sending the message to the user
through an alternate system (such as a paging network).
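These three elements can be pictured as fields carried by the message itself. The following is a minimal sketch, not drawn from any SMS specification; the onDeliveryFailure hook and the escalation action are illustrative assumptions:

import java.time.Duration;
import java.time.Instant;

public class ShortMessage {
    enum Priority { NORMAL, URGENT }

    private final String destination;  // subscriber address
    private final String text;         // the short message payload
    private final Instant expiresAt;   // message expiration: end of validity period
    private final Priority priority;   // message priority: urgent vs. normal
    private final Runnable escalation; // message escalation: alternate delivery action

    public ShortMessage(String destination, String text, Duration validity,
                        Priority priority, Runnable escalation) {
        this.destination = destination;
        this.text = text;
        this.expiresAt = Instant.now().plus(validity);
        this.priority = priority;
        this.escalation = escalation;
    }

    /** Invoked (hypothetically, by the SMSC) when a delivery attempt fails. */
    public void onDeliveryFailure() {
        if (Instant.now().isAfter(expiresAt)) {
            escalation.run(); // expired: escalate, e.g. route via a paging network
        }
        // otherwise the message is held for retransmission
    }
}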

Signalling for SMS is achieved using a standard mobile application part
(MAP) that is built on top of the transaction capabilities application part
(TCAP) in SS7. The MAP layer provides functionality to support SMS:

• Routing Information Request
• Point-to-Point Message Delivery
• SMS Message Waiting Indication
• Service Center Alert

The first step in the transmission of an SMS message is the determination
of the routing information. The SMSC retrieves this information by querying
the destination device's HLR. With this information, the SMSC determines
the appropriate MSC servicing the destination device, and attempts to
transmit the message. This transmission is achieved by the point-to-point
message delivery operations in MAP. This delivery is confirmed; the result
is either success or failure with the various reasons associated with it (for
instance, temporary failure of network or device). In case of failure, the
SMSC sends a request to the destination device's HLR to notify when the
device is available for service. This request is achieved using the SMS
message waiting indication functions. Lastly, the HLR needs a mechanism
to notify the SMSC when the device becomes reachable again; this is done
using the service center alert functionality of the MAP. A sequence of
requests that occur for the delivery of a short message from a Web Mobile
Service Entity to a particular mobile device is listed in Table 51.

Table 51. Steps involved in transmission of a SMS message to a mobile device (GSM).
1. Short message submitted from Web Short Message Entity to SMSC.
2. SMSC processes message, then requests routing information for this mobile subscriber from the HLR.
3. HLR sends routing information: MSC handling the mobile subscriber.
4. SMSC sends the short message to the MSC (Forward Short Message operation).
5. MSC receives message, and gets subscriber information from the VLR.
6. MSC forwards short message to appropriate mobile station (Forward Message operation).
7. Result of the forward message operation is transmitted from the MSC back to the SMSC.
8. Delivery status is sent by the SMSC to the Web Short Message Entity (if requested).

A similar sequence of operations is involved in the transmission of a
short message from a mobile device to a short messaging entity or another
mobile device. Addressing of the mobile device can take various forms,
ranging from alphanumeric codes to expanded phone numbers; all are
generally in the form of an email address, which enables seamless integration
with e-mail servers. An example for a Sprint PCS customer with phone
number 'xyz' would be xyz@wireless.sprintpcs.com. The Web Short
Messaging Entity would process all the information for subscribers, and
generate alert short message lists that need to be delivered to the SMSC.
This can be done using an email operation to the appropriate SMS device
destination address. The key difficulty in this process is ensuring that SMS
messages (also called alerts) are sent on time. If 10,000 subscribers request
notification when a particular stock hits a particular high, then ideally all the
alerts need to be sent at around the same time (within a very short time
window). The number of processes or threads that must handle such
notifications needs to be tuned based on the average and maximum delivery
loads on each system.
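A sketch of such a dispatcher is shown below, using a fixed-size thread pool; the pool size of 50 and the sendViaSmtp hook (which would mail the message to the subscriber's SMS e-mail address) are assumptions to be tuned and filled in:

import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class AlertDispatcher {
    // Pool size would be tuned to average and peak delivery loads.
    private final ExecutorService pool = Executors.newFixedThreadPool(50);

    /** Fan a burst of alerts out across the pool so they all go out
     *  within a short time window. */
    public void dispatch(List<String> subscriberAddresses, String alertText) {
        for (String address : subscriberAddresses) {
            pool.submit(() -> sendViaSmtp(address, alertText));
        }
    }

    private void sendViaSmtp(String address, String text) {
        // Hypothetical hook: mail the text to the SMS gateway address,
        // e.g. xyz@wireless.example.com, using an SMTP library.
        System.out.println("alert to " + address + ": " + text);
    }
}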

7. EMERGING TRENDS

What are the limitations of the Wireless Web? Mobile communication
coupled with vast information access from the World Wide Web seems to be
an indispensable application. The limit of the wireless web is set only by
technology. Imagine streaming media that you can access on your wireless
device, or sharing a family picture album on your phone: these are possible,
but not ready for the end consumer due to two main reasons: input/output
limitations and bandwidth. Emerging trends that address these and other
personalization issues are presented in this section.

7.1 User Interface


An important aspect of the wireless web is the user interface. Phone displays
are very small (even as small as 3 x 20 characters), and the number of keys
is limited. This is sufficient for skimming through a short email, or browsing a
list of newly released movies, but for searching through a complex list of
items such as yellow pages, the display and keypad limitations may hinder
ease of use.

Next, telephone keypads are alphanumerically coded: key 2 maps to the 3
letters 'ABC'. A common technique used is called "tapping", whereby the
user taps the same key multiple times to skip from one character to the next
(and to variants such as uppercase letters). While such approaches can be
refined by word pattern analysis (by probabilistically weighting letter
occurrences in a string, also called "predictive typing"), there are fundamental
limitations to the device input and output capabilities that restrict use.
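As an illustration of the idea, the sketch below maps words to keypad digit sequences and returns the dictionary words matching a given key sequence; a real implementation would rank candidates by letter or word frequency rather than rely on a hand-ordered list:

import java.util.*;

public class PredictiveTyping {
    // Standard keypad mapping: key 2 -> "abc", key 3 -> "def", etc.
    private static final Map<Character, Character> LETTER_TO_KEY = new HashMap<>();
    static {
        String[] keys = {"abc", "def", "ghi", "jkl", "mno", "pqrs", "tuv", "wxyz"};
        for (int i = 0; i < keys.length; i++)
            for (char c : keys[i].toCharArray())
                LETTER_TO_KEY.put(c, (char) ('2' + i));
    }

    /** Encode a word as the digit sequence that would type it. */
    static String encode(String word) {
        StringBuilder digits = new StringBuilder();
        for (char c : word.toLowerCase().toCharArray())
            digits.append(LETTER_TO_KEY.get(c));
        return digits.toString();
    }

    /** Return all dictionary words matching one key sequence. */
    static List<String> candidates(String keyPresses, List<String> dictionary) {
        List<String> matches = new ArrayList<>();
        for (String word : dictionary)
            if (encode(word).equals(keyPresses))
                matches.add(word);
        return matches;
    }

    public static void main(String[] args) {
        List<String> dict = Arrays.asList("home", "good", "gone", "hood");
        // All four words share the key sequence 4-6-6-3.
        System.out.println(candidates("4663", dict)); // [home, good, gone, hood]
    }
}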

An emerging technology that aims to alleviate this problem is speech
recognition, whereby the user input can be spoken into the device. Standards
such as VoiceXML [VoiceXML] are being used extensively to develop
speech-enabled applications. However, current technology does not allow
seamless integration of voice and data access. Furthermore, multi-modal
access standards for data and voice are not yet established and endorsed.

7.2 Higher Bandwidth

Another important limitation of mobile technology is the bandwidth.
GSM offers bit rates of 9.6 Kbps, and the slow transmission rates limit the
practical use of numerous wireless applications that require higher
bandwidths. General Packet Radio Service (GPRS) is a standard for wireless
communication that offers transmission speeds of over 100 Kbps. This is
especially useful for high-bandwidth applications such as e-mail access, or
retaining a "permanent" connection to the device. GPRS is a stepping stone
for the larger third generation wireless (3G) initiative. High-speed circuit-
switched data technology (HSCSD) and GPRS are GSM extensions that use
packet interleaving in order to achieve high bandwidths, and are referred to
as 2.5G technologies.

With 3G, bandwidths in the range of 144 Kbps (for faster moving
mobile users) to 384 Kbps (for slower moving users) are expected. High
bandwidths open up a multitude of wireless applications: the phone can be a
music player with streaming audio downloaded from the Web, or it can be
your "always-on" connection to the Internet. Such grand visions, which are
driving 3G, are becoming near-future technology. Carriers such as
NTT DoCoMo in Japan are already building (i-mode) services around high-
bandwidth technology, with specialized devices that offer a wide range of
information access and services.

7.3 SIM Cards

User identity is an important concept in mobile services. If a user travels
from one location to another, he needs access to his personalized
information, irrespective of which device he uses. Smart cards maintain and
provide access to the user's identity in a manner that is separated from the
device. Thus, the user can carry his Subscriber Identity Module (SIM) card
[SIM-IEC] and plug it into any device to have immediate access to his
personalized information. Access to subscriber information can be done
using intelligent networks, but SIM cards provide a graceful and scalable
solution along with security and reliability. Furthermore, if 'smart cards'
integrate a variety of personal information including credit card, bank card,
phone subscription, personalized web services, and other user information,
the suite of applications and opportunities is enormous.

7.4 Global Positioning Systems (GPS)

While SIM cards enable access to personalized information, Global
Positioning Systems (GPS) allow location-specific information to be
delivered to the user. New laws (such as the FCC e-911 requirements) demand that the
location of a mobile device be identified within 50 meters. Global
positioning systems generally use satellites in order to estimate the location
of a device, and communicate that estimate to a chip that resides on the device. The
combination of mobility, personalization and location access provides a
great opportunity for services that will fundamentally change the way
wireless information access is used.

7.5 Streaming High Bandwidth Content

With the high bandwidth available through 2.5G and 3G technologies such as
HSCSD and GPRS, streaming video content to the phone is becoming a reality. This
can transform the manner in which high quality audio and video are
broadcast and can change the way in which mobile communication is done.
Companies like DoCoMo are already working on the fourth generation of
networks (4G), with expected transmission speeds of 100 Mbps downlink
and 20 Mbps uplink. This is about 260 times faster than the recently
introduced 3G wireless network.

7.6 Always On: I-mode

I-mode is an always-on, packet switched service for the Internet provided
by Nippon Telephone and Telegraph DoCoMo (NTT DoCoMo). The
service was launched in early 1999, and has been fairly successful. The main
advantage for NTT is the 3G bandwidth combined with a packet switched
network (like the Internet). The DoCoMo handsets have a cHTML browser
built in, and are in color. In the I-mode model, the user is always connected
to the service, but is charged only for the amount of data that is actually
downloaded. I-mode also has 256-color capability, and does not need any
special server or gateway.

7.7 Device Features

With the rapid advancement in the development of mobile
devices, with enhanced features appearing almost daily, the opportunities
of mobile technology are just at the beginning stages. [WWW Mobile
sites] lists a few mobile sites (note these are normal sites that also
provide instructions on how to reach them on a mobile device),
and sites of mobile or device manufacturers. An example is one of the
recent releases from Samsung: a CDMA mobile phone with a color
display and an embedded charge-coupled device (CCD) camera.

8. CONCLUSION

Mobile technology is growing rapidly both in terms of device capability
and applications. The ability to access information anywhere, anytime, on a
number of devices is a revolution in communication. The wireless web is built
on both the mobile network infrastructure and application
development suites. WAP is one such protocol suite that enables the
transmission of web content in the form of wireless markup language. It is
important to have a single information format and different styles or formats
of rendering this information. Short messaging service is another technology
that enables the transmission of messages to and from mobile devices. With
increased bandwidth, voice activated capabilities, enhanced display screens,
and location-specific information, mobile technology coupled with the
information on the Web provides a powerful mode of communication.

FURTHER READING

[GSM-IEC] is a good tutorial on GSM architecture. The WAP forum
[WAP] is the authoritative source of WAP and WML specifications. The
W3C also has many submitted standards and is a good source for ongoing
standardization efforts in this arena, and maintains a repository of earlier
work submitted for consideration by W3C (such as [cHTML], and
[HDML]). Furthermore, working groups such as the W3C Voice Activity
group publish emergent standards such as VoiceXML 2.0 [VoiceXML]. SMS is
well covered in tutorials from [SMS-IEC], while [Gutherey 2001] covers the
underlying protocols comprehensively. There are many books that cover
WAP, and especially WML, in depth, such as [Foo et al 2000] and [Frost 2000].
[Hori et al 2000] and [Kassinen et al 2000] represent sample recent work on
the generation of wireless markup content, including transcoding from
HTML. GPRS and 3G technologies are discussed in [Andersson 2001].
[WWW Mobile sites] covers an interesting set of applications, and examples
of device manufacturers' sites which showcase upcoming devices.

EXERCISES
1. Answer "True" or "False" for each of the following:
a) WML and HDML are both XML based.
b) HTML and cHTML are not related.
c) WML is a markup language supported in WAP.
d) Communication between the mobile device and the WAP
gateway is using HTTP.
e) SMS guarantees delivery of the message to the mobile device.

2. Write a WML page that allows you to browse through a list of courses
that you have taken this semester. Upon selection of a particular course, the
location of that course must be displayed on the screen. This assignment will
require the download of a WML phone simulator (e.g. from Unwired Planet
or Phone.com).

3. The following example structure represents the tabulation of inventory


in a certain company:

Example of an XML Inventory.


<Inventory>
<Department>
<Name>
Sales
</Name>

<Computer>
11
</Computer>

<Fax>
2
</Fax>

<Printer>
3
</Printer>

<Copier>
1
</Copier>
</Department>
</Inventory>

a) Write a program that generates HTML pages listing the set of
departments, given an XML document as shown in the table. The user can select a
department, and press a button to display the inventory list for that
department. Choosing a particular inventory type displays the number of
items of that inventory in that department.

b) Write a style sheet that transforms the XML to WML.

c) Write a simple transcoding program that takes the HTML generated in (a)
and converts it to WML.

4. Write a program that simulates an SMS alert generator. This alert generator
periodically refreshes values of stock tickers from a file, and generates an
alert to the set of users when the value of the stock increases by 20%.
Simulate the SMS by using email to deliver the message.

5. Assume that there is a database with the following information:


Restaurant Name
Restaurant Type
Location (assume location as a co-ordinate in an integer 25x25 grid)

Write a program to generate WML pages for browsing restaurant lists.

6. For each simulated request in (5), map the device request randomly to a
location in the 25x25 integer grid. Based on this location, generate a WML
page that lists the restaurants located only at that location.

7. What are the advantages of the WAP architecture? Why is a proprietary


transmission format used between the device and gateway? Contrast WAP
with alternate approaches.

8. An important aspect of servers generating wireless markup is the storage of
state information. Assuming that each device is identified by a unique
deviceID, and this identifier is accessible with every request, discuss how
state information can be stored on the server side. Are there any limitations
to the approaches presented to store session state information? How would
clients that get disconnected while roaming be handled?
Chapter 9

Web Services

Truth loves its limits for there it meets the beautiful

Abstract: Can the notion of a web service be abstracted further? Can there be a dynamic
notion of service registration and discovery in order to use and combine
different services from different parties? What are the emergent standards that
allow such a degree of interoperability in Web services? Transport layer
protocols such as SOAP and service description languages such as WSDL are
discussed in this chapter.

Keywords: Simple Object Access Protocol (SOAP), Universal Description Discovery and
Integration (UDDI), Web Services Description Language (WSDL), .NET

1. INTRODUCTION

With the proliferation of a wide variety of web applications, do we
need to bind a service to a particular site? For instance, a company
can provide a stock quote service that can be integrated by other
companies. In fact, underneath the covers, companies do share services
and information in order to present them on their web sites. For instance, a
bank may provide a financial transaction service that is integrated with
another shopping service provider on their web site. Currently, a number
of such integrations are achieved using a diverse set of technologies,
and no specific standards.

The Web service effort essentially is an attempt to streamline the
integration standards that allow interoperability of services on different
platforms, for tight integration with applications. Thus, one company
may provide a stock quote service. Another web site vendor or a
wireless application developer can invoke this service to display stock
quote information on their site or application client. In addition to
enabling easier integration, such standards will also provide revenue
generating opportunities in the form of service licensing, pay-per-use or
other business models. The main point is that the web service
specification allows the ability to discover services, build applications
using such services quickly, and execute these services in a platform-
independent manner.

2. OVERVIEW OF ARCHITECTURE

At the heart of e-services are the following fundamental issues:

a) Remote execution
b) Platform independence
c) Ease of integration

The above aspects are not new, especially in light of the huge growth and
interest in the Web over the last ten years. Remote method invocation,
distributed computing using DCOM10, and CORBA11 have been around for a
while now. However, some of these protocols cannot be supported over
HTTP. Sun presented Java as a means of achieving "write once, run
anywhere" by introducing the concepts of a bundled JVM12, just-in-time Java
compilers, and platform independent executable byte codes. DCOM objects
exposed their methods, which can be integrated through IDL (Interface
Definition Language). In light of all the earlier efforts on remote execution,
protocols such as HTTP, integration between multiple application servers
that are distributed all over the world, and platform independent execution,
what is new in the notion of "web services"? How can web services
add value to existing frameworks, and what areas of opportunity exist in
this area?

Inherently, there is nothing dramatically new about web services. A focus
on standardization and interoperability, coupled with highly advertised
efforts from major players like Sun, IBM, Hewlett-Packard, and Microsoft,
in some sense gave second birth to the same ideas that have existed for a
long time. On the flip side, such standardization efforts will shape and
influence the future of the Web as we see it today. Various platforms in
support of web services are prevalent today, such as IBM's WebSphere,
Microsoft's .NET initiative, and Sun's Open Net Environment (SunONE™).

The goal of a web service infrastructure is to simplify end-to-end
integration and interoperability. This is achieved by abstracting the e-service
architecture into four components:

• Discovery
Discovery is the mechanism by which services make
themselves known. This is typically achieved by having the
service registered using UDDI (Universal Description,
Discovery, and Integration). UDDI defines a way to
publish and discover information about Web services (in
some sense analogous to DNS for mapping domain names
on the Web).

• Description

10 DCOM stands for Distributed Component Object Model, defined by Microsoft for the
invocation of applications in a distributed, networked environment.
11 Common Object Request Broker Architecture (CORBA) is an architecture and
specification for creating, managing and distributing objects in a networked environment.
12 JVM is an acronym for the Java Virtual Machine, which enables the execution of platform-
independent Java bytecode.

While UDDI serves to manage the registered Web
services, it is important to have a standard method of
describing the Web service itself. This is typically done
using XML, an emerging standard being the Web Service
Description Language (WSDL). WSDL is a general purpose
XML language for describing the abstract interfaces and
protocol bindings of Web services distributed across the
Web and accessed in an end-to-end manner.

• Transport
The communication method between the user of the
service and the service itself is called transport. Simple
Object Access Protocol (SOAP) is an emergent standard
defining the protocol for the exchange of information in a
de-centralized, distributed environment.

• Environment
The runtime in which the web services execute is
referred to as the environment. Typically, Just-In-Time (JIT)
compilers convert machine independent code to machine
dependent code in order to generate executable application
code.

UDDI
SOAP
XML
HTTP, TCP/IP

Figure 40. Web services protocol Stack

The interoperability stack can be defined as that layer of the emergent
Web infrastructure that leverages standards-based technologies such as
TCP/IP, HTTP, XML, and SOAP to create a clearly defined service
discovery protocol in conjunction with a uniform service description format,
as shown in Figure 40.

3. UDDI

Universal Description Discovery and Integration (UDDI) makes it easy for
businesses to share information, publish preferred means of doing business,
and interoperate with partners over the Internet. In particular, UDDI provides
a framework where businesses can describe their services and processes in a
global, open environment.

At the core of UDDI is the UDDI business registration, which is
essentially an XML file used to describe a business and its Web services.
Typically, the UDDI entry provides not only information about the company's
contact and address information but also, more importantly, the services that
the company offers over the Internet that can be quickly integrated with
partner applications.

Thus, any application that wants to use a web service for a specific
functionality will have to go to the UDDI Business Registry and locate
information about such services. Conversely, companies that want to make
their services available will need to register with the UDDI registry.

It is important to note that UDDI provides a standardized format for
programmatic business and service discovery. It does not allow complex
tasks such as searching for vendors that offer a particular service within a
price range and operational on a specific platform.

The core component of the UDDI entry is an XML schema that defines
four kinds of information: business information, service information, binding
information, and service specifications. These are identified by the businessEntity,
businessService, bindingTemplate, and tModel elements respectively. The
bindingTemplate generally specifies the information required to actually
invoke the service, while the technical details are accessible via the tModel,
which is metadata about a specification including its name, publishing
organization, and URLs to the specifications.
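As a schematic illustration of how these four kinds of information nest (abbreviated, with placeholder keys and a hypothetical stock quote service), a registration might look roughly as follows:

<businessEntity businessKey="...">
  <name>Example Corp</name>
  <businessServices>
    <businessService serviceKey="...">
      <name>Stock quote service</name>
      <bindingTemplates>
        <bindingTemplate bindingKey="...">
          <!-- where the service can actually be invoked -->
          <accessPoint URLType="http">http://example.com/quotes</accessPoint>
          <!-- pointer to the tModel holding the technical details -->
          <tModelInstanceDetails>
            <tModelInstanceInfo tModelKey="uuid:..."/>
          </tModelInstanceDetails>
        </bindingTemplate>
      </bindingTemplates>
    </businessService>
  </businessServices>
</businessEntity>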

The UDDI specification includes the programmers' API, which is composed
of two parts: the Inquiry API and the Publishers' API. The Inquiry API enables
construction of programs that let you search and browse information found
in a UDDI registry, and also provides information on failure of web service
invocations. The Publishers' API allows the programmer to maintain the
businessEntity or the tModel structure for a particular service. Last but not
least, an important aspect of UDDI is that access to publish or change
information within the UDDI business registry is restricted to authorized
individuals.

Locate businessEntity information --> extract bindingTemplate --> use tModel to get programmer integration details

Figure 41. Web service integration procedure.

4. SOAP

Simple Object Access Protocol (SOAP) is a framework that allows one
program to invoke service interfaces across the Internet without needing a
shared distributed object infrastructure. Again, SOAP is an XML based
protocol, and consists of the following parts:

a) Envelope
b) Encoding rules
c) RPC specification
d) Binding convention

A SOAP message is an XML document that consists of a mandatory
envelope, an optional header, and a mandatory SOAP body. The SOAP
envelope is the topmost element of the XML document for a SOAP
message. Information can be transmitted across communicating parties,
such as the SOAP sender, SOAP receiver and intermediaries, by specifying
it in the SOAP header. The SOAP body defines the mandatory information that is being
transmitted from the SOAP sender to the SOAP receiver.

The SOAP encoding style indicates the serialization method to be used
for exchanging instances of application-defined data types. SOAP header
attributes indicate how the SOAP receiver should process an incoming
SOAP message. The mustUnderstand attribute of a SOAP header block indicates
whether the destination SOAP node must process the header (or fail if it
cannot), or whether processing is optional. Mandatory information that needs to be
transmitted to the SOAP receiver is exchanged via the SOAP body. Error
conditions are handled by the specification of the SOAP Fault within the
SOAP body.

An example of a SOAP request is shown below:


<?xml version='1.0'?>
<env:Envelope xmlns:env="http://www.w3.org/2001/12/soap-envelope">
<env:Body>
<GetInventory>
<Item>Laptop computers</Item>
</GetInventory>
</env:Body>
</env:Envelope>

The first thing to note is that SOAP is in XML. The Envelope is the
root element of a SOAP message. The SOAP header can contain additional
information about the message. This is followed by a SOAP body which
carries the actual SOAP message. In the above example, we can see that
application specific elements (i.e. GetInventory) are embedded in the SOAP
body. SOAP also provides a mechanism of handling fault situations that can
arise in the handling or processing of messages. This is returned in response
messages using the fault elements:

<env:Fault>
<faultcode> env:Receiver </faultcode>
<faultstring>Processing error</faultstring>
</env:Fault>

Since SOAP header blocks are not a mandatory part of the SOAP message, they may be ignored
by the SOAP processor. However, a SOAP processor can be told that a
header block must be processed by setting the "mustUnderstand" attribute
on the header element to true.
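For instance, a header block along the following lines (the Transaction element and its namespace are purely illustrative) obliges the receiving node to either process the block or return a fault:

<env:Header>
  <t:Transaction xmlns:t="http://example.com/transaction"
      env:mustUnderstand="true">42</t:Transaction>
</env:Header>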

5. PLATFORMS

The fourth component of a web service is the platform. Typically a
platform needs to have the following features in order to be a successful
web service platform:
a) Interoperate using HTTP/SOAP
b) XML based schema
c) Language neutrality (e.g. using a Just-In-Time compiler, and machine
independent byte code)

Microsoft has released versions of its own web services platform called
.NET as a part of the Windows XP initiative. .NET includes a modified C
language called C#, which allows programmers to tie into the .NET features.
The .NET platform is bundled with a Just-In-Time compiler and integrates
with MSIL (Microsoft Intermediate Language). This is not a new concept
and is similar to the Java bytecode "compile once, run anywhere" effort. C#
is strikingly similar to Java in features such as XML support, implicit
garbage collection, and type-safe variables.

Furthermore, a suite of web services will be bundled into the Windows
.NET platforms, called Hailstorm. Hailstorm services will include a variety
of basic applications such as user authentication (Passport), security, and
data models common to all Hailstorm applications. Other services in
Hailstorm include calendar, location, messaging and profile information. As
the suite of services grows, it could very much cover all the common
applications developed all over the Web, currently offered through Web sites,
and through alternative devices and phones. Other competitors such as Sun
and IBM have web service platforms, in addition to their suites of services.

6. EXAMPLE OF A SERVICE

At the other end, a large collection of web services is burgeoning around
the Web. As an example, http://www.xmethods.com/ provides a listing of
a number of web services, ranging from FedEx/UPS package tracking
services, Google Search, SOAPSMS, and weather services to price watchers on
auctions.

What are the steps involved in creating and using web services? We will
illustrate the development of a web service using the Java platform as an
example. The components we need are a web server (e.g. the Apache web
server) with support for the SOAP protocol. An XML parser is also necessary in
order to parse XML messages (for example, WSDL documents). Figure 42
shows a block diagram of the sequence of steps involved in providing and
requesting a web service.

Service Provider: publishes its service to the UDDI registry (SOAP).
Service Requester: looks up the service in UDDI (SOAP), fetches the WSDL service description, and invokes the web service (SOAP).

Figure 42. Overview of Web Service Usage.

6.1 Steps in the creation of a Web service:

The steps that a web service provider should perform in order to make a
web service available are:

• Implement Web service functionality
• Register Web service with the UDDI service
• Publish WSDL of service for access
• Provide SOAP access to Web service (e.g. via web server)
• Wait for SOAP requests from clients
• When a client request arrives, process it and return a SOAP
response.

6.2 Client invocation of Web service:

In order to use a web service, a web service requester should perform the
following steps:

• Send a SOAP request to UDDI to look up the web service
• Retrieve the WSDL for the web service
• Invoke the web service with a SOAP request with the
appropriate input parameters.
• Process the SOAP responses (e.g. handle SOAP
exceptions), and use the output parameters in the response.

In WSDL documents, services are defined as groups of network
endpoints or ports. This abstract notion of endpoints is distinguished from
the actual data formats and network protocols. Messages describe the data
exchanged at an abstract level, port types describe the operations, and a
binding specifies the data formats and protocol level details. As an example,
let us consider the WSDL document shown in Table 52.

The web service description is an XML document, as can be seen in the
example (Table 52). The element 'definitions' embeds all the information
about the service. The end point of the service is specified within the
<service> element. Thus, the location where the service is accessible is
specified in the <soap:address> tag with the location attribute. The types of
the input and output data can be specified in the <types> element, which are
then embedded in the <message> elements to describe the various messages
that can be passed between the client and the service provider. In the
example above, we have defined two messages, 'ExampleWebSearchInput'
and 'ExampleWebSearchOutput'. The semantics of the message passing is
described in the <portType> element; thus, 'ExampleWebSearchType'
includes an operation called 'GetSearchResults' that uses
'ExampleWebSearchInput' as the input message and
'ExampleWebSearchOutput' as the output message. The <binding> portion
of the document specifies the encodings and the transport. In the example
above, the transport is HTTP, although other means of transport include
HTTPS and SMTP.

Table 52. Example WSDL document


<?xml version=" 1.0"?>
<definitions name="ExampleSearchService"
targetNamespace="hnp://example.comlexample.wsdl"
xmlns:tns="http://example.comlexample.wsdl''
xmlns:xsd1 =''http://example.comlexample.xsd"
xmlns:soap=''http://schemas.xmlsoap.org/wsdllsoap/''
xm1ns="http://schemas.xmlsoap.org/wsdll''>

<types>
<schema targetNamespace="http://example.comlexample.xsd''
xmlns=''http://www.w3.org/2000/10IXMLSchema">
<element name="SearchlnputQuery" type="string"l>
<element name="SearchOutput" type="string"l>
<lschema>
<ltypes>

<message narne="ExampleWebSearchInput">
<part name="body" element="xsd1:SearchInputQuery"I>
</message>

<message name="ExampleWebSearchOutput">
<part name="body" element="xsd1:SearchOutput"I>
</message>

<portType name="ExampleWebSearchType">
<operation narne="GetSearchResults">
<input message="tns:ExampleWebSearchInput"l>
<output message="tns:ExampleWebSearchOutput"1>
<loperation>
</portType>

<binding name="SearchServiceBinding" type="tns:ExampleWebSearchType">


<soap:binding style="document" transport=''http://schemas.xmlsoap.org/soaplhttp''l>
<operation name="GetSearchResults">
<soap:operation soapAction=''http://example.comlGetSearchResults''l>
<input> <soap:body use="literal"1> </input>
<output> <soap:body use="literal"l> </output>
</operation>
</binding>

<service name="ExampleSearchService">
<documentation>Example Search Service <ldocumentation>
<port name="SearchServicePort" binding="tns:SearchServiceBinding">
<soap:address location="http://example.comlexamplesearch"1>
<lport>
<lservice>
<ldefinitions>

For the above service, the client needs to first locate this service, retrieve
the WSDL document (shown in the table), and build SOAP messages with
the appropriate input values. Then the client invokes the service (for
example by sending the SOAP message over HTTP to the location specified
in the service location), and evaluates the SOAP response (if any) from the
service provider.
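A minimal client sketch for this service is shown below, posting the SOAP request over HTTP with the standard java.net API; the endpoint URL and SOAPAction value are taken from the WSDL in Table 52, and the payload element names are assumptions consistent with that example:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;

public class SoapClient {
    public static void main(String[] args) throws IOException {
        // SOAP envelope carrying the GetSearchResults input message.
        String soapRequest =
              "<?xml version='1.0'?>"
            + "<env:Envelope xmlns:env='http://www.w3.org/2001/12/soap-envelope'>"
            + "<env:Body><GetSearchResults>"
            + "<SearchInputQuery>web services</SearchInputQuery>"
            + "</GetSearchResults></env:Body></env:Envelope>";

        // Endpoint taken from the <soap:address> element of the WSDL.
        URL endpoint = new URL("http://example.com/examplesearch");
        HttpURLConnection conn = (HttpURLConnection) endpoint.openConnection();
        conn.setDoOutput(true);
        conn.setRequestMethod("POST");
        conn.setRequestProperty("Content-Type", "text/xml; charset=utf-8");
        conn.setRequestProperty("SOAPAction", "http://example.com/GetSearchResults");

        // Send the SOAP request body.
        try (OutputStream out = conn.getOutputStream()) {
            out.write(soapRequest.getBytes("UTF-8"));
        }

        // Read and print the SOAP response (or fault) from the provider.
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), "UTF-8"))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}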

7. LIMITATIONS

There are many issues that need to be addressed in the current
framework of web services. In this section, we outline some of these
issues:

• Network reliability: The web service infrastructure depends on
network reliability. Consider the following example: a
client invokes a service. The server sends an ACK, but this fails
to get returned to the client on time. The client side times out,
and resends the service request. Such duplication of service
requests may result in unwanted duplication of the service
itself.
• Atomicity of web services: A web service or operation is
usually a single step in a whole process of services. Such
sequences of operations can be called "web conversations".
How do we enforce constraints on the order and dependencies
of such distributed web services? How do we enforce atomicity
of sequences of operations, such as requiring that operation A be
followed by operation B, with any failure voiding
the whole sequence of operations? Such complexities in a
distributed environment are still open problems in web service
architectures.
• Dynamic discovery of services: How do we build a system
where the services available can be discovered dynamically? The
key aspect of such a problem is the semantics and
standardization of the service definition. WSDL describes the
functionality of the service, and UDDI specifies the business-
centric view of the service, but another layer of specification is
needed for inspecting descriptive repositories about web
services. The Web Services Inspection Language (WSIL) is a
step in that direction. With WSIL, service providers can make
easily accessible descriptions of the services provided by that
entity. Thus WSIL is used only to "advertise" the web services.
The Semantic Web effort [http://www.w3.org/Semantic] is a
more general effort on the problem of adding semantics to
Web resources. Many of the standards we have discussed in
earlier chapters have covered syntactic aspects, but seldom
addressed semantic aspects. The semantic web effort defines
the Resource Description Framework (RDF), a language for
representing information on the World Wide Web. The goal of
RDF is to define a common framework for representing meta
information that can be exchanged between different
applications or agents.

8. CONCLUSION

In this chapter, we have studied the popular notion of a "web service".
By itself, a web service is not anything new, but the standardization efforts
open up the potential of widespread service access to a wide variety of
developers and systems. The key aspects of web services are service
discovery, remote execution, ease of integration, and platform
independence. The UDDI service enables a centralized mechanism for
registering services, and supports service discovery. The descriptions of
services are specified using WSDL documents, which enable the
definition of the operations allowed in the service, the input and output
message types, and the binding of the operations to specific transport
protocols and encodings. SOAP is an emergent standard protocol that
operates on top of HTTP in order to achieve message exchange in a
distributed, decentralized environment.

What is the future of web services, and how will they affect the Web?
Since web services are still at an inception stage, this is difficult to predict.
But it is not hard to imagine a set of specialized function providers who
define services that are clearly specified, and a large set of application
developers who consume the services and integrate them seamlessly with the
applications developed. It also potentially brings together a large
collection of services with a large impact on usability and redundancy of
applications. Thus, there can be a number of applications or web sites, but
they may share user databases from key providers, rather than every site
having its own infrastructure (e.g. for user identification). Another area of
impact would be the mobility of services over a wide variety of platforms
and client devices. What we may also see is a large-scale development of
client applications, rather than just the popular web browser. Although
there is really no fundamental paradigm shift in web services, the impact
of the standardization, and the ease of integration with a vast number of
services, can potentially alter the landscape of the Web as we know it
today.

FURTHER READING
Most of the efforts on Web services are standardization efforts. In that
context, standards published by organizations such as the W3C Web Services
group (http://www.w3c.org/2002/ws) or the Web Services Interoperability
Organization (http://ws-i.org/) are good resources for keeping up to date
with standards such as SOAP and WSDL. Platform specific
implementation guides and books for the leading platforms such as .NET,
SunONE, and IBM's WebSphere are abundantly available, and are good
sources for detailed descriptions of specific programmatic ways to integrate
web services with applications.

EXERCISES
1. Define an example service using WSDL.

2. Build a web server SOAP over HTTP interface to the web service
described in question 1. (You may want to use available software such as
apache SOAP/Java software to achieve this).

3. Implement a simple client that invokes the service, and displays the
result in the form of a web page.

4. Discuss the potential advantages and disadvantages of the web service
framework.

5. It is an interesting problem to devise sequences of web operations that
are bundled into one atomic operation. Discuss how the current protocols
can be extended to specify sequences of service operations that are
distributed across different service providers. Implement a framework that
allows a sequence of two operations as a single service with exchange of
data between the two services. How can atomicity of the combined operation
be enforced?
Chapter 10
Conclusion

I leave no trace of wings in the air, but I am glad I have had my flight

1. REVIEW

In this book, the technology behind building Web applications was
covered. The fundamental areas discussed include data markup, networking,
and information retrieval. Data markup plays an important part in the
development of Web technology. With a vast amount of information being
exchanged across the Web, it is important that the data be structured and be
transformable from one format to another. We have presented the extensible
Markup Language (XML) and the extensible Style Sheets (XSL) as methods
of achieving the above objectives. The advantage of the XML approach is
the formal definition of the structure of the document either using a DTD or
a Schema, which allows syntactic validation of the document. The Hyper
Text Markup Language (HTML) originated from SGML, but has been
streamlined into XML technology, and later versions of HTML (XHTML)
have document structures formally defined using a DTD or Schema, thus
providing a common language for publishing information on the Web. Style
sheets provide a mechanism for defining the presentation layer for the
corresponding HTML document. XSL also serves the purpose of
transforming data representations from one format to another. This is
especially useful in business to business (B2B) exchange of information, in
addition to presentation of XML documents.

In the networking chapter, the concept of an endpoint location identifier was
presented. In the Internet Protocol (IP), addresses are specified using the 4-
byte dotted notation (and 16 bytes in IPv6). The Domain Name System
(DNS) implements a distributed name management system that allows name
servers to map domain names to specific IP addresses. The TCP/IP protocol
suite is the underlying Internet protocol suite that allows computers
connected on a network to communicate with each other. The Web is
architected as a distributed client server infrastructure. The Hyper Text
Transfer Protocol (HTTP) is a common protocol used by clients and servers
to communicate with each other. Proxy servers serve the purpose of caching
documents and distributing the load across many servers. Web security is an
important aspect of the Web. Secure HTTP is a protocol based on the Secure
Sockets Layer (SSL) that enables secure communication between
clients and servers using public and private key encryption schemes. A third
party (e.g. a certificate authority) authenticates the client/server using digital
certificates.

With such a vast amount of information distributed across the Web,
methods of searching and indexing such information are an integral part of
Web applications. Information retrieval provides the base technology for
indexing and searching the Web. The chapter on information retrieval
focussed on text analysis and retrieval methods. The steps of pre-processing
text, creating indices and storage of textual documents were presented.
Methods of searching through the text databases, and ranking of the
retrieved documents were also illustrated. Techniques such as Latent
Semantic Indexing (LSI) are designed to address high-dimensionality of the
document space. Retrieval accuracy is often measured quantitatively using
the precision-recall graph.

On the application side, a number of diverse Web technologies have been


covered. The basic architecture of a Web search system was discussed.
Methods of crawling the Web, and identifying "important" documents, were
illustrated. The architecture of a Web directory system was elaborated, along
with discussions on semi-automatic generation of a hierarchical
classification taxonomy. Methods for improving the accuracy of the
retrieved results using relevance feedback were also mentioned.

With the vast amount of data available on the Web, millions of users
search for and access such information. Furthermore, the Web drives a lot of
other activities such as commerce, listings and media services. Every Web
application collects a large amount of information about users' access
patterns and interests. Tools and algorithms for mining such huge data sets
in order to extract useful patterns and trends are vital. Such extracted
patterns can be applied in a variety of ways, ranging from improving the user
interface, to presenting items of more interest to users, to enhancing the

application itself. The basic algorithms of data mining, such as association
mining, classification, sequence mining and clustering, were presented and
their applications to the Web discussed.

Communication and commerce applications are prominent applications
on the Web. Communication products such as e-mail and instant messaging
drive billions of messages across the Internet every day. Electronic mail
protocols were defined many years ago, but continue to underpin a vital
application on the Web. The common e-mail protocols such as SMTP, POP
and IMAP have been covered. Next, an overview of an Instant Messaging
architecture was discussed. In the second half of the chapter on
communications and commerce, e-commerce technologies were
presented. The set of e-commerce platforms, and standards such as IFX, were
first presented. Next, the basic components that make up an electronic commerce
system were summarized. Components such as billing, reporting, and proper
fraud control were identified as key to building a successful e-commerce
infrastructure.

There has been a lot of interest in the integration of mobility with Web
access. In particular, access to Web content such as e-mail, finance stock
quotes, and news is available on a variety of wireless devices. This enables
access to information anywhere, anytime. As an illustration, we presented
how the mobile infrastructure interoperates with the Web, and discussed the
WAP protocol suite. Wireless markup standards were also presented with an
overview of the Wireless Markup Language (WML). Short messaging
service is also an important aspect of wireless communication, and can be
integrated closely with Web applications. Methods of generating wireless
content from Web servers were also illustrated with examples. Emerging
trends, such as utilization of user location information to enhance the mobile
web service, increased bandwidth, and enhanced user interfaces, are key to
the widespread growth and adoption of this fertile area.

An important type of web architecture that is gaining interest is the
notion of "web services". In the web service framework, services can be
registered in a central registry such as the UDDI, along with service
descriptions. Applications that need to use a service locate the service
in the UDDI, and use transport layer protocols such as SOAP in order to
use the service by remotely invoking it on the server providing
the service. Web services are designed on the principles of platform
independence, remote execution and ease of integration. One of the key
aspects of the web services approach is the standardization of service
description and access, which could lead to a widespread adoption of the web
services infrastructure.

2. SYSTEM DESIGN OVERVIEW

We discussed a wide variety of web applications. Are there any design
principles that resonate among the different architectures? Web applications
follow design principles that govern the development of any large-scale,
robust, fault-tolerant, real-time system. We highlight the key fault-tolerance
aspects of such a design:

• Distributed

A distributed architecture is a key component of large scale deployment.
This is especially true when there is no compelling need for a "solution in a
single box". By distributing the functionality into separate modules, it is
easier to reliably maintain the system and to identify problems in order to rectify
them quickly.

• Redundancy

The system should not have any single point of failure. This is a key aspect
of a live system that should be up 100%13 of the time. For instance, if we
have two servers behind a switch that distributes requests between
the two servers, then the two systems share the load. If one server goes
down, then all traffic is routed to the other server. Once the system that is
down is restored, both continue servicing the end user. Note that, to the
end client who is being serviced, there is no perceived downtime or
unavailability of the service. Thus, it is important to have redundancy and
provide the ability for another component to service the client in case of a
failure.

• Ability to Scale

Another important aspect of building scalable systems is the ease with
which they can scale. Well-designed, functionally distributed systems will be
flexible enough to handle a larger amount of load (e.g. traffic, users or
transactions) simply by adding more hardware, memory, or other
computing resources.

13 Practical uptimes are in the high nineties (e.g. 99.9% availability).

• Monitoring, Logging, and Debugging


In order to keep a 24/7 real-time application up and running, mechanisms
for handling failures, recovery strategies, and escalation procedures in case of
errors need to be established. Furthermore, the ability to monitor certain
servers, requests, or users in order to debug or resolve issues with the system
is important. Logging of access, transaction and other data is useful both for
reporting and for data mining. Real-time monitoring, and automated
mechanisms for the escalation of system problems, also need to be
implemented.

• Efficient transaction handling


Many commerce systems involve transactions. Since any component that
involves the exchange of monetary units is subject to audits and
reporting requirements, it is vital to handle such transactions properly. It is
important that built-in mechanisms for retaining records of the transactions,
and for recovering and rolling back transactions, exist in case of system failures. The
transaction storage should also be distributable, redundant
and recoverable.

• Distributed databases
An important aspect of Web applications is the large amount of data that
needs to be stored, managed, and manipulated, whether it is storage of user
profiles, or shopping inventories and so on. This data should be distributed
in a scalable and replicated manner. Furthermore, redundancy and recovery
in case of failures are crucial aspects of the system design in order to give a
no-downtime experience to the end users. Replication, consistency and
redundancy of databases ensure that data storage and access are not points of
failure in the system.

• Data Mining & Reporting


The Web generates billions of page views, and the storage of such traffic
and user navigation information takes gigabytes of storage. Efficient data
manipulation, storage and mining tools need to be in place in the
infrastructure to examine trends and usage patterns, and these need to be
integrated with reporting modules to generate useful statistics about usage
and transaction details.

3. LIMITATIONS

• Network

Protocols for the Web cater to a point-to-point client/server model.
Thus HTTP is designed for client server communication and not for multi-
party communication or reliable event notification. Furthermore, protocols
like HTTP are stateless, and even though state information can be stored
using cookies or by assigning session IDs stored on the server side, there is no basic
notion of "sessions" on the Web. Another limitation of the current
technology is the focus on delivery of a message from one point to another.
Even though protocols such as TCP provide a mechanism for
retransmission, there is no failsafe mechanism for actually transmitting a
packet from one point to another (e.g. in case of a server failure).

• Representation & Standardization

With XML, there has been enormous progress in the ability of businesses
to communicate with each other, of client applications to transfer data to
servers, and of servers to exchange information with each other. Yet, there is a
lot more that needs to be done. Standards are emerging, and different groups
adopt different standards, making it difficult to converge on a common set of
standards. In some cases, it may not even be meaningful to have a single set
of standards.

• Security

Security is the heart and soul of the Web. Without security, commerce
and personalization on the Web are meaningless. The most common approach
to security is the Public Key Infrastructure (PKI). But PKI is not without
limitations. To quote from [Ellison & Schneier 2000]: "security is a chain;
it's only as strong as the weakest link". What that means is that just
having a valid certificate authority, or a very long key, does not guarantee
security. How do you protect your own private key? How do you trust the
computer that verifies the certificate, which requires only public keys? A
name is generally associated with a certificate, but the security now lies in
knowing that the name corresponds to the person using the certificate. How
does the Certificate Authority identify the holder of the certificate? Another
important aspect is that the user is not a part of the security design.

• Information
How do you search, retrieve, and mine information contained in over two
billion documents and growing? Is it even reasonable to expect a set of
relevant web documents in response to a small query? What's the next step
beyond just text processing of information on the Web? How can we classify
documents, associate some "usefulness" or "semantics" with them, and provide
the ability to make certain documents relevant for specific users with
specific queries? The Semantic Web effort addresses the markup of
documents on the Web not only for human observation and analysis, but also
for machine agents. In order to expand the Web so that automated agents can
find and extract "meaningful" information from documents, two
frameworks are added on top of the XML framework. The first is the Resource
Description Framework (RDF), which allows the encoding of "meaning". The
second is "ontologies"; an example of an ontology is the definition of a
taxonomy and a set of inference rules relating the objects in the taxonomy.
Many of these efforts are still in the nascent phase, and there are still many
unexplored issues in these areas.
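
As a small illustration (the document URL and property values are
hypothetical; the vocabulary is the Dublin Core element set), RDF
statements about a web document can be encoded on top of XML as follows:

<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dc="http://purl.org/dc/elements/1.1/">
  <rdf:Description rdf:about="http://www.example.org/report.html">
    <dc:title>Annual Report</dc:title>
    <dc:creator>Example Corporation</dc:creator>
  </rdf:Description>
</rdf:RDF>

Each description is a machine-readable assertion about a resource, which is
what allows automated agents, rather than only human readers, to consume
the markup.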

4. THE FUTURE

Where is Web technology heading? By now, you must be convinced that
the Web is here to stay. There is no doubt that the Web is an integral part of
our life, and will continue to be a bigger part of our future. The actual form
and level of the Web will change, but the idea of a highly interconnected
digital society is a reality, and will continue to be so for a long time. In this
last section, let me share some personal thoughts about the future directions
of the Web.

• High Degree of Personalization
• Intelligent Filtering
• Proliferation of service-based applications
• Centralized storage of user information
• Enhanced privacy and security
• Integrated e-commerce
• High-bandwidth content
• Rebirth of mobile access and Internet appliances
• Intelligent Applications

We are already seeing an increased amount of personalization of
information on the Web. In the initial stages, users were excited about the
large amount of information accessible through the web. They could search
and analyse all the information and extract what best suited their needs.
But with the rapid growth of information and services on the Web, users will
be overwhelmed. The current personalization schemes, such as maintaining
personalized sets of interests, portfolios, and so on, are just the first steps
in tailoring information to the user's needs. Such personalization will go
much further, toward a high degree of individually tailored content and
presentation of information and services.

The next area of promise is the "intelligent information filtering" concept
that we discussed in the Web Mining chapter. Information should be
filtered "intelligently" in order to be genuinely useful and relevant, and to
make it easy for users to access information or get services from the Web.
The problem of taking a vast amount of general and user-specific
information, and combining these various sources into useful models or
filters, is a challenging one. In addition to the noisy samples in the data, the
huge number of parameters, and the lack of user-specific training data, it is
often difficult to decide what a useful or correct solution is. Imagine a search
system that not only retrieves information based on the user's query but also
filters the results based on a user-centric model. If done right, this could have
a profound effect on the precision/recall of retrieval systems.

As we discussed, standardization efforts in inter-operability and ease of
integration with services on the Web are underway. With highly anticipated
rates of consumer adoption, we can expect a proliferation of applications
that tie together and present various forms of web applications in novel
ways. Many will be bundled into the device or operating system itself. The
browser will remain a central application for accessing basic information,
but a number of applications with specialized functionality will dominate
user access, much like an instant messaging client today. How does such a
transition affect the notion of a portal? The distinction between a personal
computing device, a software application, and a portal or its services will
fade away. The Web will be transparently integrated with any computing
environment, even at the operating system level.

With the adoption of web services, we will see more centralized efforts
towards maintaining, storing, and protecting data such as user registration
information, wallet information, and personalization models. The notion of
every web site maintaining its own user registration database will be
overridden by a set of specialized backend web services that "maintain
customers". We are already seeing attempts at consolidating centralized
services in the web service efforts. Perhaps legislative changes and
governmental adoption of the Web may even help standardize the notion of a
user identity on the Web (albeit privacy protection advocates may worry
about the security of such information).

The need for centralized data storage will push the technology for
enhanced security and privacy measures. Perhaps the solution may also lie
outside cryptography, such as biometric techniques for enhanced security.
On the privacy front, client applications will become more sensitive to the
type and content of information that is exchanged. It is likely that much of
the client or user tracking data will have to be managed on the server side,
and client-side information such as cookies may need to be extensively
enhanced in order to cope with privacy and security requirements.

Commerce will also need to transform into an integrated set of services,
perhaps once again centralized and transparent in some fashion. On the user
front, users will prefer a personalized system, more influenced by
techniques such as collaborative filtering, with dynamic web site
organization of commerce items tailored to the users' interests and
preferences. Systems will become more robust in their ability to detect and
eliminate fraud, provide better quality of service, and integrate tightly with
traditional brick-and-mortar shopping. Bidding systems are already popular,
and such approaches will continue to sustain interest with a large population
of buyers and sellers.

Streaming and content-on-demand are already very popular applications
on the Web. With the rapid growth of high-bandwidth networks, the adoption
of services such as DSL (Digital Subscriber Lines), and emergent high-
bandwidth technology in wireless networks, the Web is established as an
integral part of media services. Slowly but steadily, the Web as a platform
for broadcasting high-bandwidth streaming content is gaining a lot of merit,
along with the availability of cheap devices such as cameras for personal
computers.

With the vast number of devices already connected to the Web in one
manner or another, and with the compelling need for, or convenience of,
access to web information anywhere/anytime, there will definitely be a
resurgence of web-centric mobile devices. They may not be tied to Web
portals or browsing per se, but rather built off general web services available
for a variety of tasks. With technological trends such as i-mode (always-
connected Internet mode), GPS systems, and high-bandwidth mobile
networks, the adoption and integration of a vast number of devices with
"web services" will be an integral part of future Web technology.

We have predicted a future of distributed, transparent, "intelligent", and
personalized services on the Web. Furthermore, web service discovery,
intelligent routing (for service search and discovery), and semantic web
efforts on automated agent "understanding" can transform the Web as we
know it into a whole new paradigm: "The Synaptic World Wide Web"
[Sarukkai 2002]. By the synaptic web, I refer to the idea of a large-scale
distributed computing system that exhibits a large amount of local and
global adaptability, much like a dynamic adaptive system, rather than a static
information retrieval system.
Appendix
A. Sample Server Program

A server is essentially a program that listens for incoming connections at
a particular port on a specific machine in a network. The steps in writing a
server program are as follows:

• Create a socket
• Bind that socket to an address/port
• Listen for client requests
• Upon a client request, spin off a thread or a process to handle that
client.
• This client process can read and write to the socket, as needed.
• Repeat the above steps.

The code for a simple server is shown below:

/*
Simple server program in C
Platform: UNIX
To compile, type:
gcc server.c -o server
To run the server, just type 'server'
*/
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netdb.h>
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <strings.h>
#include <unistd.h>

int ServeRequest(int clientsock);

int main()
{
    int sock, clientsock;
    struct sockaddr_in serverAddress;
    int PortNo = 0;
    int child;

    /* create socket */
    if ((sock = socket(AF_INET, SOCK_STREAM, 0)) < 0) {
        perror("Server cannot create socket.");
        exit(1);
    }

    /* Set Port No to 24500 */
    PortNo = 24500;

    /* serverAddress holds the address and port information */
    bzero((char *) &serverAddress, sizeof(serverAddress));
    serverAddress.sin_family = AF_INET;
    serverAddress.sin_addr.s_addr = htonl(INADDR_ANY);
    serverAddress.sin_port = htons(PortNo);

    /* Bind this address and port to this socket */
    if (bind(sock, (const struct sockaddr *) &serverAddress, sizeof(serverAddress))) {
        perror("bind failed.");
        exit(1);
    }

    /* This sets the listen queue for this socket to a maximum of 5 clients */
    listen(sock, 5);

    /* Server now waits indefinitely for client requests */
    for (;;) {
        /* Is there an incoming client request? */
        if ((clientsock = accept(sock, 0, 0)) < 0) {
            perror("client socket failed.");
            exit(1);
        }

        /* For each client request, fork to have another process service that client */
        /* Other options include spawning a separate thread */
        if ((child = fork()) < 0) {
            perror("Error create child.");
            exit(1);
        }
        else if (child == 0) {
            /* if child is 0, then this is the process servicing the request */
            close(sock); /* Doesn't require parent socket */
            ServeRequest(clientsock);
            close(clientsock);
            exit(0);
        }

        /* This is the main server process code */
        close(clientsock);
    }
    close(sock);
    return 0;
}

#define BUFFERLEN 40
int ServeRequest(int clientsock)
{
    char buff[BUFFERLEN];
    int msgLength;

    bzero(buff, BUFFERLEN);
    /* Receive message from client */
    if ((msgLength = recv(clientsock, buff, BUFFERLEN, 0)) < 0) {
        perror("Error receiving data");
        exit(1);
    }
    printf("Got this message from client: Message was: %s\n\n", buff);
    return 0;
}

B. Sample Client Program

A client does the following steps:

• Create a socket
• Connect to a server at a specific port
• If the connection was successful, send or read data as required.

The code for a simple client in C is shown below:

/*
Simple client program
To compile on UNIX, type:
gcc -lsocket -lnsl client.c -o client
To run, type:
client <hostname> <port>
The message "This is a test" will be sent to the server.
*/
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netdb.h>
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <strings.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    int sock;
    struct sockaddr_in serverAddress;
    struct hostent *hstp;
    const char *message = "This is a test";

    if (argc < 3) {
        printf("USAGE: client hostname port\n");
        exit(1);
    }

    if ((sock = socket(AF_INET, SOCK_STREAM, 0)) < 0) {
        perror("Error getting socket.");
        exit(1);
    }

    bzero((char *) &serverAddress, sizeof(serverAddress));
    serverAddress.sin_family = AF_INET;

    /* Get network host entry */
    hstp = gethostbyname(argv[1]);
    bcopy(hstp->h_addr, (char *) &serverAddress.sin_addr, hstp->h_length);
    serverAddress.sin_port = htons(atoi(argv[2]));

    /* Connect to server */
    if (connect(sock, (struct sockaddr *) &serverAddress, sizeof(serverAddress)) < 0) {
        perror("Client error connecting to server.");
        exit(1);
    }

    /* Send test message */
    if (send(sock, message, strlen(message), 0) < 0) {
        perror("Error sending message.");
        exit(1);
    }

    close(sock);
    printf("Successfully sent test message\n");
    exit(0);
}

Socket programming is supported in most languages, such as Java. For
example, the java.net package contains socket objects and functions.
Another function that is useful in waiting for data over sockets is the "select"
function. The select function allows a process to wait on multiple file
descriptors14. Thus, multiple client sockets can be created (using accept), and
the select function can be used to query the status of multiple file descriptors
at the same time. The select function times out if there are no changes to the
descriptors waited upon.

14 In UNIX, a socket connection is also a file descriptor.
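
A minimal sketch of using select is shown below (a fragment, not a
complete program; sock and clientsock are assumed to be descriptors created
as in the server program above, and <sys/select.h> or <sys/time.h> provides
the declarations):

/* Wait up to 5 seconds for data on either of two descriptors */
fd_set readfds;
struct timeval timeout;
int maxfd, ready;

FD_ZERO(&readfds);            /* clear the descriptor set */
FD_SET(sock, &readfds);       /* watch the listening socket */
FD_SET(clientsock, &readfds); /* watch a client socket */
maxfd = (sock > clientsock) ? sock : clientsock;

timeout.tv_sec = 5;
timeout.tv_usec = 0;

ready = select(maxfd + 1, &readfds, NULL, NULL, &timeout);
if (ready == 0) {
    /* timed out: no descriptor changed state */
} else if (ready > 0 && FD_ISSET(clientsock, &readfds)) {
    /* recv on clientsock will not block now */
}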



C. Apache Software
One of the most commonly used servers15 is from the Apache Software
Foundation (http://www.apache.org). Some of the key software includes the
HTTP web server, Jakarta (server-side software for Java platforms),
mod_perl (which allows Apache modules to be written in Perl), PHP (a server-
side scripting language that can be embedded in HTML for dynamic content
generation), and the XML Project (Xerces, which contains the XML parser,
SOAP, and Xalan). Instructions for downloading, compiling, and installing the
HTTP server can be found on the website.

Once compiled, the Apache server can be started on the machine. The
Apache UNIX HTTP server program is called "httpd". The server can be
started by simply executing this command. The httpd.conf file specifies
various directives to the Apache server. For instance, the 'Listen' directive
indicates which port the server should listen on. If successful, then requesting
the page "http://localhost/" with a web browser on the machine you are
running the server on should show a valid Apache page. Similarly, the
Apache server is configurable on Windows platforms.
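
For example, a fragment of httpd.conf might contain the following
directives (the values shown are illustrative defaults, not a recommended
configuration):

# httpd.conf fragment
Listen 80
ServerName localhost
DocumentRoot /usr/local/apache/htdocs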

The Apache XML parser is available at http://xml.apache.org/ . The
Xerces XML parser is available in C++, Perl, or Java, and can be
downloaded and installed. This is useful for exercises on XML parsing, or
anywhere XML document structures need to be parsed and specific parts
extracted. An example is the SAXCount.java program, where an XML parser
object is created by extending the SAX handler base. The callback functions
for the SAX parser can be used to print the document content. The DTD can
be specified in the XML document, and will be loaded. The parse() method
will use the DTD structure to parse the input XML file.

15 Some surveys estimate that 54% of the servers on the Web use Apache.
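
As a minimal sketch of this callback style (not the SAXCount program
itself; the class below and its behaviour are illustrative), a handler can count
elements as the parser walks the document:

import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

public class ElementCounter extends DefaultHandler {
    private int count = 0;

    /* Called by the SAX parser for every start tag */
    public void startElement(String uri, String localName,
                             String qName, Attributes attributes) {
        count++;
    }

    /* Called once when the whole document has been parsed */
    public void endDocument() {
        System.out.println("Elements seen: " + count);
    }

    public static void main(String[] args) throws Exception {
        SAXParserFactory.newInstance().newSAXParser()
            .parse(new java.io.File(args[0]), new ElementCounter());
    }
}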

D. Example CGI Program

This section describes how to create a simple CGI program in Perl with
the Apache server. In the httpd.conf file, we can add some directives to enable
CGI requests to the server. The directive:

ScriptAlias /cgi-bin/ /usr/local/apache/cgi-bin/

instructs the Apache server to map any request with the prefix /cgi-bin/ to
the path /usr/local/apache/cgi-bin/. Thus, the request http://localhost/cgi-
bin/test.pl will actually result in invoking the CGI Perl program that resides
in /usr/local/apache/cgi-bin/test.pl. The test.pl Perl program will then
process the input request and generate (perhaps dynamic HTML) pages for
display to the user.

The two basic things that your CGI program must do in order to be
properly processed are to return a MIME header, and then write out the
response in HTML. The MIME header indicates to the client browser what
type of data to expect. A common header is:

Content-type: text/html

In order to write your first Perl CGI program, edit a file called test.pl and
type the following contents into that file:

#!/usr/bin/perl
print "Content-type: text/html\r\n\r\n";
print "Hello, World.";

Make sure that this is saved in the path that /cgi-bin/ is aliased to (with
ScriptAlias). Thus, if the above program is saved as /usr/local/apache/cgi-
bin/test.pl, and the Apache config has the ScriptAlias that we mentioned
earlier, then a request to http://localhost/cgi-bin/test.pl should generate the
content "Hello, World." on the browser. There are many modules in Perl with
specialized CGI functionality built in, and the user can refer to one of the
many books on learning Perl and applying it to write CGI programs. Some
useful guides to Perl are [Schwartz & Phoenix 2001] and [Guelich et al 2001].
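
For instance, the standard CGI module bundled with Perl handles header
generation and the parsing of query parameters; a small sketch (the
parameter name 'name' is illustrative):

#!/usr/bin/perl
use strict;
use CGI;

# Parse the request and emit the MIME header
my $q = CGI->new;
print $q->header('text/html');

# Read a query parameter, e.g. from /cgi-bin/test.pl?name=World
my $name = $q->param('name') || 'World';
print "<html><body>Hello, $name.</body></html>\n";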

E. Java Web Service Example

The Web service development pack from Sun is the Java Web
Services Developer Pack Early Access 2 (JWSDP EA2) release. This
includes Java APIs for XML messaging, XML processing,
registries, XML Remote Procedure Call (RPC), and other tools.
JAX-RPC is a set of Java APIs that enable the user to perform Remote
Procedure Calls (RPC) and implement web services using Java.
The steps involved in defining a Java-based Web service are as
follows:

a) Write a Service Method Class

This is also called the service endpoint interface description.
Generally this extends the java.rmi.Remote interface, and its methods
must throw java.rmi.RemoteException. An example is shown
below:

public interface ExampleSearchService extends java.rmi.Remote
{
    public String getSearchResults(String userquery)
        throws java.rmi.RemoteException;
}

Thus, the service name or port is ExampleSearchService, and the
operation allowed is getSearchResults, with the appropriate inputs and
outputs (in this example, Strings).

b) Write a Service Implementation Class

Next we'll write the implementation for the classes.

import javax.xml.rpc.server.ServiceLifecycle;

public class ExampleSearchServiceImpl implements ExampleSearchService, ServiceLifecycle
{
    public void init(Object context) {
        // appropriate init code
    }

    public void destroy() {
        // release appropriately
    }

    public String getSearchResults(String userquery) {
        // get the search results using the query
        return null; // placeholder: return the actual results here
    }
}

Through the ServiceLifecycle interface, the lifecycle of the service is
managed. The init() and destroy() methods are inherited from the
ServiceLifecycle interface.

c) Define configuration file used for Java-WSDL mapping

The configuration file specifies the service name, and its endpoint
interface and class. This is used by another tool called xrpcc in order
to generate the WSDL from the configuration file.

<?xml version="1.0" encoding="UTF-8"?>
<configuration xmlns="http://java.sun.com/jax-rpc-ri/xrpcc-config">
  <rmi name="TestPackage"
       targetNamespace="http://test.org/wsdl"
       typeNamespace="http://test.org/types">
    <service name="ExampleService"
             packageName="TestPackage">
      <interface name="ExampleSearchService"
                 servantName="ExampleSearchServiceImpl"/>
    </service>
  </rmi>
</configuration>

d) Define a XML file for deployment in a servlet container.

The last component is the definition of the web.xml file for the
deployment of the service in a servlet container, generally packaged
into a WAR file. We refer the reader to the Sun developer site for
further description of servlets and packaging as a WAR file.
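
A skeletal web.xml is sketched below purely for illustration; the
servlet class shown (the JAX-RPC runtime servlet that dispatches
requests to the generated ties) and the URL pattern are assumptions
based on JWSDP conventions, and the exact names should be taken
from the JWSDP documentation:

<?xml version="1.0" encoding="UTF-8"?>
<web-app>
  <servlet>
    <servlet-name>ExampleService</servlet-name>
    <!-- Assumed JAX-RPC runtime servlet class -->
    <servlet-class>com.sun.xml.rpc.server.http.JAXRPCServlet</servlet-class>
  </servlet>
  <servlet-mapping>
    <servlet-name>ExampleService</servlet-name>
    <url-pattern>/jaxrpc/*</url-pattern>
  </servlet-mapping>
</web-app>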

In order to deploy the service, the service endpoint interface and
implementation classes are compiled. Next, the xrpcc tool is executed
using the config file, which generates the stub, tie16, and other client-
side and server-side artefacts required by the JAX-RPC runtime.
Lastly, this is bundled and deployed with a servlet container. If the
service is deployed on a server called "testserver" at port 8080, then
the request to the service will be at
http://testserver:8080/ExampleService/jaxrpc . At this URL, the
description of the service is displayed, including the various ports,
WSDLs, and how to invoke each of the service ports.

16 Stubs and ties are intermediaries that enable communication between a service endpoint
and a service client.

On the client side, it is easy to invoke the service using the client stub
for the service, such as:

ExampleSearchService_Stub stub =
    (ExampleSearchService_Stub)
    (new ExampleSearchService_Impl().getExampleSearchServicePort());
stub._setProperty(Stub.ENDPOINT_ADDRESS_PROPERTY,
    "http://testserver:8080/ExampleService/jaxrpc/ExampleServiceIF");
String SResults = stub.getSearchResults(queryString);

The Sun Java developer site http://developer.java.com/ is a useful
site for developers, including for the Java web service platform. A good
tutorial article on this site is one entitled "Getting Started with JAX-
RPC" by Arun Gupta and Beth Stearns.
References
[Agarwal et al 1998] R. Agarwal, J. Gehrke, D. Gunopulos, and P. Raghavan, Automatic
Subspace Clustering of High Dimensional Data for Data Mining Applications. Proc. of
1998 ACM-SIGMOD Intl. Conf. Management of Data (SIGMOD'98), pp: 94-105.

[Aggarwal et al 2001] C. C. Aggarwal, F. Al-Garawi, and P. S. Yu, Intelligent Crawling on
the World Wide Web with Arbitrary Predicates, Proc. of the 10th Intl. World Wide Web
Conference (WWW10), May 1-5, 2001.

[Andersson 2001] Christoffer Andersson, GPRS and 3G Wireless Applications: Professional
Developer's Guide, John Wiley and Sons, 2001.

[Ankerst et al 1999] M. Ankerst, M. Breunig, H.P. Kriegel, and J. Sander, OPTICS: Ordering
Points to Identify the Clustering Structure, Proc. 1999 ACM-SIGMOD Conf. on
Management of Data (SIGMOD'99), pp: 49-60.

[Baeza-Yates 1999] R. Baeza-Yates, and B. Ribeiro-Neto, Modern Information Retrieval,


Addison-Wesley, 1999.

[Ballard 1999] Dana H. Ballard, An Introduction to Natural Computation, MIT Press, 1999.

[Barford & Crovella 1999] Paul Barford and Mark E. Crovella. Measuring Web performance
in the wide area. Performance Evaluation Review, Special Issue on Network Track
Measurement and Workload Characterization, August 1999.

[Beckmann et al 1990] N. Beckmann, H.P. Kriegel, R. Schneider, and B. Seeger, The R* Tree:
An Efficient and Robust Access Method for Points and Rectangles. Proc. 1990 ACM
SIGMOD Intl. Conf. Management of Data (SIGMOD'90), pp: 322-331.

[Ben-Shaul et al 1999] Israel Ben-Shaul, et al., Adding support for dynamic and focused
search with Fetuccino, in Proc. of the Eighth World Wide Web Conference (WWW'8),
1999.

[Bharat & Broder 1998] Krishna Bharat and Andrei Broder, A technique for measuring the
relative size and overlap of public Web search engines, in Proc. of the Seventh World
Wide Web Conference (WWW'7), 1998.

[Bharat & Henzinger 1998] K. Bharat and M. Henzinger, Improved Algorithms for Topic
Distillation in a Hyperlinked Environment, Proc. of the ACM SIGIR Conference, 1998.

[Bharat et al 1998] Krishna Bharat, Andrei Broder, Monika Henzinger, Puneet Kumar, and
Suresh Venkatasubramanian, The Connectivity Server: fast access to linkage information
on the Web, in Proc. of 7th World Wide Web Conference, 1998.

[Bhuyan et al 1991] Jay N. Bhuyan, Jitender S. Deogun, and Vijay V. Raghavan, Cluster-
based adaptive information retrieval, in the Proc. of the 24th Intl. Conference on System
Sciences - Architecture and Emerging Technology Tracks, vol. 1, pp: 307-316, Jan. 1991.

[Boumphrey et al 2000] F. Boumphrey, D. Raggett, J. Raggett, T. Wugofski, C. Greer, and S.
Schnitzenbaumer, Beginning XHTML, Wrox Press, March 2000.

[Bowman et al 1994] Bowman M., Danzig P., Hardy D., Manber U., Schwartz M., and
Wessels D., The Harvest Information Discovery and Access System, in Proc. of Second
Intl. World Wide Web Conference, 1994.

[Bradley et al 1998] P. S. Bradley, U. M. Fayyad, and O. L. Mangasarian, Data Mining:
Overview and Optimization Opportunities, Microsoft Research Technical Report MSR-
TR-98-04, 1998.

[Breese et al 1998] John S. Breese, David Heckerman and Carl Kadie. Empirical analysis of
predictive algorithms for collaborative filtering. In Proceedings of the Fourteenth Annual
Conference on Uncertainty in Artificial Intelligence, pp: 43-52, July 1998.

[Breiman et al 1984] Breiman L., J.H. Friedman, R. Olshen, C.J. Stone. Classification And
Regression Trees, Wadsworth, Belmont CA, 1984.

[Brin & Page 1998] Sergey Brin and Lawrence Page, The Anatomy of a Large-Scale
Hypertextual Web Search Engine, in the Proc. of the Seventh World Wide Web
Conference (WWW'7), 1998.

[Brin 1998] S. Brin, Extracting Patterns and Relations from the World Wide Web, in WebDB
Workshop at 6th Intl. Conference on Extending Database Technology (EDBT'98), 1998.

[Cache Digest] http://www.squid-cache.org/

[CARP 1998] V. Valloppillil, and K.W. Ross, Cache Array Routing Protocol v1.0,
http://icp.ircache.net/carp.txt, Aug. 1998.

[CEN/ISSS 2001] Summaries of some Frameworks, Architectures, and Models for Electronic
Commerce, revision 1.a, Oct. 2001, CEN/ISSS Electronic Commerce Workshop, 2001.

[Chakrabarti et al 1998a] S. Chakrabarti, B. Dom, D. Gibson, J. Kleinberg, P. Raghavan, and
S. Rajagopalan, Automatic Resource Compilation by Analysing Hyperlink Structure and
Associated Text, in Proc. of 7th World Wide Web Conference (WWW7), April 1998.

[Chakrabarti et al 1998b] S. Chakrabarti, Byron Dom, Rakesh Agarwal, and Prabhakar
Raghavan, Scalable feature selection, classification and signature generation for
organizing large text databases into hierarchical topic taxonomies, in The VLDB Journal,
7: 163-178, 1998.

[Chakrabarti et al 1999] S. Chakrabarti, M. Berg, and B. Dom, Focussed Crawling: A New
Approach to Topic Specific Resource Discovery, in Proc. of the 8th Intl. World Wide Web
Conference (WWW8), 1999.

[Chekuri et al 1997] Chandra Chekuri, Michael H. Goldwasser, Prabhakar Raghavan, and Eli
Upfal, Web search using automatic classification, in Proc. of the 6th World Wide Web
Conference (WWW'6), 1997.

[Chen & George 2000] Y.H. Chen, and E. I. George, A Bayesian Model for Collaborative
Filtering, Department of MSIS, University of Texas at Austin, 2000.

[Cheung et al 1997] D. W. Cheung, B. Kao, and J. W. Lee, Discovering User Access Patterns
on the World-Wide Web, Proc. First Pacific-Asia Conference on Knowledge Discovery
and Data Mining (PAKDD-97).

[Cho et al 1998] Junghoo Cho, Hector Garcia-Molina, and Lawrence Page, Efficient crawling
through URL ordering, in Proc. of 7th World Wide Web Conference, 1998.

[Cho & Garcia-Molina 2000a] Junghoo Cho, Hector Garcia-Molina, The Evolution of the
Web and Implications for an Incremental Crawler, in Proceedings of 26th International
Conference on Very Large Databases (VLDB), September 2000.

[Cho & Garcia-Molina 2000b] Junghoo Cho, Hector Garcia-Molina, Synchronizing a Database
to Improve Freshness, in Proceedings of 2000 ACM International Conference on
Management of Data (SIGMOD), May 2000.

[cHTML] http://www.w3.org/TR/1998/NOTE-compactHTML-19980209/ Compact HTML
for Small Information Appliances, W3C NOTE 09-Feb-1998, Tomihisa Kamada.

[Clark 1999] David Clark, Preparing for a New Generation of Wireless Data, IEEE
Computer, pp:8-11, Aug. 1999.

[Croft et al 1995] Croft, W. B., Cook, R., and Wilder, D., Providing government information
on the Internet: Experiences with THOMAS, in Proc. of the Digital Libraries Conference
(DL'95), 1995.

[De Bra et al 1994] P. De Bra, G. J. Houben, Y. Kornatzky, and R. Post, Information retrieval
in distributed hypertexts, in Proc. of RIAO'94, Intelligent Multimedia, Information
Retrieval Systems and Management, New York, NY, 1994.

[Deerwester et al 1990] S. Deerwester, S. T. Dumais, G. W. Furnas, T. K. Landauer, and R.
Harshman, Indexing by Latent Semantic Analysis, J. Amer. Soc. Inform. Sci., 41,
391-407, 1990.

[Dempster et al 1977] A.P. Dempster, N.M. Laird, and D.B. Rubin. Maximum Likelihood
from Incomplete Data via the EM Algorithm. Journal of the Royal Statistical Society,
Series B (Methodological), 39(1): 1-38, 1977.

[Dreilinger & Howe 1997] Daniel Dreilinger and Adele E. Howe, Experiences with Selecting
Search Engines using Meta-Search, ACM Trans. on Information Systems, vol. 15, No. 3,
pp: 195-222, July 1997.

[Duda & Hart 1973] R. O. Duda, and P.E. Hart, Pattern Classification and Scene Analysis,
New York: Wiley, 1973.

[Edwards et al 2001] J. Edwards, K. McCurley, and J. Tomlin, An Adaptive Model for
Optimizing Performance of an Incremental Web Crawler, Proc. of 10th World Wide Web
Conference (WWW10), May 1-5, 2001.

[Ellison & Schneier 2000] C. Ellison, and B. Schneier, Ten Risks of PKI: What You're Not
Being Told About Public Key Infrastructure, Computer Security Journal, Vol. XVI, No. 1,
pp: 1-8, 2000.

[Ester et al 1996] M. Ester, H.P. Kriegel, J. Sander, and X. Xu, A Density Based Algorithm
for Discovering Clusters in Large Spatial Databases, Proc. 1996 Intl. Conf. Knowledge
Discovery and Data Mining (KDD'96), pp: 226-231.

[Fayyad et al 1996] U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy
(Eds.), Advances in Knowledge Discovery and Data Mining, AAAI/MIT Press, 1996.

[Fisher 1936] R. A. Fisher, The use of multiple measurements in taxonomic problems, Ann.
Eugen., vol. 7, pp: 178-188, 1936.

[Foo et al 2000] Soo Mee Foo (Ed.), Ted Wugofski, Wei Meng Lee, Foo Soo Mee, Karli
Watson, Beginning WAP, WML, and WMLScript, Wrox Press, 2000.

[Frakes & Baeza-Yates 1993] Edited by William B. Frakes, and Ricardo Baeza-Yates,
Information Retrieval: Data Structures & Algorithms, Prentice Hall, 1993.

[Francis & Kucera 1982] Francis, W., and H. Kucera, Frequency Analysis of English Usage,
New York: Houghton Mifflin, 1982.

[Frost 2000] Martin Frost, Learning WML and WMLScript, O'Reilly, 2000.

[Garfinkel et al 2002] Simson Garfinkel, Gene Spafford, Debby Russell, Web Security,
Privacy and Commerce, O'Reilly & Associates, 2002.

[Garofalakis et al 1999] John Garofalakis, Panagiotis Kappos, and Dimitris Mourloukos, Web
Site Optimization Using Page Popularity, pp: 22-29, IEEE Internet Computing,
July-August 1999.

[Gauch et al 1999] Susan Gauch, Jianying Wang, and Satya Mahesh Rachakanda, A corpus
analysis approach for automatic query expansion and its extension to multiple databases,
ACM Trans. on Information Systems, vol. 17, no. 3, pp: 250-269, July 1999.

[Gibson et al 1998] D. Gibson, J. Kleinberg, and P. Raghavan, Inferring web communities
from link topology, in Proc. of 9th ACM Conference on Hypertext and Hypermedia, 1998.

[Goldberg et al 2000] K. Goldberg, T. Roeder, D. Gupta, and C. Perkins, Eigentaste: A
Constant Time Collaborative Filtering Algorithm, UCB Electronics Technical Report
M00/41, 2000.

[GSM-IEC] http://www.iec.org/tutorials/gsm/ See tutorials on Global System for Mobile
Communications.

[Guelich et al 2001] Scott Guelich, Shishir Gundavaram, and Gunther Birznieks, CGI
Programming with Perl, O'Reilly, 2001.

[Gudivada et al 1997] Venkat N. Gudivada, Vijay V. Raghavan, William I. Grosky, and
Rajesh Kasanagottu, Information Retrieval on the World Wide Web, IEEE Internet
Computing, pp: 58-68, Sept/Oct. 1997.

[Guthery 2001] Scott Guthery, Mobile Application Development: Using SMS and the SIM
Toolkit, McGraw Hill, 2001.

[Hafer & Weiss 1974] Hafer, M., and S. Weiss, Word Segmentation by Letter Successor
Varieties, Information Storage and Retrieval, 10, 371-385, 1974.

[Han et al 2001] J. Han, M. Kamber, and A. K. H. Tung, Spatial Clustering Methods in Data
Mining: A Survey, H. Miller and J. Han (eds.), Geographic Data Mining and Knowledge
Discovery, Taylor and Francis, 2001.

[Hawking et al 1999] David Hawking, Nick Craswell, and Paul Thistlewaite, Results and
challenges in Web search evaluation, in Proc. of the 8th World Wide Web Conference
(WWW'8), 1999.

[HDML] http://www.w3.org/TR/NOTE-Submission-HDML-spec.html "Handheld Device
Markup Language Specification" submitted to the W3C 1997, Eds. Peter King, and Tim
Hyland.

[Heckerman et al 2000] "Dependency Networks for Inference, Collaborative Filtering, and
Data Visualization", Heckerman D., D. M. Chickering, C. Meek, R. Rounthwaite, and C.
Kadie, Jour. of Machine Learning Research (JMLR), vol. 1, pp: 49-75, 2000.

[Hersovici et al 1998] M. Hersovici, M. Jacovi, Y. S. Maarek, D. Pelleg, M. Shtalhaim, and S.
Ur. The shark-search algorithm - An application: Tailored Web site mapping. In Proc.
7th Intl. World-Wide Web Conference, 1998.

[Henzinger et al 1999] Monika R. Henzinger, Allan Heydon, Michael Mitzenmacher, and Marc
Najork, "Measuring index quality using random walks on the web", in Proc. of the Eighth
World Wide Web Conference (WWW'8), 1999.

[Hori et al 2000] Masahiro Hori, Goh Kondoh, Kouichi Ono, Shin-ichi Hirose, and Sandeep
Singhal, Annotation-Based Web Content Transcoding, Proc. of the 9th Intl. World Wide
Web Conference, Amsterdam, May 2000.

[HTCP 1999] P. Vixie, Vayu, HyperText Cache Protocol, http://icp.ircache.net/htcp.txt, Feb.
1999.

[ICP 1997] D. Wessels and K. Claffy, Internet Cache Protocol (v2), RFC-2186,
http://icp.ircache.net/rfc2186.txt, Sept. 1997.

[IETF ECML 2001] J. M. Parsons, Electronic Commerce Modelling Language (ECML):
Version 2 Specifications, IETF Internet Draft, 2001.

[IETF IOTP Req. 2001] D. E. Eastlake, Internet Open Trading Protocol: Version 2
Requirements, IETF Internet Draft, 2001.

[IETF Pay IOTP 2001] W. Hans, Y. Kawatsura, and M. Hiroya, Payment API for v1.0
Internet Open Trading Protocol, IETF Internet Working Draft, 2001.

[Joachims et al 1997] Thorsten Joachims, Dayne Freitag, Tom Mitchell, WebWatcher: A
Tour Guide for the World Wide Web, IJCAI'97.

[Karypis 2000] G. Karypis, Evaluation of Item-Based Top-N Recommendation Algorithms,
U. Minnesota, Report #00-046, 2000.

[Kaasinen et al 2000] Eija Kaasinen, Matti Aaltonen, Juha Kolari, Suvi Melakoski, Timo
Laakko, Two approaches to bringing Internet services to WAP devices, Proc. of the 9th
Intl. World Wide Web Conference, Amsterdam, May 2000.

[Kay 2001] M. H. Kay, XSLT Programmer's Reference 2nd Edition, Wrox Press Inc., 2001.

[Kleinberg 1998] J. Kleinberg, Authoritative Sources in a Hyperlinked Environment, Proc.
ACM-SIAM Symposium on Discrete Algorithms, 1998; also see Jour. ACM, pp: 604-632,
Sept. 1999.

[Kobielus 2001] J. Kobielus, XML Based Specifications for Security Interoperability, The
Burton Group, Network Strategy Overview, June 2001.

[Krishnamurthy et al 1999] B. Krishnamurthy, J. C. Mogul, and D. M. Kristol, Key
differences between HTTP/1.0 and HTTP/1.1, Proc. 8th Intl. World Wide Web
Conference, Toronto, 1999.

[Krishnamurthy & Rexford 2001] B. Krishnamurthy and J. Rexford, Web Protocols and
Practice: HTTP/1.1, Networking Protocols, Caching, and Traffic Measurement, Addison-
Wesley, 2001.

[Lawrence & Giles 1999] Steve Lawrence and Lee Giles, Accessibility and distribution of
information on the web, Nature, 400, pp 107-109, 1999.

[MacQueen 1967] J. MacQueen, Some Methods for Classification and Analysis of
Multivariate Observations. Proc. of 5th Berkeley Symp. Math. Statist. Prob. 1, pp: 68-75.

[Manber et al 1997] Udi Manber, Mike Smith, Burra Gopal, WebGlimpse: Combining
Browsing and Searching, Proceedings of 1997 Usenix Technical Conference, 1997.

[Mandelbrot 1983] B.B. Mandelbrot, The Fractal Geometry of Nature, Freeman, New York,
1983.

[McBryan 1994] Oliver McBryan, GENVL and WWWW: Tools for Taming the Web, First
World Wide Web Conference, 1994.

[Melnik et al 2001] S. Melnik, S. Raghavan, B. Yang, and H. Garcia-Molina, Building a
Distributed Full-Text Index for the Web, in Proc. of the 10th Intl. World Wide Web
Conference (WWW10), May 2-5, 2001.

[Mercer 2001] D. Mercer, XML: A Beginner's Guide, Osborne, McGraw Hill, 2001.

[Meyer 2000] E. A. Meyer, Cascading Style Sheets: The Definitive Guide, O'Reilly and
Associates, 2000.

[Miller & Bharat 1998] Robert C. Miller, and Krishna Bharat, SPHINX: A framework for
creating personal, site-specific Web crawlers, in Proc. of 7th World Wide Web
Conference, 1998.

[Miller et al 1996] B. N. Miller, J. T. Riedl, J. A. Konstan, Experience with GroupLens:
Making Usenet Useful Again, Univ. Minnesota Report, 1996.

[Najork & Wiener 2001] M. Najork, and J. L. Wiener, Breadth-first search crawling yields
high-quality pages, Proc. of 10th Intl. World Wide Web Conference (WWW10), May 2-5,
Hong Kong, 2001.

[Park et al 1995] J. S. Park, M.S. Chen, and P. S. Yu, An Effective Hash-Based Algorithm for
Mining Association Rules, 1995 SIGMOD, pp: 175-186.

[Perkowitz & Etzioni 1999] Mike Perkowitz and Oren Etzioni, Towards Adaptive Web Sites:
Conceptual Framework and Case Study, in Proc. of 8th World Wide Web Conference
(WWW8), Toronto, 1999.

[Porter 1980] Porter, M.F., An Algorithm for Suffix Stripping, Program, 14(3), 130-137,
1980.

[Quinlan 1993] J. R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, San
Mateo, CA, 1993.

[Raggett et al 1998] Dave Raggett, Jenny Lam, Ian Alexander, and Michael Kmiec, Raggett
on HTML 4, Addison-Wesley, 1998.

[RFC 1939] J. Myers, and M. Rose, Post Office Protocol (POP) version 3, RFC 1939, 1996.

[RFC 2060] M. Crispin, Internet Message Access Protocol - Version 4rev1, RFC 2060, 1996.

[RFC 2683] B. Leiba, IMAP4 Implementation Recommendations, RFC 2683, 1999.

[RFC 2821] J. Klensin (Ed.), Simple Mail Transfer Protocol, RFC 2821, 2001.

[Rissanen 1989] J. Rissanen, Stochastic Complexity in Statistical Inquiry, World Scientific
Publ. Co., 1989.

[Rumelhart et al 1987] D.E. Rumelhart, G.E. Hinton, and R.J. Williams, Learning internal
representations by error propagation, in Parallel and Distributed Processing, D. E.
Rumelhart, and D. McClelland, Eds., vol. 1. Cambridge, MA: MIT Press, 1987, pp:
318-362.

[Sarukkai 2000] R. R. Sarukkai. Link prediction and path analysis using Markov chains. In
Computer Networks, pp: 1-6, June 2000; also presented at the 9th World Wide Web
Conference (WWW9), Amsterdam.

[Sarukkai 2002] R. R. Sarukkai. The Synaptic World Wide Web. Manuscript in preparation.

[Sarwar et al 2000] B. Sarwar, G. Karypis, J. Konstan, and J. Riedl. Application of
dimensionality reduction in recommender systems - a case study. In ACM WebKDD
Workshop, 2000.

[Schafer et al 2001] Schafer, J. B., Konstan, J. A., & Riedl, J., E-commerce recommendation
applications, Data Mining and Knowledge Discovery, 2001.

[Schwartz & Phoenix 2001] Randal L. Schwartz, & Tom Phoenix, Learning Perl (3rd
Edition), O'Reilly, 2001.

[SciAmer 1999] Members of the IBM Clever Project, Hypersearching the Web, in Scientific
American, June 1999.

[Shahabi et al 1997] Cyrus Shahabi, Amir M. Zarkesh, Jafar Adibi, and Vishal Shah,
Knowledge Discovery from Users' Web-Page Navigation, IEEE RIDE 1997.

[Shardanand & Maes 1995] Upendra Shardanand and Pattie Maes, Social Information
Filtering: Algorithms for Automating Word of Mouth, Proc. of CHI'95, Denver, Colorado,
USA (May 7-11, 1995).

[Sheikholeslami et al 1998] G. Sheikholeslami, S. Chatterji, and A. Zhang, WaveCluster: A
Multiresolution Clustering Approach for Very Large Spatial Databases, Proc. 1998 Intl.
Conf. Very Large Databases (VLDB'98), pp: 428-439.

[SMS-IEC] http://www.iec.org/tutorials/ See tutorials on Short Messaging Services.

[Soboroff & Nicholas 1999] Ian M. Soboroff and Charles K. Nicholas. Combining content
and collaboration in text filtering. In Thorsten Joachims, editor, Proceedings of the
IJCAI'99 Workshop on Machine Learning in Information Filtering, pp: 86-91,
Stockholm, Sweden, August 1999.

[Srikant & Agarwal 1995] R. Srikant, and R. Agarwal, Mining Generalized Association
Rules, in Proc. of the 21st Conference on Very Large Databases, 1995.

[Srikant & Yang 2001] R. Srikant, and Y. Yang, Mining Web Logs to Improve Website
Organization, in Proc. of the 10th Intl. World Wide Web Conference (WWW10), May 1-5,
2001.

[Srikant et al 1997] R. Srikant, Q. Vu, and R. Agarwal, Mining Association Rules with Item
Constraints, in Proc. of 3rd Intl. Conference on Knowledge Discovery in Databases and
Data Mining, Newport Beach, California, Aug. 1997.

[Stevens 1994] W. R. Stevens, The Protocols (TCP/IP Illustrated, Volume 1), Addison-Wesley,
1994.

[Tanenbaum 1996] A. S. Tanenbaum, Computer Networks: 3rd Edition, Prentice Hall PTR,
1996.

[VoiceXML] http://www.w3.org/voice/ W3C Voice Activity group.

[WAP] http://wapforum.org/ Wireless Application Protocol, and Wireless Markup Language
specifications.

[Wang et al 1997] W. Wang, J. Yang, and R. Muntz. STING: A Statistical Information Grid
Approach to Spatial Data Mining. Proc. 1997 Intl. Conf. Very Large Databases
(VLDB'97), pp: 186-195.

[Wexelblat & Maes 1999] Alan Wexelblat and Pattie Maes, Footprints: History-Rich Tools
for Information Foraging, CHI'99.

[Wright et al 1995] G. R. Wright, W. R. Stevens, The Implementation (TCP/IP Illustrated,
Volume 2), Addison-Wesley, 1995.

[Wu et al 2001] Z. Wu, W. Meng, C. Yu, and Z. Li, Towards a Highly-Scalable and Effective
Metasearch Engine, in Proc. of the 10th Intl. World Wide Web Conference (WWW10),
May 2-5, 2001.

[WWW Commerce Site Examples] http://store.yahoo.com/, http://shopping.yahoo.com/,
http://auctions.yahoo.com/, http://www.ebay.com/, http://www.amazon.com/

[WWW Google] http://www.google.com/

[WWW Mobile Sites] http://mobile.yahoo.com/, http://phone.yahoo.com/,
http://www.nokia.yahoo.com/, http://www.samsung.com/, http://www.palm.com/,
http://www.ntt-docomo.com/

[WWW .NET] http://www.microsoft.com/net

[WWW P3P] http://www.w3c.org/p3p

[WWW SearchEngineWatch]
http://searchenginewatch.internet.com/reports/sizes.html

[WWW SearchDirectory Sites] http://www.yahoo.com/, http://google.yahoo.com/,
http://www.msn.com/

[WWW Ullman] http://www-db.stanford.edu/~ullman/mining/mining.html

[WWW WebSphere] http://www.ibm.com/websphere

[WWW Yahoo!] http://www.yahoo.com/

[Zamir & Etzioni 1998] Oren Zamir, and Oren Etzioni, Grouper: A Dynamic Clustering
Interface to Web Search Results, in Proc. of the 7th World Wide Web Conference
(WWW'7), 1998.

[Zipf 1949] G. Zipf, Human Behavior and the Principle of Least Effort. Reading, MA:
Addison-Wesley, 1949.
Acronyms

2G Second Generation Wireless Network


3G Third Generation Wireless Network
AMPS Advanced Mobile Phone Service
B2B Business to Business
BSC Base Station Controller
BTS Base Transceiver Station
CARP Cache Array Routing Protocol
CDMA Code Division Multiple Access
CGI Common Gateway Interface
cHTML Compact HyperText Markup Language
CSS Cascading Style Sheets
DNS Domain Name System
Ecommerce Electronic Commerce
EIR Equipment Identity Register
FDMA Frequency Division Multiple Access
FTP File Transfer Protocol
GPRS General Packet Radio Services
GPS Global Positioning System
GSM Global System for Mobile Communication
HDML Handheld Device Markup Language
HITS Hyperlink Induced Topic Search
HLR Home Location Register
HSCSD High-Speed Circuit Switched Data Technology
HTCP HyperText Caching Protocol
HTML HyperText Markup Language
HTTP HyperText Transfer Protocol
HTTPS HyperText Transfer Protocol Secure
ICP Internet Cache Protocol
IM Instant Messaging
IMAP Internet Message Access Protocol
IP Internet Protocol
LSI Latent Semantic Indexing
MSC Mobile Services Switching Center
P3P Platform for Privacy Preferences
PLMN Public Land Mobile Network
POP Post Office Protocol
PSTN Public Switched Telephone Network
SIM Subscriber Identity Module
SMS Short Messaging Services
SMTP Simple Mail Transfer Protocol
SOAP Simple Object Access Protocol
SSL Secure Sockets Layer
TCP/IP Transmission Control Protocol/Internet Protocol
TDMA Time Division Multiple Access
TF-IDF Term Frequency-Inverse Document Frequency
UDDI Universal Description, Discovery and Integration
UDP User Datagram Protocol
URI Universal Resource Identifier
URL Uniform Resource Locator
VLR Visitor Location Register
VXML VoiceXML
WAP Wireless Application Protocol
WDP Wireless Datagram Protocol
WML Wireless Markup Language
WSDL Web Services Description Language
WTLS Wireless Transport Layer Security
WWW World Wide Web
XHTML eXtensible HyperText Markup Language
XML eXtensible Markup Language
XSL eXtensible Style Sheet Language
XSLT eXtensible Style Sheet Language Transformation
Index

Common Gateway Interface (CGI) 70
compact HTML (cHTML) 220
.NET Confidence 142
web services 191,244 Association rules 142
crawling procedure 118
CURE
A Priori Algorithm
clustering 161
association mining 143
agglomerative clustering 161
ARPAnet 84, 178 Data mining 141
Association mining See Association rules Data Mining 140
Association rules 142 Decision trees 150
attribute Density Based
XML attribute 21 clustering 162
Authorization Description
e-commerce 77,182,201,202 web service 239
Digital Certificates 80
Discovery
Back link count
web services 239
ranking 129
distiller
BFR algorithm
focussed crawling 119
Web Mining, clustering 159
DNS 57, 58, 85, 213, 239, 252
billing 196,198,199,200,201,202,203
DNS round-robin
BizTalk
load balancing 68
e-commerce frameworks 189
Document Matrix 107
Document Type Definition 18, 19,20,28
Cache Array Routing Protocol (CARP) domain name 57
67 Domain Name Service 57
cache digests 67 DTD 18, 19,20,24,28,51
Caching protocols 67 dynamic search 127
Cascading style sheets (CSS) 48
Causality
ebXML
association mining 142
e-commerce frameworks 190
CDATA
ECo framework
XML 24
e-commerce frameworks 190
centroids
e-commerce 8, 187, 188, 196, 197, 198,
clustering 160, 161, 163, 164
203
challenge and response 79
Electronic mail, e-mail
security 79
protocols 179
CHAMELEON
electronic payment
clustering 161
e-commerce 193
Character font and sizing
E-mail 178
HTML 44
Encryption 79
classification 7, 135, 145, 146, 147, 148,
episode mining 165
149, 150, 151, 155, 252, 253, 272
error function 149, 150, 159
click-stream analysis 167
Expectation-Maximization
click-trails 168
clustering 160
Client/Server 64
eXtensible Markup Language 17
clustering 7,90, 101, 103, 157, 158, 159,
Extensible Style Sheets
160, 161, 162, 163,253
XSL, XSLT 29
collaborative filtering 153

Indexing 96
FastMap information retrieval xix, 3, 4, 5, 6, 7, 88,
clustering 163 110, 125,251,252,271
Fetuccino 128 instant messaging xix, 178, 184, 186,253
FishSearch 127 Interactive Financial Exchange
Focussed Crawling 127 e-commerce 189, 194
Forms Internet Cache Protocol (ICP) 67
HTML 47 Internet Message Access Protocol
fraud protection 198 (IMAP) 183
Internet Open Trading Protocol (IOTP)
General Packet Radio Service (GPRS) e-commerce 192
231 Inverse Document Frequency 100
gini index Inverted files 96
decision trees 152 IP
Global System for Mobile internet protocol 55
Communication (GSM) 209 IPv6 57
GPRF itemsets
clustering 161 association mining 144
grid based
clustering 162 Java E-Commerce framework
GSM 209 e-commerce platforms 191

Handheld Device Markup Language Karhunen-Loeve (KL) transformation
(HDML) 220 web mining 170
hardware based routers/switches K-means
load balancing 68 clustering 159
Head and Body k-medoids
HTML 43 clustering 159
Hierarchical Methods
clustering 160 Latent Semantic Indexing 7, 106
High-speed circuit-switched data linear classifier 146, 149, 155, 156
technology (HSCSD) 231 link prediction 167
HITS Links
Web graph analysis 130 HTML 45
HTML 16, 41, 42 Load-balancing 68
HTTPS 82 location metric
hubs and authorities ranking 130
web graph analysis 130
Hyper Text Transfer Protocol (HTTP) 71, Meta-search 125
252 Micropayment
Hypertext Markup Language 16, 41 e-commerce 193, 194
MilliCent
IFX e-commerce, payment 194
e-commerce 189, 194, 195, 196 Minimum Description Length (MDL)
IM 152
instant messaging 8, 178, 184, 185, Mirroring
186,187,205 of web sites 68
Images Mobile 5, 8, 207, 274, 275
HTML 46

mobile-originated short message (MO- Return on Investment (ROI)
SM) 227 for web sites 166
mobile-terminated short message (MT- RosettaNet 193
SM) 227
multi-layered neural network 149, 150 SavySearch 126
Schemas
Neural networks 148 XML 25
N-gram stemmers 93 Scripting 49
Non-Repudiation Secure Electronic Transaction (SET)
web security 79, 83 e-commerce 194
NTT DoCoMo Secure Sockets Layer 82
mobile 231 security 5, 6, 66, 78, 79, 85,186,201,
212,231,244
OMG architecture Security 188
e-commerce systems 191 semantic web 249
Open Applications Group Server Farms 68
e-commerce 192 Server Log Analysis
Open Buying on the Internet (OBI) 192 Web Mining 166
Open Financial Exchange (OFX) Server logs 168
e-commerce, payment 194 Settlement
e-commerce 192,201,202
SGML 16,41
P3P 84
Page rank SharkSearch 128
Short Message Service Center (SMSC)
ranking 129
Paragraph, breaks, and headings mobile 227
HTML 43 Short Messaging Service (SMS) 227
Partition methods Signal Transfer Point (STP)
mobile 227
clustering 158
Platform for Privacy Preferences 84 signature files 99
Porter's algorithm Simple Mail Transfer Protocol (SMTP)
stemming 93 180
Post Office Protocol (POP) 181 Simple Object Access Protocol (SOAP)
Precision 108, 109, 110 240
Predictive modelling 145 Singular Value Decomposition (SVD)
prefix tree 97,98,99 107
Proxy 65,66, 76, 77, 207, 224, 252 SM algorithm
collaborative filtering 153
Smart Cards
query expansion 7,87,90, 106, 125, 126,
e-commerce, payment 193
127, 274 SMTP 59
Query expansion SOAP 8, 189,240, 242, 244, 253
Web 127
Spamming 125
split points
ranking 7,87,90, 100, 101, 129, 136,252 decision trees 152
Recall 109, 110 SSL 82
Recommendation systems 168 stemming 90, 92, 94, 95, 96, 97, 110
regression 145, 148 stoplists 90, 92
Regression 147 Subscriber Identity Module (SIM) card
Resource Description Framework (RDF) 231
249,257

Support
Association rules 143 XHTML 48, 50, 219, 220, 223, 251, 272,
Support Vector Machines 150 284
symmetric keys XML 13, 51
security 81 XPath 41

Tables Zipf distribution 66


HTML 46
TCP segment 60
TCP/IP 59
Term Frequency 100
TF-IDF 87, 100, 101, 106, 107, 110
thesaurus 90,95, 110
Time Division Multiple Access (TDMA)
209
tokenization 90, 91, 110
Tokens
e-commerce, payment 193
Transaction Management 202
Transcoding 222
Transmission Control Protocol 60

UDDI 8, 239, 240, 241, 253
UDP 63
Unique Universal Identifier (UUID)
IFX, e-commerce 196
unsupervised learning 157
URL 58
User Datagram Protocol 63

Valid Document 28

Web Commerce 187


Web directories 132
Web Mining 165
Web Security 78
Web Service Inspection Language
(WSIL) 248
WebGlimpse 127
well-formed XML document 27
Wireless Application Protocol (WAP)
211
Wireless Markup Language (WML) 211
Wireless Web 8, 207, 208
WMLCard 214
WML 2.0 219
World Wide Web Worm 116
