Foundations of Web Technology
THE KLUWER INTERNATIONAL SERIES
IN ENGINEERING AND COMPUTER SCIENCE
FOUNDATIONS OF
WEB TECHNOLOGY
by
Ramesh R. Sarukkai
Senior Architect, Yahoo Inc, USA.
Sarukkai, Ramesh R.
Foundations of Web Technology
ISBN 978-1-4613-5409-3 ISBN 978-1-4615-1135-9 (eBook)
DOI 10.1007/978-1-4615-1135-9
All rights reserved. No part of this work may be reproduced, stored in a retrieval
system, or transmitted in any form or by any means, electronic, mechanical,
photocopying, microfilming, recording, or otherwise, without the written
permission of the Publisher, with the exception of any material supplied
specifically for the purpose of being entered and executed on a computer system,
for exclusive use by the purchaser of the work.
Contents

Contributors
Acknowledgements
Preface

Part I Fundamentals

1 Introduction
1. WORLD WIDE WEB
2. CORE TECHNOLOGY
3. WHAT'S COVERED IN THIS BOOK
4. ORGANIZATION OF THE BOOK

2 Data Markup
1. INTRODUCTION
2. DATA MARKUP
3. EXTENSIBLE MARKUP LANGUAGE (XML)
4. EXTENSIBLE STYLE SHEETS
5. XPATH
6. HYPERTEXT MARKUP LANGUAGE (HTML)
7. CONCLUSION
FURTHER READING
EXERCISES

3 Networking
1. INTRODUCTION
2. LAYERING OF NETWORKS
3. LOCATING ENDPOINTS
4. TRANSMISSION PROTOCOLS
5. CLIENT/SERVER
6. HYPERTEXT TRANSFER PROTOCOL (HTTP)
7. WEB SECURITY
8. PRIVACY
9. CONCLUSION
FURTHER READING
EXERCISES

4 Information Retrieval
1. INTRODUCTION
2. COMPONENTS OF IR SYSTEM
3. TEXT PROCESSING
4. INDEXING AND SEARCH
5. RANKING
6. QUERY OPERATIONS
7. LATENT SEMANTIC INDEXING
8. EVALUATION METRICS
9. CONCLUSIONS
FURTHER READING
EXERCISES

Part II Applications

5. CLUSTERING
6. OTHER DATA MINING PROBLEMS
7. EXAMPLES OF WEB MINING
8. CONCLUSION
FURTHER READING
EXERCISES

10 Conclusion
1. REVIEW
2. SYSTEM DESIGN OVERVIEW
3. LIMITATIONS
4. THE FUTURE

APPENDIX
REFERENCES
ACRONYMS
INDEX
The editorial staff at Kluwer have been very supportive and helpful with
my many simple formatting and editorial questions: I would like to
especially thank Sharon Palleschi and Susan Lagerstorme-Fife for their
prompt responses and formatting support. I cannot list the many other
people who have shaped or influenced my thoughts and thus the contents of
this book, but I thank them for that, and I apologize for any unintentional
omissions. While all the good in this book is attributable to the feedback,
discussions and support of many people, any error or fault is my own.
The poetic quotations at the beginning of the preface and chapters are
from the work "Fireflies" by Nobel Laureate Rabindranath Tagore.
Preface
"Birth is from the mystery ofnight into the greater mystery ofday"
My idea of writing a book on the concepts that power the Web started in
1999. The huge growth of the Web has fuelled a variety of applications that
are being used by millions around the world. Despite the ups and downs in
the dot-com business world, Web technology has solidified over the last decade.
The applications that power the Web derive their strengths from a diverse
range of technologies, such as information retrieval and mobile data access.
Audience
This book has been written to appeal to a wide range of audiences. For a
person interested in understanding the basic concepts of Web technology,
this book covers the fundamentals and the techniques needed to build Web
applications. For the professional who has worked on specific parts of Web
or related technology, this book will provide a broad understanding of the
architecture of different applications on the Web, and how they relate to
each other. The techniques are discussed at both a conceptual and a
practical level, so that the ideas discussed can be translated into real-world
prototypes. The pedagogical style of the book coupled with the
numerous examples, illustrations and exercises makes the content accessible
to a wide variety of audiences.
Course Textbook
FUNDAMENTALS
Chapter 1
Introduction
'Let me light my lamp', says the star. 'And never debate if it will help to remove the darkness'
The World Wide Web has grown phenomenally over the last decade.
Ranging from the growth of Web servers across the Web, the rapid adoption
of Hyper-Text Markup Language, to the availability of information in the
form of hundreds of millions of Web pages that are linked to each other, the
Web has changed the way in which information is accessed, how people
communicate with each other and how buying/selling activities are
accomplished with electronic commerce. With such rapid growth and
adoption of the Web, technology has been struggling to keep up with
emerging standards, a highly competitive marketplace, and a maturing
underlying technology. Over the last decade, Web technology has transformed into a
unique field of study. While the underlying technologies are distributed
systems, networking, information retrieval and security, the specialization
and application of these technologies to build scalable Web systems makes
the study of Web technology unique.
[Figure: Number of Web pages indexed by search engines (in millions), 1995 to 2001.]
2. CORE TECHNOLOGY
How does one build scalable Web applications? What are the underlying
technologies that make the Web work? What are the issues and problems
that need to be addressed? These are some of the questions that are answered
in this book. Unlike many other fields where research and development are
focussed on a specific problem or field of interest, the Web is derived from a
diverse set of technologies. There is no one area of specialization that will
suffice to build Web applications. Rather, there are many underlying
technologies that make the Web work, each with specialized studies to
develop and tailor algorithms to each Web application.
Now let's turn our attention to Web applications. The set of Web
applications discussed in this book include:
a. Directory & Search
b. Web Mining
c. Messaging & Commerce
d. Mobile Access
e. Web Services
Why did we limit our study to the above areas? Why did we not include
many other applications such as streaming and broadcast services,
personalization, listings, maps, auctions, media, finance, news, and business-
to-business commerce? Each application has its own unique problem-specific issues and
technological hurdles. The applications discussed in this book are chosen to
be representative of web applications at large.
This book is divided into two parts: fundamentals and applications. The
fundamentals part covers the basic technology that drives much of the Web
development. The fundamentals part consists of the following chapters:
• Data Markup
The chapter on Data Markup motivates the need for a standard
representation of data. Since the Web is a resource for a vast amount
of information, it is vital that this information is represented in a
format that is useful for exchange between businesses, vendors, or
even just clients. The eXtensible Markup Language (XML) has been defined
for this purpose. Another important aspect of data representation
systems is the need for transforming data specified in one form to
another, and the presentation of this data. The extensible Style Sheets
(XSL) approach to doing this is illustrated with examples. The
structural integrity of XML documents is maintained by defining the
corresponding "Document Type Definition (DTD)" or "Schema".
The Hyper-Text Markup Language (HTML) that fuelled the
widespread adoption of the Web falls into the same category as a
data markup language.
• Networking
Networking is the backbone of the Web. How do two end-points
identify each other on the Web? What protocols do they use to
communicate with each other? How do Web systems ensure security
and privacy between communicating parties? What is the
client/server architecture? What is a "proxy" and how does one
distribute and replicate data over the Web? What are some
communication protocols used in distributed cache systems? These
are some of the issues presented in the chapter on "networking".
Protocols such as the TCP/IP suite, HTTP, SSL, and web security
methods are discussed in this chapter. Protection of privacy using the
Platform for Privacy Preferences (P3P) is also summarized.
• Information Retrieval
The third fundamental area that is an integral part of many Web
applications is information retrieval and text analysis. While
information retrieval is a broad subject with applications ranging
from text analysis to speech/multimedia indexing, the techniques
used for processing, indexing and retrieving documents from textual
databases are presented in this chapter. The steps involved in text
• Web Mining
The Web consists of hundreds of millions of documents, with
billions of page views and keyword searches every day. Furthermore,
it is possible to track users' interests, sites visited, goods purchased
online and various other information. Such information can be mined
using data mining techniques in order to determine trends, general
user interests, and enhance the personalized Web access features to
the end consumer. Of course, the most important aspect of Web
mining is ensuring that the users' privacy is respected, and the data
collected with the user's authority is protected from misuse. An
overview of data mining techniques ranging from association mining,
classification, clustering, and sequence matching is presented with
examples. How such techniques are applied to the Web is discussed
in this chapter on Web mining. Some of the applications discussed
include server log analysis, link prediction and recommendation
systems.
was integrated with the Web, making it easy for people to manage,
and use their messaging facilities. Instant messaging and chat are
other prominent messaging applications that handle millions of
users. An Instant Messaging (IM) system is discussed to illustrate the
design issues involved.
• Mobile Access
One of the trends that emerged in the last few years is the notion
of mobile access to Web information. The Mobile Web (or "Wireless
Web" as it is sometimes called) is a combination of access to Web
information and the notion of mobility. With wireless devices, users
are able to be away from their computers, yet have access to the
information from the Web using wireless devices. How does the
integration of mobile and Web technology work? As an illustration,
the Wireless Application Protocol Suite is discussed along with the
wireless markup languages. Other wireless messaging techniques
such as Short Messaging Service (SMS) are also covered.
• Web Services
The last application discussed in this book is the emerging notion
of "Web Services". Although all the applications discussed in the
earlier chapters are services on the web, the term "web services" is
used to refer to a more generalized and formalized notion of
providing services on the Web. In the web services framework, the
main objectives are remote execution, platform independence and
ease of integration. In order to cater to such requirements, the web
services architecture utilizes the following abstraction layers: service
definition, service discovery, transport layer and the execution
environment. Universal Description, Discovery and Integration
(UDDI) registry enables the registration of service descriptions.
Service definition is achieved using the Web Services Description
Language (WSDL). At the transport layer, protocols such as the Simple
Object Access Protocol (SOAP) are used to exchange information in a
The final chapter is the conclusion chapter that summarizes the topics
discussed in this book, and highlights some future directions in web
technology. The Appendix lists useful information for the actual
implementation of the exercises and projects. The glossary is useful for
looking up a set of frequent acronyms used in this book.
Chapter 2
Data Markup
1. INTRODUCTION
Table 1. Layers of data abstraction

Abstraction Layer (XML)
Coding Layer (ASCII/Unicode)
Bytes
Bits
Signals
2. DATA MARKUP
[Figure: Structure and presentation as two aspects of information.]
Albert Tan M.D. lives at 1234 Main Street, New York, New York.
The residential address can be further decomposed into street, city, and
state.
Residential Address:
Street: 1234 Main Street
City: New York
State: New York
Next, let us consider the rendering of this information. One format may
be to highlight the name in bold font, and underline the address.
Albert Tan M.D. lives at 1234 Main Street, New York, New York. [In the rendered version, the name appears in bold and the address is underlined.]
[Figure: SGML and its derivative markup languages: XML (data structure), HTML (display), RTF (printers).]
SGML introduced the notion of "start" and "end" tags. For instance, a
particular structural component (e.g. title) is specified by embedding it in
angled brackets as <title>. All text following this start tag will be classified
as a part of that structural component. The end of this component is denoted
by embedding the tag name between "</" and ">", as in </title>. In the example
below, the title is specified by the text "Introduction to Information Theory":

<title> Introduction to Information Theory </title>
3.2 Tags
One can notice many aspects in the above XML format of the
information. Each element is enclosed in "<" and ">" symbols. A start
tag is written by enclosing the tag name in "<" and ">". An end tag is
written by enclosing the tag name in "</" and ">".
Information is specified between the relevant start and end tags.
<Address>
<Street> 1234, Main Street </Street>
<City> New York </City>
<State> New York </State>
</Address>
XML allows the definition of such hierarchies, and ensures that the
hierarchy is adhered to.
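The hierarchy above can be checked programmatically. The following is a minimal sketch, assuming Python's standard xml.etree library (an illustration, not part of the book's own toolchain): it parses the Address example and walks its children.

```python
# Parse the Address example and walk its hierarchy.
import xml.etree.ElementTree as ET

doc = """
<Address>
  <Street> 1234, Main Street </Street>
  <City> New York </City>
  <State> New York </State>
</Address>
"""

root = ET.fromstring(doc)
# The parser enforces nesting: Street, City, State are children of Address.
for child in root:
    print(child.tag, child.text.strip())
```

Running this prints each child tag with its text, confirming that the parser has recovered the intended hierarchy.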
Until now, we have defined the notion of hierarchy, tags, and how XML
documents adhere to the structure imposed on them. But how does one
impose or define the XML structure? In order to define the structure of the
XML documents, the Document Type Definition must be defined and
specified for each XML document. An alternative specification is the
Schema that is discussed in a later sub-section.
<?xml version="l.O"?>
<!-Document Definition Starts here -->
<!DOCTYPE EmployeeRecord [
<!ELEMENT EmployeeRecord (Name I Degree I Address)+>
<!ELEMENT Name (#PCDATA»
<!ELEMENT Degree (#PCDATA»
<!ELEMENT Address (#PCDATA»
]>
<!-Document Definition Ends here -->
The first line of the XML document is the XML processing instruction
that specifies the version of XML used. Additionally, other information such
as the character encoding used in the document may be specified. All
declarations start with <! and end with >. The DOCTYPE declaration allows the
document definition to be embedded within the XML document. In this
example, the document is defined to contain the tags EmployeeRecord that
contains the tags Name, Degree, and Address. Following the document
definition, the actual XML document content follows.
Following the DTD, the actual XML content follows, starting with the
root start tag. In this case, the root element is the "EmployeeRecord". Within
the EmployeeRecord start and end tags, the Name, Degree and Address
elements are embedded. Note that XML is case sensitive, and so one cannot
mix lower-case and upper-case start and end tags. Furthermore, some special
characters need to be escaped: < becomes &lt;, > becomes &gt;, & becomes &amp;, and
so on. Comments can be embedded within <!-- and --> as shown in the
example.
<?xml version="l.O"?>
<!DOCTYPE SYSTEM "ER.dtd">
<EmployeeRecord>
<Name> Albert Tan <!Name>
<Degree> M.D. </Degree>
<Address> 1234, Main Street, New York, New York </Address>
</EmployeeRecord>
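The document above can also be read programmatically. A minimal sketch, again assuming Python's standard xml.etree library; note that this library only checks well-formedness and does not validate against the DTD.

```python
# Read the EmployeeRecord document and extract its fields.
# xml.etree does NOT validate against ER.dtd; it only checks
# that the document is well-formed.
import xml.etree.ElementTree as ET

doc = """<?xml version="1.0"?>
<EmployeeRecord>
  <Name> Albert Tan </Name>
  <Degree> M.D. </Degree>
  <Address> 1234, Main Street, New York, New York </Address>
</EmployeeRecord>"""

record = ET.fromstring(doc)
name = record.find("Name").text.strip()
degree = record.find("Degree").text.strip()
print(name, degree)  # Albert Tan M.D.
```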
refers to Parsed Character Data, and does not contain unexpanded markup or
entity references (see next subsection). In our example, the tag
EmployeeRecord is defined to contain (Name | Degree | Address)+. This is
a regular expression that allows one or more tags (of the type Name or
Degree or Address) to be embedded within the EmployeeRecord tag. Other
content types include ANY and EMPTY. ANY type allows content that is
composed of any mixture of elements in the DTD or character data. EMPTY
refers to tags that do not have any content associated with them.
<Student/>
or
<Student> </Student>
3.4.2 Attributes
Attribute type ID restricts the value of the attribute to be unique for each
instance of the element. IDREF allows the attribute to refer to a previously
defined ID (although just one), and IDREFS allows multiple references to
previously defined attribute IDs. An example of the use of ID is for
identification purposes:
In the above examples, you will note that the last value is always
"#REQUIRED". This position holds the default declaration of the attribute.
#REQUIRED specifies that the value of the attribute must be explicitly
specified (i.e. must not be absent). Attribute values can be implied, in
which case the parser decides what value is the default, and the attribute
may be omitted in the XML document. Implied attributes are specified using
the #IMPLIED directive. Constant attribute defaults are specified using the
#FIXED directive.
Finally, a literal default value may be written directly in the DTD, in which
case the parser supplies that value whenever the attribute is absent.
<Name>
<![CDATA[
Laurel & Hardy
]]>
</Name>
In the above example, the text "Laurel & Hardy" is not processed by the
XML parser (thus the & need not be escaped as &amp;). Other uses of
CDATA sections are to embed programs (e.g. scripting) that allow
more programmatic control of the XML data.
3.4.4 Entities
DTDs are derived from SGML, and have several shortcomings. Firstly,
DTD syntax inherits a somewhat obscure use of special characters, such as the #
used in defining the document structure. SGML was mainly developed for
text markup, and thus DTDs also cater to text markup applications. In
particular, DTDs are not easily applicable to XML application areas such as
object serialization, databases, and inter-process communication. Attribute
typing in DTDs is also very restricted. Schemas were defined to address the
shortcomings of DTDs, and the schema itself is in XML.
The first thing to note is that the schema is itself in XML, thus obviating
the need for another parser and system for interpreting DTD
syntax. The other important aspect of schemas is strong typing. In DTDs,
typing is restricted to EMPTY, ANY, element content, and mixed element
content and text. It is often useful to add other restrictions on the typing,
such as enforcing integer constraints or limiting the string length. XML Schema allows
such generalized type specifications as shown in Table 6:
<xsd:complexType name="FullName">
<xsd:sequence>
<xsd:element name="FirstName" type="xsd:string"/>
<xsd:element name="LastName" type="xsd:string"/>
</xsd: sequence>
</xsd: complexType>
<xsd:complexType name="EmployeeName">
<xsd:complexContent>
<xsd:extension base="ipo:FullName">
<xsd:sequence>
<xsd:element name="MiddleName" type="string"/>
</xsd:sequence>
</xsd:extension>
</xsd:complexContent>
</xsd: complexType>
XML Schemas also support modularity in the sense that the XML
schema definition can control whether this archetype can be 'exported' to
other schemas. In the previous example, we have derived EmployeeName
from FullName and added the element MiddleName. Thus, we can see how
there is an "object-oriented" ability to "inherit" structural representations.
XML schemas also allow the grouping of attributes.
structures. An example where the element Age is defined and referenced for
inclusion in the element CustomerRecord is shown below in Table 8:
<EmployeeRecord>
<Address>
1234, Main street
</Address>
</EmployeeRecord>
The above XML snippet is well-formed since the start and end tags
match and they are properly nested. An example of an improperly nested
XML is shown below:
<EmployeeRecord>
<Address>
1234, Main Street
</EmployeeRecord>
</Address>
Another case is one where the start and end tags are just mis-matched:
<EmployeeRecord>
</Address>
The following is also not well-formed since XML tag names are case
sensitive:
<ADDRESS>
</address>
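These well-formedness rules are exactly what an XML parser enforces. A short sketch, assuming Python's standard xml.etree library: properly nested documents parse, while improperly nested or case-mismatched ones raise an error.

```python
# Well-formedness checking: the parser accepts the properly nested
# example and rejects the two ill-formed ones shown above.
import xml.etree.ElementTree as ET

def is_well_formed(text):
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

good = "<EmployeeRecord><Address>1234, Main Street</Address></EmployeeRecord>"
bad_nesting = "<EmployeeRecord><Address>1234</EmployeeRecord></Address>"
bad_case = "<ADDRESS></address>"  # tag names are case sensitive

print(is_well_formed(good))        # True
print(is_well_formed(bad_nesting)) # False
print(is_well_formed(bad_case))    # False
```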
What happens when we merge two XML documents whose elements are
named the same? XML namespaces are defined to address this problem of
replicated naming in XML documents. For instance, the element 'table'
might have elements 'length' and 'width' embedded within for a furniture
XML Schema, whereas the element 'table' might have elements 'stock' and
'value' embedded within for a financial data XML schema. If we merge the
two XML documents into one, then how do we parse, and validate the
merged document? We can associate namespaces for each individual
element. For instance, for the furniture table element, we can have:
<table xmlns="http://www.example.com/furniture">
<length> 4 feet </length>
<width> 5 feet </width>
</table>
<table xmlns="http://www.example.com/finance">
<stock> IBM </stock>
<value> 150 </value>
</table>
Note that both are elements "table", but the namespace associated with
each element distinguishes the element definition. An alternate
representation is to prefix the name space as follows:
<f:table xmlns:f="http://www.example.com/finance">
<f:stock> IBM </f:stock>
<f:value> 150 </f:value>
</f:table>
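A namespace-aware parser resolves each tag to a pair of namespace URI and local name, which is how the two "table" elements are kept distinct. A sketch, again assuming Python's xml.etree library:

```python
# The parser expands each tag to {namespace-URI}local-name,
# so the furniture table and finance table no longer collide.
import xml.etree.ElementTree as ET

furniture = ET.fromstring(
    '<table xmlns="http://www.example.com/furniture">'
    '<length> 4 feet </length><width> 5 feet </width></table>')
finance = ET.fromstring(
    '<f:table xmlns:f="http://www.example.com/finance">'
    '<f:stock> IBM </f:stock><f:value> 150 </f:value></f:table>')

print(furniture.tag)  # {http://www.example.com/furniture}table
print(finance.tag)    # {http://www.example.com/finance}table
```

Note that the prefix f: is only a local shorthand; after parsing, only the namespace URI matters.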
<EmployeeRecord>
<Name> <Bold> Albert Tan </Bold> </Name>
</EmployeeRecord>
The display software (e.g. editor, browser) interprets the above tags and
renders them appropriately. Thus, in the above example, the name "Albert
Tan" will automatically be presented in bold since it is embedded in the
"Bold" start and end tags. If another device or viewing software interprets a
different set of tags, then the same EmployeeRecord XML document can be
converted appropriately.
Can we take the original XML EmployeeRecord document and apply the
rules of rendering to automatically generate the XML representation with the
font directives? That's exactly what the eXtensible Stylesheet Language
(XSL) enables us to do. XSL is a language defined in XML that allows the
description of how XML documents can be transformed from one definition
to another.
[Figure: One XML document (<Name> Albert </Name>) transformed by two different stylesheets, Style Sheet A and Style Sheet B, into two differently rendered versions of "Albert".]
<EmployeeRecord>
<Name> Albert Tan </Name>
<Degree> M.D. </Degree>
<Address> 1234, Main Street, New York, New York </Address>
</EmployeeRecord>
</xsl:for-each>
</xsl:stylesheet>
<!-- Ending stylesheet tag -->
Next, the xsl:for-each element allows looping over all the selected
"EmployeeRecord" elements in the XML document to which this stylesheet
is applied. Note that the style specification is embedded in this stylesheet
itself: the <Bold> and </Bold> tags are the tags of the XML document that
we want the XML in Table 9 to be transformed into. Lastly, embedded within
the <Bold> and </Bold> tags is the selection of the actual value of the
Name element of the selected EmployeeRecord.
Table 11. Result XML document when stylesheet in Table 10 is applied to XML in Table 9.
<Bold>
Albert Tan
</Bold>
Another way of looking at the source XML document in Table 9 and the
resultant XML document in Table 11 is shown in Figure 6. The source XML
document has a root element EmployeeRecord which can contain the Name
and Address element blocks. Contained within the Name elements is the
PCDATA "Albert Tan". The transformed XML document has a root element
"Bold" which contains the PCDATA that is contained under the Name
element in the original XML document, "Albert Tan" in this case. It is
important to observe that the interpretation or rendering of XML elements
themselves are inconsequential to the transformation process. If another
XML language is required to present the document in another style, then
another stylesheet can be applied to the original XML document to derive
the new transformed XML document.
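The Python standard library has no XSLT engine, so the following is only an illustration of the effect of the Table 10 stylesheet, hand-coded rather than driven by XSL; a real processor (for instance the third-party lxml library) would read the stylesheet itself and perform the same transformation.

```python
# Illustration only: mimic what the stylesheet does by hand --
# select EmployeeRecord/Name and wrap its text in Bold tags.
import xml.etree.ElementTree as ET

source = """<EmployeeRecord>
  <Name> Albert Tan </Name>
  <Degree> M.D. </Degree>
  <Address> 1234, Main Street, New York, New York </Address>
</EmployeeRecord>"""

record = ET.fromstring(source)
bold = ET.Element("Bold")
bold.text = record.find("Name").text.strip()
result = ET.tostring(bold, encoding="unicode")
print(result)  # <Bold>Albert Tan</Bold>
```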
[Figure 6. Tree views of the source document (root element EmployeeRecord containing a Name element with the PCDATA "Albert Tan") and the transformed document (root element Bold containing "Albert Tan").]
The transformation from the source to the output XML can be done using
XSL. Since XSL is a stylesheet language that is itself defined in XML, the
transformation can be specified using the elements and attributes that are
allowed in XSL. In this section, the elements of XSL version 1.0 are
outlined.
• xsl:template
Any XML document can be represented in the form of a tree. Each node
in the tree can be specified using a set of conditions that identify those
nodes. For example, a simple pattern is to match any element using the
pattern expression "*" as in:
<xsl:template match="*">
</xsl: template>
• xsl:apply-templates
The required template can be applied to all the children of the selected
elements using the xsl:apply-templates. The xsl:apply-templates has
attributes 'select' and 'mode'. If the 'select' attribute is not specified then all
the elements of the current node are processed.
<xsl:template match="EmployeeRecordlName">
<Bold>
<xsl:apply-templates/>
</Bold>
</xsl: template>
<xsl:template match="EmployeeRecordlName">
<Bold>
<xsl:apply-templates select="FirstName"/>
</Bold>
</xsl: template>
• xsl:value-of
table 7, the xsl:value-of tag is used to extract the data in the Name tag, and
this data is embedded within the bold tags in the template.
<xsl:value-of select="Name"/>
• xsl:key
• xsl:for-each
Looping over nodes that meet a criterion can be done using the xsl:for-
each XSL tag. The criterion for the nodes is specified by the expression
in the select attribute of the xsl:for-each tag.

<xsl:for-each select="EmployeeRecord/Name">
<Bold>
<xsl:apply-templates/>
</Bold>
</xsl: for-each>
The xsl:for-each in the above example enables looping over the nodes of
the XML tree that match the select criteria, i.e. are 'Name' elements that are
direct children of 'EmployeeRecord' elements.
• xsl:if
Conditional logic can be applied using the xsl:if element. The condition
to be checked is specified in the 'test' attribute of the xsl:if element.
• xsl:choose
<xsl:choose>
<xsl:when test="conditionl">
</xsl:when>
<xsl:when test="condition2">
</xsl:when>
<xsl:otherwise>
</xsl:otherwise>
</xsl:choose>
• xsl:sort
<xsl:template match="/">
<xsl:apply-templates select="EmployeeRecord">
<xsl:sort select="Name"/>
</xsl:apply-templates>
</xsl:template>
• xsl:import
A XSL stylesheet can import another XSL stylesheet using the xsl:import
element. The xsl:import element has a href attribute that specifies the uri of
the stylesheet to import into the current stylesheet. Elements contained in the
xsl:stylesheet need not be in any order, except for the xsl:import which must
precede all other elements under the xsl:stylesheet element.
• xsl:include
It is also possible to include external stylesheets using the xsl:include
element. The href attribute of the xsl:include element specifies the document of the
external stylesheet to include. One may recall that xsl:import is also a
mechanism for importing external stylesheets into the including stylesheet.
The difference between include and import is as follows: when a stylesheet
is included, the elements belonging to its xsl:stylesheet are treated as if they
appeared directly in the including stylesheet. Stylesheets that are brought in
with xsl:import, in contrast, are given lower precedence than the rules of the
importing stylesheet.
• xsl:strip-space
xsl:strip-space enables the specification of those elements for which
whitespace needs to be stripped. The element list is specified in the attribute
"elements". Example:

<xsl:strip-space elements="Name"/>

This will strip the whitespace in the PCDATA sections of the element
"Name".
• xsl:preserve-space
xsl:preserve-space is the opposite of xsl:strip-space. xsl:preserve-space
allows the specification of those elements for which the whitespace should
be preserved. The element list is specified in the attribute "elements".
Example:
<xsl:preserve-space elements="Name"/>
• xsl:output
With the xsl:output tag, the format of the transformed XML result is
specified. Attributes of xsl:output include method, version, encoding, omit-
xml-declaration, standalone, doctype-public, doctype-system, cdata-section-
elements, indent and media-type.
For instance, the output method can be "xml", "html" or "text". The default
output method is "xml", unless the root node of the resultant document has
an element child whose expanded name is "html" and any text preceding
that first element child consists only of whitespace, in which case the
default method is "html".
<xsl:output cdata-section-elements="Name"/>
• xsl:variable
<xsl:variable name="test"/>
• xsl:element
• xsl:attribute
<xsl:element name="test">
<xsl:attribute name="a"> abc </xsl:attribute>
</xsl:element>
• xsl:text
Text nodes can be added using the xsl:text element. This allows the
specification of PCDATA to be enclosed within the xsl:text element.
<xsl:text>
Laurel &amp; Hardy
</xsl:text>
• xsl:processing-instruction
<?xml-stylesheet href="test.css"?>
5. XPATH
In XSL, the context of the path starts with the document currently being
processed. Thus "/" indicates the root node of the document. Some examples
of path expressions are summarized in Table 12.
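Python's xml.etree library supports a limited subset of XPath, which is enough to illustrate a few path expressions; the element names below are taken from the earlier examples and the document is an assumption of this sketch.

```python
# A few path expressions evaluated with xml.etree's limited XPath
# support: "/" separates child steps, ".//" descends, "*" matches any.
import xml.etree.ElementTree as ET

doc = ET.fromstring(
    "<EmployeeRecord><Name><FirstName>Albert</FirstName>"
    "<LastName>Tan</LastName></Name></EmployeeRecord>")

# Child step: FirstName under Name under the context node.
print(doc.find("Name/FirstName").text)          # Albert
# Descendant step: any FirstName anywhere below the context node.
print(doc.find(".//FirstName").text)            # Albert
# Wildcard: all children of Name, whatever their names.
print([e.tag for e in doc.findall("Name/*")])   # ['FirstName', 'LastName']
```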
The text "current issue of Garden Magazine" is anchor text that specifies
a transition to the current (hypertext) issue of Garden Magazine. This is
displayed to the user by highlighting in some distinguishing manner, such as
underlining as shown above. When the user clicks on this text, the next
document is automatically retrieved and displayed to the user by the
browser. This is specified using the anchor tag, which uses the name "A" for
Anchor. For instance, the example shown above can be written using the
anchor tag as follows:
• HTML tag
The root HTML tag is <html>; thus every HTML document must be
embedded within the <html> and </html> tags.
The HTML document can be divided into the head and body parts. The
HEAD tag defines the head portion of the document and includes elements
such as TITLE (to specify the document title), BASE (URL base for
resolving relative addresses), SCRIPT (used for scripting), and other
information such as meta information used for purposes such as
search engine indexing. The second part is the body of the HTML page and
is embedded within the BODY tags. An example of the HEAD and BODY
tags is shown below:
<html>
<head>
<title> Example HTML Page </title>
<meta name="description" content="This is a test page">
</head>
<body>
Hello World
</body>
</html>
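HTML in the wild is often not well-formed XML, so browsers and tools use tolerant HTML parsers. As an illustration, a sketch using Python's standard html.parser to pull the title out of a page like the example above (the page string is an assumption of this sketch):

```python
# Extract the TITLE text from an HTML page using the standard,
# tolerant html.parser (an XML parser would be too strict for HTML).
from html.parser import HTMLParser

page = """<html><head><title> Example HTML Page </title>
<meta name="description" content="This is a test page">
</head><body>Hello World</body></html>"""

class TitleGrabber(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""
    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True
    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False
    def handle_data(self, data):
        if self.in_title:
            self.title += data

p = TitleGrabber()
p.feed(page)
print(p.title.strip())  # Example HTML Page
```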
paragraph can be specified using the ALIGN attribute for the paragraph
element (left, right, or center):
<body>
<p align=center> Hello World
</body>
Breaks may be added into the document by using the <br> tag. Heading
elements are specified by using the H tag: h1 refers to heading level 1, h2 to
heading level 2 and so on. Headings are automatically rendered in
appropriate default fonts and sizes, but can be modified using style sheets
(see later part). Example of two headings:

<h1> Hello World </h1>
<h2> Hello World </h2>
Text-level elements such as FONT allow more control over the font
size, color and typeface of the text. Style sheets (discussed later) are
preferable to explicitly controlling text fonts and sizes in the document.
BASEFONT enables setting a particular base font size, and the FONT tag can be
used to set a particular font, size and color. For example:

<font size="5" color="blue"> Hello World </font>
• Links
In our earlier example, we showed how the anchor tag (A) is used to
associate portions of text with specific URL transitions. The HREF attribute
refers to the destination document. Example of anchoring text:
The text which is embedded inside the A start and end tags has special
significance: if that text is clicked, then the browser transitions to the
page specified by the HREF attribute value, in this case http://www.w3c.org.
The embedded text is generally underlined or highlighted in some manner
in order to distinguish the anchor text from other text. The hypertext link can
also refer to points within the same HTML document. This is achieved by
having labelled entry points in the HTML page, and referencing those label
points in the HREF attribute value, as shown below:
<a name="section1"> Section 1 </a>
<!-- elsewhere in the same document: -->
<a href="#section1"> Go to Section 1 </a>
Another important concept is the notion of base URLs and relative URLs.
This is covered in later chapters, but it is meaningful to introduce the
concepts here since they are relevant to the development of HTML pages. The
base URL refers to the base address name. Thus "http://www.example.com/" is
the base URL when a page such as "http://www.example.com/home.html" is
served by a machine on the network. If we want to have a link that gets the
page next.html from the same machine, then instead of specifying the whole
path "http://www.example.com/next.html", we can just specify the relative
difference in the address, i.e. "next.html". This concept of the naming of a
URL and what it means will become clearer in the next chapter.
46 Foundations of Web Technology
• Tables
The TABLE element enables the creation of tables in HTML pages. Rows
are specified using the table row <TR> tag, data within each row are
specified by the table data <TD> tags, and table headers are specified
using the <TH> tags. Text within each table data cell can be aligned,
emphasized, and formatted by embedding it in appropriate elements. The
TABLE element has attributes such as FRAME, BORDER, BORDERCOLOR
and so on, that enable formatting of the table. An example of a table in
HTML is shown below:
<table>
<thead>
<tr> <th> Employee ID </th> <th> Gender </th> </tr>
</thead>
<tbody>
<tr> <td> 123456 </td> <td> Male </td> </tr>
<tr> <td> 99999 </td> <td> Female </td> </tr>
</tbody>
</table>
• Images
<body>
<p> Enjoy the following picture <br>
<img src="dog.gif">
</body>
Ramesh R. Sarukkai 47
In this example, the image referred to by the name "dog.gif" is retrieved and
displayed following the text "Enjoy the following picture". Images are
inlined when embedded using the IMG tag. Other attributes that control the
position, size and border of the images include the ALIGN (to control
alignment of text with respect to the image), WIDTH, HEIGHT (to specify
the size of the image), and BORDER (to specify border of image) attributes.
• Forms
Forms allow the collection of input from the user and enable the
submittal of that input back to the Web server. An example of a simple form
is shown below:
<form action="processrequest">
<p> phone number:
<input name="phone" size="10">
<p> <input type=submit>
</form>
Figure 7. Example of HTML form output: a "phone number" text field followed
by a Submit button.
When the user enters the 10-digit phone number and clicks on the submit
button, the data is submitted back to the server, along with the
"processrequest" action being requested. The INPUT tag specifies various
types of input such as text fields, checkboxes (where you can check or
uncheck an input selection), and radio buttons (where only one can be
selected). The SUBMIT type generates a submit button that can be pressed
to submit the form data to the server. The METHOD attribute of the form
indicates which HTTP method to use for the submission (such as GET and
POST, discussed in the next chapter). Other features of form fields include
hidden fields, which are not displayed to the user, and password fields,
where the characters are replaced by * when echoed back to the user.
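To make the submission concrete, the following Python sketch (an illustration, not part of the original text; the host name and field value are made up) shows the URL a browser would construct when the form above is submitted with the GET method:

```python
from urllib.parse import urlencode

# Form field values as the browser would collect them (illustrative data)
fields = {"phone": "4085551212"}

# With the GET method, the browser appends the encoded fields to the
# action URL as a query string
query = urlencode(fields)
url = "http://www.example.com/processrequest?" + query
print(url)  # http://www.example.com/processrequest?phone=4085551212
```

With the POST method, the same encoded string would instead be carried in the body of the request.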
In the earlier section, we discussed XSL in detail, and mentioned that the
styling of XML elements can be controlled using style sheets. Cascading
style sheets (CSS) were originally introduced to simplify the styling of
HTML documents. In an earlier example on controlling text fonts, we
illustrated how text styles can be controlled individually. It is often
useful to separate the presentation of the text data from the actual
content, as discussed in earlier sections. Style sheets achieve this
separation by allowing the association of specific text styles with
individual element tags. For instance, if we want to associate the header
H1 with rendering using a font size of 20 and a color of black, then we can
add the following style into the HTML document:
<style type="text/css">
h1 { color: black; font-size: 20px; font-weight: bold; }
</style>
The HTML page can then use the <H1> tag without embedding any font or style
specifications. Style sheets can also be included from external sources by
using the LINK tag:
<link rel="stylesheet" type="text/css" href="style.css">
XHTML is a standard developed by the W3C, and specifies an XML version of
HTML. Standard HTML can be cleaned and converted into XHTML using tools
such as Raggett's HTML Tidy. The key enhancements in XHTML are those
present in XML: the notion of well-formed documents, enforcement of
end-tags for non-empty elements, lower-casing of tags and attributes, the
requirement that attribute values be quoted, the requirement that empty
elements have a start and end tag or a trailing />, and so on.
6.6 Scripting
In HTML, the SCRIPT tag indicates to the browser that the embedded
content needs to be executed. An example of a scripting language is
JavaScript, as shown below:
<html>
<head>
<script language="JavaScript">
<!-- Example script
document.write("<BODY> <H1> Welcome to the test page </H1> </BODY>");
-->
</script>
</head>
</html>
Once the browser processes this page, it executes the script on the client
side, and generates the page:
<html>
<head> </head>
<body> <h1> Welcome to the test page </h1> </body>
</html>
50 Foundations of Web Technology
Scripting is especially useful for checking form input validity, and for
event handling such as form submission, movement of the mouse over a
button, or the click of a button.
7. CONCLUSION
In this chapter, we have motivated the need for data representation and
transformation. This enables the separation of document content from
presentation. XML defines a standard framework with which general data
markup languages can be defined. The XSL transformation provides a
general approach for the transformation of documents from one XML
representation to another. The Hyper Text Markup Language is a text
markup language that allows transitioning from one document to another
through links. The key aspect of XML is the structured representation of
documents using Document Type Definition or Schema, which can then be
used to validate the XML document. Key elements of HTML are discussed
with examples. Data markup is an integral part of the Web since a vast
amount of information needs to be communicated across the Web from
business to consumer or business to business.
FURTHER READING
EXERCISES
8. With examples, demonstrate how a style sheet can transform the data
specified by a DTD in (5) to data that is in the XML form that is defined in
(7).
9. Write a HTML page that displays the data available for the online vendor
in (7).
Chapter 3
Networking
To the blind pen the hand that writes is unreal, its writing unmeaning
Abstract: What are the underlying protocols used by the Internet? The Internet Protocol
addressing scheme for representing endpoints is discussed, along with the
TCP/IP protocol suite. The purpose of Domain Name Servers, and the
notions of URL/URI are presented. The notion of Web servers, Web proxies,
clients, and how they communicate with each other using the HyperText
Transfer Protocol is discussed. Secure communication is an important
aspect of communication on the Web, and Secure Sockets Layer, Public Key
Infrastructure, and certificate authority notions are illustrated.
Keywords: Internet Protocol, IP, DNS, TCP/IP, URI, URL, HTTP, Web servers, clients,
Web proxy, SSL, HTTPS, digital certificates, Public Key Infrastructure (PKI).
1. INTRODUCTION
The Internet was born out of the ARPANET project funded by the U.S.
Department of Defense to study the inter-linking of packet-switched data
technologies across a network of computers. This project was initiated in
1973, and the research from the ARPANET evolved into today's Internet. Over
the last three decades, many technologies and standards have made the
Internet possible, such as the Transmission Control and Internet Protocols
(TCP/IP). The power of the Internet was the ability to connect two
different computers on a distributed network. Soon applications such as the
File Transfer Protocol (FTP) and mail emerged and were adopted quickly. The
flexibility of the Internet was the ability to develop arbitrary
communication applications, which led to the emergence of the World Wide
Web. The Internet Protocol (IP) names endpoints with four-byte location
identifiers. The Domain Name Server architecture enabled referencing
endpoints using an alphanumeric naming scheme.
2. LAYERING OF NETWORKS
• Layer 1: Physical
• Layer 2: Data Link
• Layer 3: Network
• Layer 4: Transport
• Layer 5: Session
• Layer 6: Presentation
• Layer 7: Application
The physical layer defines the cable or physical medium of transport. The
data link layer represents the actual format of the data on the network, such
as inclusion of checksum of payload data being transmitted, and source and
destination addresses. The physical and logical transport to the destination is
handled in the data link layer using a network interface. The Network layer is
another layer of abstraction above the data link layer, and protocols such as
Internet Protocol (IP) are popular standards used in this layer (discussed later
in this chapter). The transport layer maps the user-buffer to be transported
into "data packets" of network size. Additional transmission control logic is
implemented in this layer, and examples of protocols (such as TCP and UDP)
are also discussed in a later section. The data format sent over the connection
is specified in the session layer. Conversion of data from local to external
data formats is achieved in the presentation layer. Lastly, the application
layer provides services (such as mail access, file transfer) to end-users.
3. LOCATING ENDPOINTS
3.1 IP
Transmission of data at the network layer using the IP protocol requires
the packaging of data into "IP packets" or "IP datagrams". The IP datagram
consists of a 20-byte basic
header followed by additional optional information (options), and the actual
data that is to be transmitted. This is summarized in Table 14. The 8-bit
protocol field specifies the protocol: a value of 17 is UDP, value of 6 is TCP
etc. The header checksum is a checksum computed on the header only. The
source and destination IP fields specify the source and destination addresses.
Table 14 IP Datagram
| Version | Header Length | Type of Service | Total length of packet (in bytes) |
It has been predicted that the Internet will run out of the IP addresses
needed by more computers in the next few years. In order to overcome this
limitation, new standards of IP addressing are being defined under the IP
next generation (IPng) initiative. The resulting standard is IPv6, which
allows 128-bit IP addresses, an address length four times that of the
4-byte address space. In addition to increased address space, IPv6 includes
additional security by providing packet encryption and source
authentication. IPv6 also supports auto-configuration modes, and is
designed for more real-time applications such as video streaming.
3.2 DNS
The top-level domains of DNS include com, edu, gov, net and org. Each
of these top-level domains is dedicated to specific organizations: edu
covers educational institutions, gov covers U.S. governmental agencies, and
com covers commercial organizations. The top-level domains are followed by
secondary and higher-level domains. The domain name is written from the
most specific subdomain up to the top-level domain, for instance,
cs.rochester.edu. Domain names that end with a period are called Fully
Qualified Domain Names.
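As a quick illustration (a Python sketch, not part of the original text), the system resolver can be asked to translate a domain name into its IP address:

```python
import socket

# Ask the system resolver to translate a host name into a dotted-quad
# IPv4 address; "localhost" resolves without contacting a remote DNS server
ip = socket.gethostbyname("localhost")
print(ip)  # typically 127.0.0.1
```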
3.3 URL/URI
Ports are numbers that enable different applications on the same computer
to communicate separately. Thus, ports can be viewed as entry points for
logical connections. Port numbers range from 0 to 65535, and are divided
into three types: well-known ports (0-1023), registered ports
(1024-49151), and dynamic or private ports (49152-65535).
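The host and port components of a URL can be picked apart programmatically, as the following Python sketch (illustrative, with a hypothetical URL) shows:

```python
from urllib.parse import urlsplit

# Split a URL into its components; the port appears after the host,
# separated by a colon
parts = urlsplit("http://www.example.com:8080/home.html")
print(parts.scheme)    # http
print(parts.hostname)  # www.example.com
print(parts.port)      # 8080 (None when omitted; HTTP then defaults to 80)
```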
4. TRANSMISSION PROTOCOLS
At the application layer, the TCP/IP suite has utilities such as telnet, ftp
(file transfer protocol), SMTP (Simple Mail Transfer Protocol) for e-mail and
so on. For transport, UDP and TCP are two methods of transmitting data. At
the network layer, IP is used commonly, while other protocols such as
Internet Control Message Protocol (ICMP) and IGMP (Internet Group
Management Protocol) are also defined. At the link layer, protocols such as
ARP (Address Resolution Protocol) are defined to map IP addresses to the
data link layer.
4.2 TCP
TCP is layered on top of IP, so the TCP header and data information are
embedded within IP packets.
Each TCP segment contains the 16-bit source and destination port
numbers. Sequence numbers are also transmitted in every packet. At the
onset of a new connection, the SYN flag is set to 1 and the sequence number
field is used as the starting sequence number. On following transmissions,
the sequence number represents the byte number in the transmission of data
for this connection. Since TCP is full-duplex, the numbering is different for
the simultaneous flow in each direction of the connection. The
acknowledgement number field is used to specify the next sequence number
that the receiver expects to see.
a) Retransmission delay - The amount of time that the sender waits for
acknowledgement before retransmitting the packet. If this threshold
of time has passed without an acknowledgement, then the packet is
assumed to be lost. However, the delays over the Internet are
unpredictable, and this delay could vary for different subnetworks.
b) Socket interval - Sockets are typically unusable for a period of time
after they have been closed. Setting this interval (also termed the
TIME_WAIT interval) controls how long the closed socket remains reserved.
Other factors that influence TCP performance are maximum segment size
that can be transmitted, and the keepalive interval. The latter corresponds to
the maximum time that the two ends of a TCP connection should be
maintained even though there is no data flowing between them. The
Maximum Segment Size (MSS) is the largest data size that the TCP segment
can send. As mentioned earlier, this can be configured as an option in the
first SYN TCP packet. Large MSS are generally better for large data
transmissions (unless there are fragmentation issues).
How does TCP do flow control? The key aspect of TCP is that the
receiver acknowledges receipt of a data packet to the sender. If the sender
fails to receive a threshold of packet acknowledgements from the receiver,
then the sender assumes failure and resends the data packets. While this is a
valid solution in some cases, it amplifies the problem in other situations. For
instance, if a link is congested, and the router drops packets, then the receiver
will not send acknowledgements for those dropped packets. The source in
turn will detect that the acknowledgements have not been received and
resend the data packets, making the congestion worse. Two variables, the
congestion window and the receiver window, are used in conjunction to
control the rate of transmission.
rate of transmission. A TCP must not send data with a sequence number
greater than the sum of the highest acknowledged sequence number and the
minimum of the congestion and receiver window variable values. Thus
packet drop can be used as an indicator of congestion, and the delay and
absence of ACKs can be used to actually reduce the number of data
transmissions to the receiver. However, such congestion control schemes
have drawbacks: estimating congestion by retransmission may not be
meaningful in certain cases such as wireless networks.
Let us summarize what TCP offers in addition to the IP layer. TCP directs
packets to different ports on the destination for different processes to use.
TCP verifies the validity of the data being transmitted by computing the
checksum on the message. TCP uses ACKs and re-transmit procedures to
detect possible delivery failures, and guarantee delivery of the packet, in
addition to sequential ordering of data segments. By "guaranteed", we do not
mean that using TCP will never fail. TCP can deliver the data only as long as
the connection holds up. For instance, in cases where clients are assigned
dynamic IP addresses, TCP can break down if the connection goes down,
even if the client immediately reconnects.
4.3 UDP
The User Datagram Protocol (UDP) is also built on top of IP at the transport
layer. However, unlike TCP, UDP is connectionless and does not support
reliable delivery of the packet from the source to the destination. Since UDP
does not have the overhead of reliable delivery that TCP has, UDP has better
performance. Typically, when UDP is used, the application layer is
responsible for reliable transmission of UDP packets. UDP is a datagram
oriented protocol. UDP applications send UDP packets to the destination,
and there is no acknowledgement at the transport layer. Thus, it is easy to see
why UDP is unreliable: a packet may be sent, but there is no way of knowing
if it was actually received, and whether it needs to be resent. UDP packets
are built on top of IP, each consisting of a UDP header followed by the UDP
data. The UDP checksum is computed over the UDP header and data, while the
IP header checksum covers only the IP header. The checksum is optional in
UDP, though its use is recommended.
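The checksum used by IP, TCP, and UDP is the 16-bit one's complement sum defined in RFC 1071; a minimal Python sketch (not from the book) is:

```python
def internet_checksum(data: bytes) -> int:
    """16-bit one's complement checksum used by IP, TCP and UDP (RFC 1071)."""
    if len(data) % 2:                    # pad odd-length data with a zero byte
        data += b"\x00"
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
        total = (total & 0xFFFF) + (total >> 16)   # fold the carry back in
    return ~total & 0xFFFF               # one's complement of the sum

# The worked example from RFC 1071 yields 0x220d
example = bytes([0x00, 0x01, 0xF2, 0x03, 0xF4, 0xF5, 0xF6, 0xF7])
print(hex(internet_checksum(example)))  # 0x220d
```

A useful property: recomputing the checksum over the data with the checksum appended yields zero, which is how receivers verify a packet.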
5. CLIENT/SERVER
5.1 Architecture
Figure: Two clients (Client 1, Client 2) communicating with a Web server.
The server typically listens for incoming requests at a port. A client that
needs information served by that server connects to that server, and requests
the appropriate information. This request/response protocol is generally
HTTP or HTTPS for the Web (these protocols are discussed in later
sections). The server processes the request, and returns the information back
to the requesting client. This server "processing" can include checking of a
database, contacting another server, and computation using algorithms to
provide desired functionality. For the Web, the server generally needs to
handle requests from a large number of simultaneous clients.
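The request/response cycle can be sketched in a few lines of Python (an illustration, not the book's code): a server listens at a port, a client connects, sends a request, and reads the response.

```python
import socket
import threading

def serve_once(server_sock):
    # Accept a single client, read its request, send a response, close
    conn, _ = server_sock.accept()
    request = conn.recv(1024)
    conn.sendall(b"Got: " + request)
    conn.close()

# The server listens at a port; port 0 asks the OS for any free port
server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("127.0.0.1", 0))
server.listen(1)
worker = threading.Thread(target=serve_once, args=(server,))
worker.start()

# The client connects to the server's endpoint and requests information
client = socket.create_connection(server.getsockname())
client.sendall(b"hello")
reply = client.recv(1024)
client.close()
worker.join()
server.close()
print(reply)  # b'Got: hello'
```

A production Web server differs mainly in scale: it accepts connections in a loop and hands each one to a pool of worker processes or threads.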
Figure: Client A sends a request to proxy Y; the proxy forwards the request
to the origin server X, and the response flows back through the proxy to
the client.
1 We use the terms "origin/original/source server" to refer to the server that is the origin for
the requested resource.
request and sends the result back to proxy Y. Proxy Y, in turn, returns this
response to client A. Clearly, nothing is gained if this is repeated for every
request, since ultimately every client request needs to be processed by the
original server. In practice, a number of the client requests can be stored or
cached for a certain period of time. This eschews the need to forward the
request to the original server every time, thus reducing the request load on
the original server. Thus, in the example above, if client A resends the same
request later to proxy Y, then proxy Y can check its local "cache", and return
the earlier saved response if it is still valid. This eliminates the need to contact
the original server for this client request. The price that one pays for using a
proxy is the added complexity involved in maintaining a copy of the data
and ensuring the validity of that data.
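The caching behaviour described above can be sketched as follows (illustrative Python; the fixed time-to-live policy and all names are assumptions — real proxies honour HTTP expiry headers):

```python
import time

class ProxyCache:
    """Toy response cache: serve a saved copy until it expires."""
    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds
        self.store = {}              # url -> (response, time stored)

    def get(self, url, fetch_from_origin):
        entry = self.store.get(url)
        if entry and time.time() - entry[1] < self.ttl:
            return entry[0]          # cache hit: origin server not contacted
        response = fetch_from_origin(url)   # cache miss: forward the request
        self.store[url] = (response, time.time())
        return response

# Usage: the second identical request is answered from the cache
calls = []
def origin(url):
    calls.append(url)
    return "<html>page</html>"

cache = ProxyCache(ttl_seconds=60)
cache.get("http://www.example.com/", origin)
cache.get("http://www.example.com/", origin)
print(len(calls))  # 1 -- the origin server was contacted only once
```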
Extensive research has been done on analysis of web server workloads (e.g.
[Barford & Crovella 1999]). In general, statistical properties of web client
workloads tend to exhibit high variability. Statistical properties include
properties such as file sizes, transfer times, and request inter-arrival times. A
commonly cited model to represent the relationship of documents and their
frequency of use is the Zipf distribution. Zipf's law [Zipf 1949; Mandelbrot
1983] was used to model the relationship between a word's popularity in
terms of its rank and its frequency of use.
Techniques based on Markov chains [Sarukkai 2000] applied to server client
request traces are discussed in a later chapter on Web mining.
In the Cache Array Routing Protocol (CARP), the URL space is divided
among an array of loosely coupled proxy servers. Hashing each URL to a
particular cache server in the cache array avoids redundant caching of
the same documents, and improves hit rates. The two components of CARP
are a proxy array membership table, and a hashing function plus a routing
algorithm for mapping the requested URL.
With millions of requests every second made to large scale web sites, how
do you distribute these requests across many servers and balance the request
load on each server? How do you ensure fault-tolerance by having replication
of servers in case of server failures? Proxies can alleviate the situation, but
are only a part of the solution. This is because a proxy only acts as an
intermediary between the client and the original server, but can be a point of
failure in itself. Furthermore, as mentioned earlier, proxies also have their
disadvantages such as added complexity, and increased latency (e.g. in case
of a cache miss; i.e. document not found or expired in proxy cache).
Even with mirroring, load balancing at a single point of entry into a group
of web servers is essential. Load balancing of web servers is typically done
in two methods: DNS round-robin, and hardware based routers/switches. In
the DNS round-robin approach, the domain name server responds to
translation requests with a rotation of IP addresses of different hosts in a
processes from the process pool. However, after the client request is
processed, the process is not destroyed. Rather, it is added back to the
pool of free processes in order to handle the next request. This approach
has the advantage of reusing processes while eliminating the cost
associated with repeated
process creation and destruction. Although the overhead of process pool
management is incurred, this has proven to be an effective practical solution.
FastCGI is an example of such an implementation. mod_perl is another
variant that embeds a Perl interpreter inside the web server in order to speed
up CGI processing.
The response is returned to the client, and the connection is closed. This
procedure of request and
response continues for different clients or repeatedly from the same client as
needed.
HTTP/1.0 200 OK
Date: Mon, 1 Jan 2002 23:59:59 GMT
Content-Type: text/html
Content-Length: 123
<HTML>
<BODY>
<H1> Welcome to my homepage </H1>
</BODY>
</HTML>
The GET command indicates that the client is requesting the URL
content from the server. Two other commands allowed in HTTP/1.0 are
POST and HEAD. POST is used when the client wants to submit data to the
server. HEAD is used when the client needs header information about the
URL resource. HEAD is useful when the client needs to verify whether its
copy of the requested resource is outdated. If the date of modification has
not changed since the resource was previously accessed and cached, then the
lightweight HEAD request saves the client from getting the whole document
again.
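The exchange can be made concrete with a small Python sketch (illustrative; the resource name and dates are made up) that builds GET and HEAD request lines and parses the status line and headers of a canned response:

```python
# Raw request lines a client would send for GET and HEAD (HTTP/1.0 style)
get_request = "GET /index.html HTTP/1.0\r\n\r\n"
head_request = "HEAD /index.html HTTP/1.0\r\n\r\n"

# A canned response; parse the status line and the Last-Modified header,
# which a client can compare against its cached copy
response = (
    "HTTP/1.0 200 OK\r\n"
    "Date: Mon, 1 Jan 2002 23:59:59 GMT\r\n"
    "Last-Modified: Sun, 31 Dec 2001 10:00:00 GMT\r\n"
    "\r\n"
)
status_line, _, rest = response.partition("\r\n")
version, code, reason = status_line.split(" ", 2)
headers = dict(
    line.split(": ", 1) for line in rest.split("\r\n") if ": " in line
)
print(code)                      # 200
print(headers["Last-Modified"])  # Sun, 31 Dec 2001 10:00:00 GMT
```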
The first line of the response contains the HTTP version and the status of
the request. For instance:
HTTP/1.0 200 OK
This is followed by the actual data or resource requested. Some of the
common returned status codes in HTTP are summarized in Table 18. In
general, 1xx codes are informational, 2xx indicate success, 3xx
redirection, 4xx client errors, and 5xx server errors.
The header fields supported by HTTP/1.1 are listed in Table 20. In the area
of security, HTTP/1.1 enhanced the challenge/key security mechanism
provided in HTTP/1.0 by adding digest authentication and proxy
authentication, so as to avoid transmission of the username and password in
the clear (discussed in a later section on web security). Lastly, HTTP/1.1
also improved on the content negotiation features provided in HTTP/1.0 by
providing server-driven and agent-driven negotiation mechanisms.
One approach could include passing around a hidden field variable that is
passed with every request back to the server. Another approach is to use
the IP address of the client. However, IP addresses can be spoofed,
multiple clients can share an IP address (e.g. through an Internet Service
Provider), or IP addresses can be dynamically assigned to a client. Yet
another solution is the client-side storage of "cookies". A cookie is just
a name-value pair that is restricted to a domain.
For instance, if a cookie called "mystate" is issued in the domain
".yahoo.com", then only servers in the .yahoo.com domain can set or modify
the cookie. The client software (such as a web browser) can submit such
cookies whenever a request to a server in that domain is sent. Since the
cookie is passed back and forth between the client and server, state
information can be maintained. The cookie can be set by sending the cookie
name and value as a part of the HTTP header. Cookies on the client side are
usually managed and stored on the local machine by the Web browser.
Whenever requests to that URL are sent from the client, the cookie is also
passed along in the HTTP header, and this can be processed by the server and
reset by the server to be passed in the next request. Since cookies allow the
storage and transfer of information about a user's previous requests2, it is
important to ensure that the security and the privacy of the user are not
sacrificed. Combinations of cookies, IP addresses and session identifiers
are now generally used for maintaining state information.
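Python's standard http.cookies module can illustrate the mechanics (the cookie name, value, and domain below are illustrative):

```python
from http.cookies import SimpleCookie

# Server side: set a cookie restricted to a domain, as part of the
# HTTP response header
cookie = SimpleCookie()
cookie["mystate"] = "logged-in"
cookie["mystate"]["domain"] = ".yahoo.com"
header = cookie.output()
print(header)  # Set-Cookie: mystate=logged-in; Domain=.yahoo.com

# Client side: the browser stores the cookie and sends the name=value
# pair back with later requests to that domain
sent_back = SimpleCookie()
sent_back.load("mystate=logged-in")
print(sent_back["mystate"].value)  # logged-in
```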
7. WEB SECURITY
7.1 Basics
2 Cookies, in many ways, ended privacy on the Web. Browsers do allow options of limiting or
even disabling cookie submissions, but many applications rely on cookies for proper
functioning.
An intercepted message is difficult for an eavesdropper to decipher without
the key. The server needs the key in order to decipher
the encrypted message to retrieve the original message. The limitation of this
approach is the need to make sure that the key is securely shared.
Figure: A client sends an encrypted message across the network to a server;
both sides share the same key.
This approach is also useful when one party wants to send information
back and forth to itself. Since HTTP is stateless, it is often useful for the
server to pass state information to the client and have the client submit it
back to the server for future use. In such a scenario, the server can encrypt
this information with keys that reside on the server side, and can later
decrypt the information using the same keys. Thus, the key is securely
stored on
the server side and does not need to be transmitted to the client.
Consider a message encrypted using the sender's private key. Upon receipt
of the message, the receiver
can successfully decrypt the message only if it is not modified. This assures
the receiver of the integrity of the transmitted message.
While the use of public and private keys ensures that the message has not
been modified, it does not ascertain that the message is actually from whom
it is supposed to be from. In order to address this issue, the notion of
'asymmetric keys' is used: if one key is used for encryption, then only the
other can be used for decryption. This notion of asymmetric keys is very
powerful in ensuring not only the integrity of the data that is being
transmitted, but also the authentication of the sender. Let us now see how
the notion of 'asymmetric keys' helps in enabling a secure transfer of
information. This is best illustrated with an example.
Let's assume that encryption and decryption with asymmetric keys is like
having a lock that can be locked with one key but only unlocked with the
other. B wants to send some valuables to A. First A locks a box with his
private key and ships it to B. B receives the box, unlocks it with the public
key of A. B puts another box inside this one which is locked with the private
key of B. Then he locks the outer box with the public key of A, and ships it
back to A. Now A unlocks the outer box with his private key, and unlocks
the inner box with the public key of B, and gets the valuables that B sent
him. When the valuables were transmitted from B to A, a thief needs the
private key of A in order to break into the boxes. Furthermore, if someone
else has stolen the box that A sent to B, and locked it with their private key,
then A would know this since A would use the public key of B to unlock the
inner box.
This is exactly what happens with public and private keys. Digital
certificates are essentially keys stored in files. Digital certificates are
produced by an established certificate authority. Digital certificates also
serve the purpose of vouching for a third-party. Let us consider the case
when party C wants to send data to B. If C sends its public key to B, how
does B know for sure it is from C? B trusts A, but not C yet. If A encrypts C's
public key using its private key, and sends it to B, then B would trust the
validity of C's public key. Certificate authorities such as VeriSign do this
and offer different grades of digital IDs.
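The mechanics of the lock-and-key analogy can be illustrated with a toy RSA key pair in Python (textbook-sized primes, wholly insecure; real systems use keys of hundreds of digits):

```python
# Toy RSA key pair built from tiny textbook primes p=61, q=53
n = 61 * 53          # public modulus (3233), part of both keys
e = 17               # public exponent
d = 2753             # private exponent: (17 * 2753) % 780 == 1, 780 = lcm(60, 52)

def apply_key(message: int, exponent: int) -> int:
    # Both locking and unlocking are modular exponentiation
    return pow(message, exponent, n)

m = 65
# Lock with the public key; only the private key opens the box
ciphertext = apply_key(m, e)
recovered = apply_key(ciphertext, d)
print(recovered)  # 65

# The reverse also holds: a message locked with the private key can be
# opened by anyone holding the public key, which authenticates the sender
signature = apply_key(m, d)
print(apply_key(signature, e))  # 65
```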
7.4 Non-Repudiation
8. PRIVACY
At the basic level, any serious web site must publish a privacy policy
statement and abide by it. The Platform for Privacy Preferences (commonly
called P3P) is an emerging standard that enables a formal mechanism of
specifying privacy policies and communicating them to the user or a
privacy policy agent. For instance, a P3P agent can be built into a browser,
and can aid the user by popping up an alert (for instance) if the privacy
policy setting for the user has been violated.
P3P1.0 [WWW P3P] discusses the details of the P3P version 1.0
standard, and we'll highlight the basic points. Every web site must have a
standard location where its privacy policy is accessible (such as
/w3c/p3p.xml) or embedded in the LINK tag in the HTML document. P3PV1.0
describes in detail the various tags allowed in the P3P XML file, and how
they are interpreted. The goal is to allow P3P agents to extract the P3P
XML description, and to present it or integrate it with the user's privacy
requirements. Additionally, privacy attributes may be specified in HTTP
responses from web servers. The policy file can specify details such as how
user requests and cookies are used, when the privacy policy expires, and
which zone the policy applies to.
9. CONCLUSION
The growth of the ARPANET led to the development of the Internet, which
is the infrastructure of the World Wide Web. The basic notion in any
network is that of an "endpoint". Communication occurs between
endpoints. In the case of the Internet, the endpoint is specified as a
4-byte IP address. Resources on the Web are identifiable using the URL
format. The Domain Name System (DNS) maps the host name in a URL to a
specific IP address. The TCP/IP protocol suite provides the transmission
and control formats and protocols for the exchange of data packets in the
Internet.
FURTHER READING
EXERCISES
3) Design and implement a simple server that listens on port 80. Whenever a
client connects to the server, the server waits for the client to write the string
"hello", upon which the server responds with the string "Got the message.
Goodbye." to the client and closes the connection.
4) Write a simple client program that interoperates with the server in exercise
(3).
7) Convert the server in exercise (5) or (6) to a simple HTTP server that
handles the HTTP commands "GET" and "HEAD".
10) Discuss how secure communication is achieved between the client and
server using HTTPS.
11) The apache server (with cgi-support) can be downloaded and installed
from www.apache.org. Write a CGI script in Perl that accepts two integers
and computes the lowest common factor of the two integers. Design a HTML
page to submit the two values, and format the result from the CGI Perl script
using HTML.
PART II
APPLICATIONS
Chapter 4
Information Retrieval
Abstract: What are the techniques used for the representation of information contained in
large collections of data? How is a query matched with the stored information
in order to retrieve relevant documents? The architecture and techniques
behind text information processing are presented in this chapter. Algorithms
for document analysis (such as latent semantic indexing (LSI)), query
expansion and notions such as document spaces are discussed with examples.
Keywords: Text Analysis, document spaces, indexing, search, query expansion, ranking,
precision-recall graph, latent semantic indexing, TF-IDF measures.
1. INTRODUCTION
Information retrieval is the study of methods for representing
information, and of mechanisms for locating specific parts of the stored
information in response to a query. How do you find a sports article in a
newspaper? Probably by just going to the sports section and picking an
article. How do you pick an article on baseball? Perhaps by locating the
keyword baseball in the sports articles. What about identifying an article
discussing the future of tennis? The complexity of the identification and
retrieval of documents increases with the amount of data and the constraints
on the query. It requires analysis of the document, and an "understanding" of
the content of the article. Imagine having a book with a million articles, and
being asked to find articles that are relevant to gravitational physics. What are the
steps involved in designing methods for computers to process huge amounts
of data, and extract relevant subsets of the data in response to a query from
the user? How does one condense vast amounts of data into useful compact
forms? The field of information retrieval addresses such questions, including
the visualization, analysis and processing of information. Although text
processing is primarily covered in this chapter, information retrieval also covers
other types of data such as audio, visual and multimedia data.
2. COMPONENTS OF IR SYSTEM
(Figure: block diagram of a generic information retrieval system. Documents
are stored in a database representation; in response to a query, retrieved
documents are returned as output to the user.)
Let us now turn our attention to the basic block diagram of a generic
information retrieval system. A lot of the discussions will be around text
retrieval since that is one of the most common problems in information
retrieval. It should be noted that there are many other non-textual problems
that fall in the category of information retrieval. For example, a user may
show a picture and ask the system to retrieve "similar images". This would
involve a lot of image processing and visual pattern recognition, but can be
posed in the information retrieval paradigm.
The other modules (above, and to the right of, the database box)
correspond to the actual "information retrieval phase". The user requests
information by giving the system a query. The system parses the query,
applies some transformations to it, and retrieves relevant information from
the database. Typically, the amount of information retrieved is large for big
collections, and this retrieved data should be processed further in order to
determine its relevance and ranking with respect to the user's original request.
The transformation operations allow the system to clean up the data. The
usual sets of transformations include tokenization, and document/term
operations. Term operation methods include stemming, stoplists, query
expansion with thesaurus, truncation, weighting and so on. These methods
are discussed in the following sections.
3. TEXT PROCESSING
3.1 Tokenization
Document analysis and retrieval methods work on a set of basic units
or tokens. A token can be defined arbitrarily, depending on the context of the
information analysis and retrieval task. In text analysis, words are good basic
units upon which to base analysis and indexing. Tokens can also be defined at
the sub-word or multi-word level. In other tasks such as image retrieval, the basic
units could be lines or boundaries in the image.
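For illustration, a minimal word-level tokenizer can be sketched as follows; the function name and the lower-casing convention are illustrative choices, not prescribed in the text:

```python
import re

def tokenize(text):
    """Split text into lower-cased word tokens: the basic units for indexing."""
    return re.findall(r"[a-z0-9]+", text.lower())

# Each token becomes a candidate index term for later stages.
print(tokenize("The shell reads typed commands."))
# ['the', 'shell', 'reads', 'typed', 'commands']
```

Real systems refine this with language-specific rules (hyphenation, apostrophes, numbers), but the principle is the same.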
3.2 Stoplists
Such lists of words that are excluded in deriving the actual representation of
the information are called "stoplists", and these words are discarded during the
initial stages of processing.
3.3 Stemming
Stems refer to morphological roots of words that are useful for indexing.
The stemming program clusters words into morphological classes. For
example, the words "musical" and "music" can be stemmed to the root
"music". Of course, some infonnation is lost, but such a reduction allows IR
systems to build compact systems with lesser index tenns. Reducing the
number of index terms has many advantages: better trainability, reduction in
the size of index files, and faster search/retrieval.
The similarity between two words 'w1' and 'w2' can be computed using
Dice's coefficient, which measures the overlap between the two words (for
example, over their letter bigrams):

S(w1, w2) = 2 |bigrams(w1) ∩ bigrams(w2)| / ( |bigrams(w1)| + |bigrams(w2)| )

Here, 'c1c2...cn' denotes the letters that occur before a given letter position. The
entropy at that position for the string 'c1c2...cn' can be defined as
follows:

H(c1c2...cn) = - Σ P(c | c1c2...cn) log2 P(c | c1c2...cn)

where the sum runs over the letters 'c' that can follow the prefix 'c1c2...cn'.
Pick a word, say 'users', and estimate the probabilities and entropies at each
letter position. For instance, if 's' is the only letter that follows the prefix
'user' in the corpus:

P( s | user ) = 1 ; H(user) = 0
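This successor-entropy computation can be sketched in code. The vocabulary below is hypothetical; in it, 's' is the only letter that follows the prefix 'user', so the entropy at that position is zero:

```python
import math
from collections import Counter

def successor_entropy(prefix, vocabulary):
    """Entropy of the letter that follows `prefix` across the vocabulary."""
    nexts = Counter(w[len(prefix)] for w in vocabulary
                    if w.startswith(prefix) and len(w) > len(prefix))
    total = sum(nexts.values())
    return sum(-(n / total) * math.log2(n / total) for n in nexts.values())

# Hypothetical vocabulary for illustration.
vocab = ["users", "used", "uses", "user", "using"]
print(successor_entropy("user", vocab))  # 0.0 since P(s | user) = 1
```

A sharp rise in entropy at a position suggests a morpheme boundary, which is the basis of successor-variety stemming.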
3.4 Thesaurus
Clique:
Syn: cabal, camarilla, camp, circle, clan, coterie, in-group, mob, ring
The thesaurus allows the system both to expand the user's query and perhaps even
to reformulate or reweight the terms. If the search retrieves too many items, the
thesaurus can be used to limit the query to "important" terms. On the other
hand, if the search does not retrieve enough documents, then the thesaurus
can be used to expand the query and thus expand the search. In the above
example, the query of "clique" can be expanded by searching for the
synonym words specified by the thesaurus. However, there is a risk of
deviating from the specificity of the original query. Expanding the search to
include "ring" for an original query of "clique" may result in unrelated
results with "ring" in a different context or sense (e.g. "wedding ring").
Text retrieval work suggests that words in the mid-frequency range are good
indexing terms (i.e. neither too frequent nor too infrequent). Another approach is to
evaluate terms based on their discriminability: the discrimination value with
and without the term is measured, and the terms with positive
discrimination value are retained in the thesaurus. Frequently occurring
phrases are also extracted and added to the thesaurus. Some other
researchers have also used co-occurrence values to construct phrases. In the
last step of thesaurus construction, the vocabulary is organized in the form of
a hierarchy based on the statistical similarity distances between terms.
Now, let's see how the inverted file helps in the retrieval of relevant
documents. Suppose a user searches the text database for the keywords "beauty,
symmetry" occurring in close proximity to each other in the text. The first step is
the lookup of the keywords in the vocabulary of the text. Next, the
occurrence list for each index term is retrieved for analysis. Lastly, operators and
constraints are applied to produce a result. In our example, the occurrence
list for "beauty" is {1, 32, 45, 56}, and the occurrence list for "symmetry" is {3,
34, 95, 120}. Enforcing a proximity constraint (e.g., the two words
should be within a distance of 5 of each other) results in the occurrence list
{1, 32} for "beauty" and "symmetry" within 5 words of each other. Since the
vocabulary for large text databases can run into the thousands, and the number
of occurrences can be large, the inverted index stored on disk is divided into
two parts: a sorted vocabulary file with information on the location of each
entry in the inverted index, and the inverted index file itself. Merging two
inverted files requires a sorted merge of the vocabularies, in addition to
merging the occurrence lists for duplicate words. A drawback
of inverted files is the assumption that the text is a sequential combination of
words. Searching for sequences of words, such as phrases, involves a
complicated match of the occurrence lists for each of the individual query
words.
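The proximity lookup in the example above can be sketched as follows; the occurrence lists are taken from the example, and the function itself is an illustrative sketch:

```python
def proximity_match(positions_a, positions_b, window=5):
    """Occurrences of word A that lie within `window` word positions of word B."""
    return [a for a in positions_a
            if any(abs(a - b) <= window for b in positions_b)]

# Occurrence lists from the chapter's example.
occurrences = {"beauty": [1, 32, 45, 56], "symmetry": [3, 34, 95, 120]}
print(proximity_match(occurrences["beauty"], occurrences["symmetry"]))  # [1, 32]
```

Positions 1 and 32 of "beauty" fall within 5 words of occurrences of "symmetry" (at 3 and 34), matching the result derived in the text.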
positions represent index points. The tree is created by starting at each index
position, and traversing one character at a time. If there is an arc labelled
with the character at that position, then the traversal moves to the next node.
If there is no label with the character seen at that position, then a new node is
created, and the label of the arc between the two nodes set to the character at
the current position. The leaf nodes store the word position of the word
that's generated by traversing from the root node of the tree down to the leaf
node. The tree can be compressed by merging nodes with single outgoing
arcs. Such a tree is called a prefix tree since the tree stores the prefix of each
word at the indexed positions.
Thus, the tree is created by starting at character position 4. The first arc
from the root node will have the label "e", the second arc from that node has
the label "s", and so on. The actual number of labels can be limited to a
specific length of characters (or fewer if the word is shorter than this
limit). Next, the character sequence for the word "happiness" at character
position 28 is added to the tree by starting at the root node, and traversing
down the tree, matching each character with the label on the arc, and
creating a new arc if the character is absent. The resulting tree is shown in
Figure 13.
Searching a prefix tree consists of traversing the tree from the root node
with each character of the query word until a set of matching leaf nodes are
reached, where the character positions of the prefix are stored.
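A minimal uncompressed prefix tree over index positions can be sketched as follows, using the index positions 4 and 28 from the example; the words attached to those positions are illustrative, and node merging for single outgoing arcs is omitted:

```python
def trie_insert(root, word, position):
    """Insert `word` into the trie, recording its index position at the final node."""
    node = root
    for ch in word:
        node = node.setdefault(ch, {})          # follow or create the labelled arc
    node.setdefault("$positions", []).append(position)

def trie_search(root, prefix):
    """Collect index positions of every stored word that begins with `prefix`."""
    node = root
    for ch in prefix:
        if ch not in node:
            return []
        node = node[ch]
    found, stack = [], [node]
    while stack:                                # walk the subtree below the prefix
        n = stack.pop()
        for key, child in n.items():
            if key == "$positions":
                found.extend(child)
            else:
                stack.append(child)
    return sorted(found)

root = {}
for pos, word in [(4, "happening"), (28, "happiness")]:
    trie_insert(root, word, pos)
print(trie_search(root, "happ"))  # [4, 28]
```

Searching for the prefix "happ" reaches the node shared by both words and reports both index positions.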
The idea behind signature files is to divide the document into blocks, and
associate with each block a signature that identifies which words may be
present in that block. This signature is in the form of a bitmap that's
generated using a hashing function applied to each word in the block. In the
example below, each block of text is enclosed in square brackets.
[The new version of the software was just released.] [Windows XP is the
latest version of software released recently.] [What software release lies
ahead?]
In order to create the signature file, each text block is hashed using the
hashing function to generate a signature. Typically, the bitmaps of the words
of the block are bitwise-ORed to generate the signature for the block. For
instance, the signature bitmap for the first block of the example is [011110].
The signature file then consists of a sequence of bitmap signatures, each with an index
pointer to the corresponding text block in the original document.
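A sketch of signature generation using superimposed coding follows. The hash function (CRC-32 modulo the bitmap width) and the 16-bit bitmap are illustrative choices, so the bit patterns differ from the [011110] example above; false matches from hash collisions remain possible, which is why signatures only filter candidate blocks:

```python
import zlib

def block_signature(words, bits=16):
    """Superimposed coding: OR together one hashed bit per word in the block."""
    signature = 0
    for word in words:
        signature |= 1 << (zlib.crc32(word.encode()) % bits)
    return signature

def may_contain(signature, word, bits=16):
    """True if the word's bit is set; false positives possible, false negatives not."""
    return signature & (1 << (zlib.crc32(word.encode()) % bits)) != 0

blocks = [["new", "version", "software", "released"],
          ["windows", "latest", "version", "software"],
          ["what", "release", "lies", "ahead"]]
signatures = [block_signature(b) for b in blocks]
print([i for i, s in enumerate(signatures) if may_contain(s, "software")])
```

Blocks whose signature matches are then scanned directly to confirm the query word actually occurs.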
5. RANKING
Given a document Di and a word wj, TF and IDF are defined as follows:

TF(Di, wj) = n(Di, wj) / Σk n(Di, wk)

IDF(wj) = Dn / D(wj)

where n(Di, wj) is the number of occurrences of the index term wj in document Di,
Dn is the total number of documents, and D(wj) is the number of documents
containing wj. The weight of a term in a document is then
W(Di, wj) = TF(Di, wj) · log2 IDF(wj).
The ranking metrics of all the terms in the user's query can be weighted
and combined to give a total TF-IDF metric that can be used for ordering
the retrieved documents.
5.4 Illustration
Table 27. Sample documents for Vector Space illustration.

Mystery Document 1: "The command is simply an executable program. The shell
reads typed commands and executes programs. UNIX has a file system arranged
as a hierarchy of directories. This manuscript is divided into several chapters."

Mystery Document 2: "Holmes stretched out his hand for the manuscript and
flattened it upon his knee. "You will observe, Watson, the alternative use of
the long s and the short. It is one of several indications which enabled me to
fix the date.""
Let us assume that the indexing terms extracted from the sample
documents shown above are: {UNIX, manuscript, Holmes, Watson}
102 Foundations of Web Technology
TF(Doc. 1, "manuscript") = Y2
TF(Doc. 2, "manuscript") = 1/3
lDF("manuscript") = 2/2 = 1
Thus,
W(Doc. 1,"manuscript")
= W(Doc. 2, "manuscript") = 0;
TF(Doc. 1, "UNIX" ) = Y2
TF(Doc. 2, "UNIX" ) = 0
lDF("UNIX") = 2/1 = 2
W(Doc. 1, "UNIX" ) = Y2
Similarly,
TF(Doc. 1, "Holmes" ) = 0
TF(Doc. 2, "Holmes" ) = 1/3
lDF("Holmes") = 2/1 = 2
W(Doc. 1, "Holmes" ) = 0
W(Doc. 2, "Holmes" ) = 1/3
Thus the vector space representations for the two documents are shown
in Figure 14.
Doc. 1: ( UNIX = 1/2, manuscript = 0, Holmes = 0, Watson = 0 )
Doc. 2: ( UNIX = 0, manuscript = 0, Holmes = 1/3, Watson = 1/3 )
Now, let us examine what happens when a user queries with the keyword
'Watson'. The query can be represented in the form of an input vector I (note that
the last element corresponds to the word "Watson"):

[0 0 0 1]

The dot product between the input vector I and a document vector
gives a measure of the match between the query and the document. Thus, the dot
product between document 1 and the input query is 0, whereas the dot
product between document 2 and the input query is 1/3. This measure, normalized
by the vector lengths, is also called the cosine correlation, and is commonly used
to measure similarity between vectors. Thus, in this example, document 2 will be
retrieved in response to the query word "Watson". The query words can thus
be used to construct the input vector to be used in matching.
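The worked example above can be reproduced in code; the TF, IDF and weight conventions follow the chapter's illustration (weight = TF · log2 IDF):

```python
import math

docs = {
    "Doc1": ["UNIX", "manuscript"],              # index terms present in document 1
    "Doc2": ["manuscript", "Holmes", "Watson"],  # index terms present in document 2
}
terms = ["UNIX", "manuscript", "Holmes", "Watson"]

def weight(doc, term):
    """Term weight W = TF * log2(IDF), as in the chapter's worked example."""
    tf = docs[doc].count(term) / len(docs[doc])
    df = sum(1 for d in docs.values() if term in d)  # documents containing the term
    idf = len(docs) / df                             # Dn / D(wj); df > 0 for index terms
    return tf * math.log2(idf)

vec1 = [weight("Doc1", t) for t in terms]  # [0.5, 0.0, 0.0, 0.0]
vec2 = [weight("Doc2", t) for t in terms]  # [0.0, 0.0, 1/3, 1/3]
query = [0, 0, 0, 1]                       # input vector I for the query word "Watson"

score1 = sum(q * v for q, v in zip(query, vec1))
score2 = sum(q * v for q, v in zip(query, vec2))
print(score1, score2)  # document 2 matches the query; document 1 does not
```

The dot products come out to 0 and 1/3, so document 2 is retrieved, as in the text.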
5.5 Discussion
6. QUERY OPERATIONS
6.1 Purpose
6.5 Example
7.1 Motivation
Document 1: [ 1 0 0 1 1 0 0 0 ]
Document 2: [ 1 1 1 1 1 0 0 0 ]
Document 3: [ 1 0 1 1 1 0 1 ... 1 ]
Term         Doc. 1   Doc. 2
UNIX          1/2      0
manuscript     0       0
Holmes         0      1/3
Watson         0      1/3
The term-document matrix A can be decomposed using the singular value
decomposition as A = U S V^T, where U holds the left singular vectors of
dimension (T x R), S is the (R x R) diagonal matrix of singular values, and V
holds the right singular vectors of dimension (N x R). The matrix
S is by definition positive definite, and the matrices U and V are unitary.
Note that the resultant matrix is an approximation when R << N, T.
The term "Latent Semantic Indexing" was coined to indicate that the
method tries to extract the most important (in a mathematical sense)
"semantic" information represented in the matrix A. LSI is not only useful in
terms of efficiency of computation and storage, but also improves the
retrieval quality in many cases by automatically filtering "noisy information"
(as a by-product of eigenvalue ordering). LSI may also assist in solving the
problem of synonymy: different terms describe the same underlying
concepts. By projecting into the most prominent sub-spaces in a non-linear
fashion, some of these difficult to extract relationships may be automatically
captured. Traditional vector space models assume term independence, but in
reality many terms are related to each other, and show strong associations.
Such term dependencies are often minimized in the projected space.
Drawbacks of LSI include huge storage requirements and computational
costs, in addition to assumption of normally distributed data.
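As a small illustration of the decomposition underlying LSI, the dominant singular value and right singular vector of the chapter's 4 x 2 term-document matrix can be found by power iteration on A^T A; a real LSI system would use a full SVD routine instead:

```python
import math

# Term-document matrix from the example (rows: UNIX, manuscript, Holmes, Watson).
A = [[0.5, 0.0],
     [0.0, 0.0],
     [0.0, 1/3],
     [0.0, 1/3]]

# B = A^T A (2 x 2); power iteration on B converges to the top right singular vector.
B = [[sum(A[k][i] * A[k][j] for k in range(4)) for j in range(2)] for i in range(2)]

v = [1.0, 1.0]
for _ in range(200):
    w = [sum(B[i][j] * v[j] for j in range(2)) for i in range(2)]
    norm = math.sqrt(sum(x * x for x in w))
    v = [x / norm for x in w]

# The corresponding singular value is ||A v||.
sigma = math.sqrt(sum(sum(A[i][j] * v[j] for j in range(2)) ** 2 for i in range(4)))
print(round(sigma, 3), [round(x, 3) for x in v])  # 0.5 [1.0, 0.0]
```

The dominant direction here is the Doc. 1 / UNIX component (singular value 0.5), so a rank-1 approximation would keep only that association.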
8. EVALUATION METRICS
8.2 Precision
8.3 Recall
(Figure: a precision-recall graph, plotting precision on the vertical axis
against recall on the horizontal axis.)
9. CONCLUSIONS
In this chapter, the reader has been introduced to the field of information
retrieval. The problem of information retrieval is to retrieve documents that
are relevant to a user's query. The various steps involved in the design of an
information retrieval system have been described.
FURTHER READING
EXERCISES
Take two sections from your favourite newspaper (e.g. sports and
business). Pick two representative paragraphs as text data for exercises 1-4.
1) For the two paragraphs chosen, list the words and frequencies of
words that occur in the paragraphs. Study the distribution of words, and see
if you can identify index words that can distinguish the two paragraphs.
2) Apply stop-word elimination, and build an inverted index into the two
paragraphs.
3) For each of the index words, compute the TF-IDF measure with each
of the paragraphs, and represent as a matrix of section versus index words.
4) Use the matrix computed in (3) to determine the section that matches
best when each of the index words is used as a query. Repeat for pairs of
index words.
5) Discuss the idea behind Latent Semantic Indexing. What are the pros
and cons?
7) The set of documents relevant to a query Q is {D1, D2, D3, D4, D5}.
The set of documents retrieved by an IR system is {D3, D4, D5, D8}.
Compute the precision and recall.
The world is the ever-changing foam that floats on the surface of a sea of silence
Abstract: Search and directory are some of the earliest applications of the World Wide
Web. In this chapter, the general architecture of a Web search system is
presented. An important aspect of a Web search system is the Web crawler
that crawls the World Wide Web by following links and storing this
information for processing. The issues in crawling, and how queries are used
to retrieve documents, are presented. Variants in search such as focussed
crawling, meta-search, and dynamic search are also discussed. An overview of
Web directory systems and different methods of constructing Web directories
are described. Ranking algorithms such as PageRank™ and web 'social
structure' extraction using Hub/Authority analysis are also covered.
Keywords: Web Search, Web Crawler, Ranking, PageRank, Hub, authority, dynamic
search, focussed crawler, topic distillation, web directory, web classification
taxonomy, automatic taxonomy generation, relevance feedback.
1. INTRODUCTION
At the onset of the World Wide Web, systems indexed a few hundred
thousand documents, and searching of these documents did not pose major
technical hurdles. In fact, one of the earliest Web search engines, the
World Wide Web Worm (WWWW) [McBryan 1994], indexed just 110,000
pages and accessible documents. However, with the rapid growth in the
number of web pages, and the increased amount of dynamic or changing
content, web search and directory systems face serious challenges. Over
the last five years, improvements in the technology have enabled building
directories out of hundreds of millions of documents, periodically crawling
the Web to find new documents and update information indexed in
previously traversed documents, and searching through the vast amount of
information from all around the world. Such search and
directory systems should be able to handle billions of requests at major sites
today. This chapter summarizes some of the basic architecture of web search
and directory systems, discusses the challenges, and outlines areas of future
improvements.
2. WEB SEARCH
The goal of a Web search system is to collect data from the World Wide
Web, index this data, and extract relevant documents from this database in
response to a user's query. The first step is the collection of the pages from
around the Web. With the Web containing hundreds of millions of
documents, this is an expensive task both in terms of computation and
storage. The Web crawler is the component of the Web Search system that
performs the crawling. Crawling is essentially exploring the Web in order to
find documents and their contents. The output of crawling is a list of URLs
and the retrieved documents, which are stored in a compressed manner.
The next component of the search system is the indexer. The purpose of
the indexer is to create indices from the crawled documents, so that in
response to a query, an efficient lookup can be done to retrieve documents.
Thus the indexer performs tasks such as parsing of the documents, extracting
anchor texts (associated with links), generating the lexicon, and other such
related tasks. The steps of crawling and index generation can be viewed as
the "representation phase" of a Web search system.
The second phase of the search system is the retrieval. The core
component of the retrieval phase is the search module. A user's query is
processed (with techniques such as query expansion) and submitted to the
search module. The search module uses the inverted indices created by the
indexer and retrieves the relevant set of documents. These documents are
then ranked using various algorithms. Figure 17 summarizes the overall
architecture of a Web search system.
Figure 17. Overview of Web search system (the crawler feeds a storage
server; the indexer builds the document DB; user queries are answered from
the index).
While the basic notion of web crawling is fairly straightforward, there are
two important difficulties inherent in the problem of crawling the World
Wide Web. The WWW contains hundreds of millions of pages, and is growing
at a rapid rate. Even with the most powerful multi-processor systems,
prominent web crawlers can cover only around 30-40% of the WWW.
Furthermore, the time and memory requirements of a crawl are overwhelming:
a crawling cycle takes anywhere from a few weeks to months.
• Crawler Architecture
4 FIFO stands for "First In First Out" and LIFO stands for "Last In First Out". In the context
of crawlers, this applies to the order in which un-traversed URL links are stored and
processed further.
(Figure: crawler architecture.)
2. While URL Queue not empty and Crawler Limits not reached
9. If any outgoing link is not marked as "done", then add its URL to the crawler
queue and repeat from step 3.
The steps of the basic crawling algorithm are summarized in Table 28.
The first step of a crawler is the initialization phase, wherein all the relevant
data structures are reset, and initialized. The crawling procedure begins
when a list of "seed URLs" are provided. The crawling begins by accessing
these seed URLs, and proceeds with the algorithm. The seed URLs are
usually put in the crawler URL queue. Then the main thread is started. This
master thread checks the URL queue, and retrieves the next URL to be
processed. If the number of threads is within the specified limit, then this
URL is passed to one of the "crawler-slave" threads. Each "crawler-slave"
thread communicates with a web server and accesses the document. Some
form of rudimentary parsing can be done at this stage itself, depending on
the actual needs of the crawler.
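The crawl loop described above can be sketched as follows. The in-memory "web" and its URL names are hypothetical stand-ins for actual HTTP fetches; threading, politeness delays, and parsing are omitted:

```python
from collections import deque

# Hypothetical in-memory "web": URL -> list of outgoing links (stands in for HTTP GETs).
WEB = {
    "http://a.example/": ["http://b.example/", "http://c.example/"],
    "http://b.example/": ["http://c.example/"],
    "http://c.example/": ["http://a.example/"],
}

def crawl(seeds, limit=100):
    """Basic crawl loop: pop a URL, fetch it, mark it done, enqueue unseen links."""
    queue, done = deque(seeds), set()
    while queue and len(done) < limit:
        url = queue.popleft()
        if url in done:
            continue
        links = WEB.get(url, [])      # a real crawler issues an HTTP request here
        done.add(url)                 # mark the URL node as visited
        queue.extend(l for l in links if l not in done)
    return done

print(sorted(crawl(["http://a.example/"])))
```

The FIFO queue gives a breadth-first traversal; replacing it with a stack (LIFO) would give depth-first crawling, as noted in the footnote above.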
The next step of the document analysis is parsing the document. At the
simplest level, the meta search tags can be used to classify each document.
However, many documents are not tagged or improperly tagged, and further
analysis of the document is often needed. Researchers have also proposed
analysis of the actual link, and the anchor text for a link in order to get
relevant information about the document itself.
After the processing of the document is completed, this URL node in the
crawl graph is marked as visited. The children/outgoing links from this
document are checked to see if there are any unprocessed links. The
unprocessed links are then put into the crawler URL queue for later
processing.
Since the crawl graph essentially represents the whole structure of the
WWW (or the parts that are represented by the crawler), it is important to
compact the individual structures. For instance, many of the above attributes
can be packed tightly either using manual coding or coding algorithms such
as Huffman coding. Since the children URLs share prefixes with the parent,
the memory requirements of the crawl graph structures can be decreased
substantially by using a delta representation of the URLs. Another
important aspect to consider in a crawler is DNS caching. A large portion of
the time is spent in DNS resolution of the URLs, and having a DNS cache
significantly speeds up crawling time.
• Parsing
The next step of the indexing phase is the creation of forward, and
inverted indices. As an example, we will illustrate the data structures
required by using the anatomy of the early Google prototype search system
[Brin & Page 1998]. The Google engine maintains the following data: hit
lists, forward index, and inverted index. A hit list is a "list of a particular
word in a particular document including position, font, and capitalization
information"[Brin & Page 1998]. They use two notions of hits: fancy and
plain. Fancy hits correspond to matches in URL, title, anchor text or meta
tags, whereas plain hits represent other matches. The forward indices are
"partially sorted" and stored into distributed databases called "barrels".
Each word in the lexicon is assigned a wordID, and each document/URL is
assigned a docID. Each barrel corresponds to a range of wordIDs, and the
list of docIDs that have words in that range are stored in that barrel as a
sequence of hit lists.
mechanism of refinement. It will soon be clear why the inverted docID index
lists have been maintained.
The retrieval phase takes the set of query words and, using the inverted
docID index lists, retrieves the set of docIDs that contain the queried words.
These are then merged into a final set of docIDs that contain all the query
words, and ranked to generate the result of matching documents. These steps
are summarized in the algorithm below:
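A sketch of this conjunctive retrieval step, using a hypothetical toy inverted index (ranking omitted):

```python
def retrieve(query_words, inverted_index):
    """Intersect the docID lists of all query words (conjunctive retrieval)."""
    doc_sets = [set(inverted_index.get(word, ())) for word in query_words]
    result = doc_sets[0] if doc_sets else set()
    for s in doc_sets[1:]:
        result &= s          # keep only docIDs containing every query word
    return result

# Hypothetical toy index: word -> list of docIDs containing it.
inverted_index = {"web": [1, 2, 5], "search": [2, 3, 5], "engine": [2, 5, 7]}
print(sorted(retrieve(["web", "search", "engine"], inverted_index)))  # [2, 5]
```

In a production system the docID lists are kept sorted on disk and intersected with a merge scan rather than materialized as sets.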
frequently crawled). Despite such heuristics, the dynamic nature of the web
documents poses a problem for the search engines and results in the retrieval
of documents using outdated data. Furthermore the WWW links could also
be "dead", i.e. no longer in existence, and the crawler/search system needs to
cope up with such cases efficiently.
Web search can be an interactive process. Not much research has been
done in this direction. Grouper [Zamir & Etzioni 1998] clusters the search
results and presents the clustered results for the user to pick from. On
another front, the relevance feedback given by the user can be used to
improve/adapt the document models and/or term weights[Gudivada et al
1997].
(vii) Spamming.
Another important aspect that makes the development of useful and
accurate search engines difficult is the notion of "spamming"[SciAmer
1999]. The web page designer includes a list of keywords (such as
"cheapest, best sale, best book", etc.) in order to try to deceive an indexing
system into associating the document with those keywords. Sometimes,
developers write these words repeatedly in colors that render them invisible to
readers. One solution is to discard terms that have either very high or very
low frequencies. Spamdexes will be a part of the (unusually) high frequency
terms, and thus be discarded or reweighted.
3. VARIATIONS IN SEARCHING
3.1 Meta-Search
(Figure: a meta-search engine forwards the user's query to search engines A,
B, and C, and merges the returned results.)
A dynamic search system is one that actually fetches documents from the
WWW in response to a query for relevance analysis. On the other hand, a
static search system uses a precompiled information repository and finds the
best match to a query. Example systems implementing dynamic search are
FishSearch[De Bra et al 1994] and WebGlimpse[Manber et al 1997].
128 Foundations of Web Technology
4. RANKING
It is often useful to rank the utility of links that are outgoing from a
particular document. For example, if we want to build an IR system that
retrieves medical web pages, then links to local bakeries from a user's
website are often uninformative. It is important to screen out the
uninformative links from the useful/relevant links.
This measures the number of links that have a link into this document.
This is defined as the ratio of number of links to this document divided by
the total number of URLs.
[Brin & Page 1998] describe the PageRank algorithm used in the Google
search engine, which utilizes the link structure of the Web to calculate a
quality ranking. The PageRank algorithm is motivated by the concept of
citation importance and is defined as follows [Brin & Page 1998]:

PR(A) = (1 - d) + d ( PR(T1)/C(T1) + ... + PR(Tn)/C(Tn) )

PR(A) refers to the PageRank of web document 'A'. T1...Tn are the documents that
point to the document 'A', C(A) is defined as the number of links going out
of document 'A', and d is a damping factor (usually set to 0.85).
of document 'A'. More recently, [Henzinger et al 1999] have used the
PageRank function in conjunction with random walks on the web to estimate
the quality of web search engines.
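The PageRank iteration can be sketched as follows, using the non-normalized form from [Brin & Page 1998] on a hypothetical three-page graph:

```python
def pagerank(out_links, d=0.85, iters=50):
    """Iterate PR(A) = (1 - d) + d * sum over in-links T of PR(T) / C(T)."""
    pr = {page: 1.0 for page in out_links}
    for _ in range(iters):
        new = {page: 1 - d for page in out_links}
        for page, links in out_links.items():
            for target in links:
                new[target] += d * pr[page] / len(links)   # share rank over out-links
        pr = new
    return pr

# Hypothetical three-page web: A and C link to B, and B links to C.
ranks = pagerank({"A": ["B"], "B": ["C"], "C": ["B"]})
print(ranks)
```

Page B, with two in-links, converges to the highest rank; A, with none, settles at the baseline 1 - d.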
The location metric uses the URL itself as an indicator of the importance
of the document, not its contents. Thus a crawler searching for professors'
homepages might just look for "*.edu" URLs and assign them better
values.
h(s) = Σ a(q), summed over all pages q such that s → q

a(s) = Σ h(q), summed over all pages q such that q → s
Table 30 and Table 31 trace the values of the authority and hub scores
respectively, for these nodes at different iterations of the HITS algorithm.
After a few iterations, the system converges to the values shown in the last
row of Table 30 and Table 31. It can be seen that node w2 has the highest
authority score, while node w0 has the highest hub score. This is intuitive
from Figure 20, since node w2 has the maximum number of incoming arcs. w0 has 2
outgoing arcs (the same as node w2), but the nodes it points to have better
authority than any other node.
132 Foundations of Web Technology
Table 31. The Hub scores for iterations of the HITS Algorithm
Iteration   w0     w1     w2     w3     w4
1           0.20   0.20   0.20   0.20   0.20
2           0.29   0.14   0.29   0.14   0.14
3           0.33   0.07   0.20   0.20   0.20
4           0.35   0.04   0.26   0.17   0.17
17          0.37   0.00   0.21   0.21   0.21
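The mutually reinforcing hub/authority iteration can be sketched as follows on a hypothetical five-node graph (not the exact graph of Figure 20, but one in which w2 again receives the most links):

```python
def hits(out_links, iters=20):
    """Alternate authority and hub updates, normalizing each round (HITS)."""
    pages = list(out_links)
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iters):
        # a(s) = sum of h(q) over pages q linking to s
        auth = {p: sum(hub[q] for q in pages if p in out_links[q]) for p in pages}
        # h(s) = sum of a(q) over pages q that s links to
        hub = {p: sum(auth[q] for q in out_links[p]) for p in pages}
        for scores in (auth, hub):
            total = sum(scores.values())
            for p in scores:
                scores[p] /= total
    return hub, auth

# Hypothetical link graph in which w2 receives the most incoming links.
graph = {"w0": ["w1", "w2"], "w1": ["w2"], "w2": ["w0"],
         "w3": ["w2"], "w4": ["w0"]}
hub, auth = hits(graph)
print(max(auth, key=auth.get), max(hub, key=hub.get))  # w2 w0
```

As in the tables above, the node with the most in-links emerges as the top authority, and the node pointing at high-authority nodes emerges as the top hub.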
5. WEB DIRECTORIES
(Figure: expert-labelled categories and Web documents are fed to a classifier,
which organizes document clusters into the directory taxonomy.)
repeated until a split cannot be achieved with high confidence. For instance,
the broad category at a leaf node of the taxonomy could be "four
wheelers:cars". The system can then cluster the Web pages in this category
and automatically determine, for instance, that the car type (e.g. luxury,
mid-size etc.) differentiates a number of the pages, and thus add the category
"type" to the taxonomy tree.
(Figure: Web documents and broad categories are fed to the directory
classifier, which produces subcategory document clusters for the taxonomy.)
Fisher(t) = ( μ(c1,t) - μ(c2,t) )² / [ Σ over classes c of (1/|c|) Σ over documents d in c of ( x(d,t) - μ(c,t) )² ]

Here μ(c,t) is the mean vector of all the elements within a class 'c' for
each term 't', c1 and c2 are the two classes, and x(d,t) is a sample from the
document training data. The TAPER [Chakrabarti et al 1998] system has
been applied successfully to train a taxonomy using 266,000 web documents
from 2118 Yahoo! classes.
6. CONCLUSION
FURTHER READING
[Brin & Page 1998] is a good tutorial paper on building the Google Web
search system. [Bharat & Broder 1998] discuss methods of measuring the
overlap and sizes of Web search engines. Focussed crawling, topic
distillation, and feature selection are discussed in [Chakrabarti et al
1999], [Bharat & Henzinger 1998], [Chakrabarti et al 1998a], [Chekuri et al
1997] and [Chakrabarti et al 1998b]. [Cho et al 1998], [Miller & Bharat
1998], [Aggarwal et al 2001], [Edwards et al 2001] and [Najork & Wiener
2001] discuss methods of URL crawling. Recent variants of dynamic search
are discussed in [Ben-Shaul et al 1999], and meta-search in [Dreilinger &
Howe 1997], [Wu et al 2001]. Some challenges in web search evaluation are
presented in [Hawking et al 1999]. Web graph analysis is covered in [Bharat
et al 1998], [Gibson et al 1998], [Henzinger et al 1999], [SciAmer 1999],
[Kleinberg 1998], and [Lawrence & Giles 1999]. [WWW
SearchEngineWatch] is a good online site that is updated with useful
information on Web search and directories.
EXERCISES
1) Design and construct a Web crawler that starts off a set of seed URLs
and iteratively retrieves documents.
(8) Contrast the Web Search and Directory approaches to Web access.
In the mountain, stillness surges up to explore its own height; in the lake, movement stands still
to contemplate its own depth
Abstract: One of the tremendous advantages of the Web is the vast amount of
information logged about user accesses. This information can be mined to
gain deep insights into a Web site, its usage, visitation trends and
other patterns. The basic techniques in data mining, such as association
mining, classification, clustering and sequence analysis, are first presented.
Applications of Web mining, including server log mining, link analysis,
user trend analysis, collaborative filtering for recommendation and
adaptive web site organization, are covered in this chapter.
1. INTRODUCTION
Millions use the Web to search for information, communicate with each
other, purchase goods or access personalized data. This results in billions of
page views - a valuable source of information: information about users'
interests, about what users want, and about what users search for. Such
information can provide deep insights into how users' interests match with
each other, what they find appealing in a site, or even trends and personalized
preferences. The task of Web mining is to use this huge amount of
information to extract trends and predict user preferences.
2. DATA MINING
The collection and analysis of data for the purpose of determining trends
and customer preferences is not new. This problem arises in many
retail, business, and financial (stock market forecasting) applications. More
recently, the area of data mining has focussed on mining the large amounts of
data stored in huge databases, a field also termed "Knowledge Discovery
in Databases" (KDD). The focus of data mining is to extract patterns from
data. The seven steps involved in KDD are:
a) Data selection/Sampling
b) Preprocessing of Data
c) Transfonnation/Reduction of data
d) Data Mining
The first step is to collect data and store it in a database. The next step is
selection of relevant data to do the analysis on. Often it may be useful to
sample a representative subset of data in order to quickly estimate trends.
The raw data generally contains errors and invalid field entries. A cleanup
process called "pre-processing" needs to be done. Data mining tasks
generally involve high-dimensional data: it is possible to reduce the
dimensionality by considering subsets of features relevant to the mining task
at hand. The next step is the actual processing and mining of data in order to
generate patterns and models. Theoretically, it is possible to have an infinite
number of models consistent with the same data - so these patterns and models are
evaluated using certain criteria to determine the utility of the derived models.
The final step is the visualization of the data using these models to see if they
provide further insights into the nature or patterns in the data. In traditional
Online Analytical Processing systems, data was stored in databases, and
queries in languages such as SQL are used to generate reports and extract
patterns in the data. Data mining, on the other hand, often uses automatic
algorithmic approaches in order to extract patterns in the data.
3. ASSOCIATION MINING
1. Association Rules
C = Count( X ∪ Y ) / Count( X )
2. Causality
Causality refers to the notion that the presence of itemset X causes item
Y to be present. This notion is best illustrated using the "diaper-beer"
example commonly cited in data mining. Let us assume the causality
"diaper causes beer", i.e. diaper buyers are likely to pick up beer.
Promoting a sale on diapers will increase diaper buyers visiting the
store, who in turn buy beer. Increasing the price of beer will then lead
to improved profits.
3. Frequent Itemsets
O(m · i^(m-1))
A Priori Algorithm
The two key observations used for extracting itemsets efficiently are:
• If {A, B} has support 's', then both A and B must have at least support
's'.
• If A or B has support less than 's', then {A, B} must have support
less than 's'.
The first observation states that if the combined set {A, B} has a support
of at least 's', then the subsets {A} and {B} must also have a support of at
least 's'. The second observation states that if itemsets {A} or {B} have a
support of less than 's', then the combined set {A, B} must have support less
than 's'. These rules are very useful in reducing the number of itemsets
considered, in addition to providing an iterative scheme for generating the
itemsets. The A priori algorithm [Srikant & Agarwal 1995] proceeds in the
following manner:
a) Find all itemsets of size one that have a minimum support 's' (set L{1}).
b) Pairs of items from set L{1} become candidates for set L{2}. Again,
compute and threshold to retain itemsets that have a minimum support
's' to generate L{2}.
c) Iterate over itemsets of length k, combining itemsets from iteration
k-1, and retaining those with minimum support 's' as L{k}.
d) Proceed up to the k needed, or till the set L{k} is empty.
One of the problems is that the number of baskets is usually very large,
and the number of items is also large. This results in a large number of
itemsets of size 2, making it difficult to retain the pairs of items in main
memory. [Park et al 1995] propose an extension whereby pairs of items are
hashed into a table in the first pass. The idea behind the PCY algorithm is as
follows:
a) In the first pass, compute the frequent size-1 itemsets. Furthermore,
hash each pair of items found in a basket into a hash table and
increment counts. Convert this hash table into a bitmap indicating a
'1' for a frequent pair bucket, and '0' otherwise.
b) In the second pass, the hash table bitmap is loaded into memory. For
all the pairs of items from size-1 frequent itemsets, check to see if the
hash table bitmap indicates '1', and if so create an entry and count
occurrences.
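The two-pass scheme above can be sketched in Python. This is a simplified illustration, not a memory-optimized implementation: the hash table is an ordinary list, and the basket contents are made up for the example.

```python
from itertools import combinations

def pcy_frequent_pairs(baskets, min_support, n_buckets=101):
    """Simplified PCY: two passes over the baskets."""
    # Pass 1: count single items, and hash every pair into a bucket.
    item_counts = {}
    bucket_counts = [0] * n_buckets
    for basket in baskets:
        for item in basket:
            item_counts[item] = item_counts.get(item, 0) + 1
        for pair in combinations(sorted(basket), 2):
            bucket_counts[hash(pair) % n_buckets] += 1

    frequent_items = {i for i, c in item_counts.items() if c >= min_support}
    # Bitmap: '1' for buckets whose total count meets the threshold.
    bitmap = [c >= min_support for c in bucket_counts]

    # Pass 2: count only pairs of frequent items whose bucket is frequent.
    pair_counts = {}
    for basket in baskets:
        for pair in combinations(sorted(set(basket) & frequent_items), 2):
            if bitmap[hash(pair) % n_buckets]:
                pair_counts[pair] = pair_counts.get(pair, 0) + 1
    return {p: c for p, c in pair_counts.items() if c >= min_support}

baskets = [{"A", "B"}, {"A", "B", "C"}, {"A", "C"}, {"B", "C"}, {"A", "B", "C"}]
print(pcy_frequent_pairs(baskets, min_support=3))
```

Note that a truly frequent pair can never be filtered out by the bitmap, since its bucket count is always at least the pair's own count; the bitmap only prunes candidates.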
3.3 Example
Let us now illustrate the A Priori algorithm with some examples. Consider
the set of baskets shown in Table 32.
Basket Items
3 {A,B,C}
4 {B, E}
5 {B,C,D}
6 {A, B, C, E}
Thus, itemsets {D} and {E} are discarded since they did not pass the support
threshold. This leaves us with the following set of itemsets of size 1:
{A}, {B}, {C}
Next, we combine these itemsets to form possible itemsets of size 2:
{A, B} : support = 2/6 = 0.33
{A, C} : support = 3/6 = 0.50
{B, C} : support = 3/6 = 0.50
Thus, the sets {A, C}, and {B, C} are frequent itemsets of size 2 and so on.
It is clear that using the minimum support item subsets in order to determine
itemsets of a higher order enables the elimination of a number of low support
candidates.
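The level-wise search above can be sketched as follows. Since Table 32 is only partially reproduced here, the first two baskets below are a hypothetical completion chosen to be consistent with the supports computed in the example ({A, B} = 2/6, {A, C} = 3/6, {B, C} = 3/6):

```python
def apriori(baskets, min_support):
    """Level-wise frequent itemset mining (A priori)."""
    n = len(baskets)

    def support(itemset):
        return sum(1 for b in baskets if itemset <= b) / n

    # L{1}: frequent single items.
    items = sorted({i for b in baskets for i in b})
    current = [frozenset([i]) for i in items
               if support(frozenset([i])) >= min_support]
    frequent = list(current)
    # L{k+1}: unions of frequent k-itemsets, thresholded again.
    while current:
        candidates = {a | b for a in current for b in current
                      if len(a | b) == len(a) + 1}
        current = [c for c in sorted(candidates, key=sorted)
                   if support(c) >= min_support]
        frequent.extend(current)
    return frequent

# Baskets 1 and 2 are assumed; baskets 3-6 follow Table 32.
baskets = [{"A", "C"}, {"A"}, {"A", "B", "C"}, {"B", "E"},
           {"B", "C", "D"}, {"A", "B", "C", "E"}]
for itemset in apriori(baskets, min_support=0.5):
    print(sorted(itemset))  # {A}, {B}, {C}, {A, C}, {B, C}, as in the example
```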
4. PREDICTIVE MODELLING
Given a set of features, each data point can be viewed as a point in this
feature space. This is generally called the input space (which we denote as
X). The training data consists of a corresponding set of output values or
categorical classifications (which we denote as Y). The goal of predictive
modelling is to estimate a function f that maps the input space vectors to the
output space vectors Y. This estimation is done using a finite sample of
training data, and a key issue is that of generalization. The function f may fit
the sampled data points very well, but may not be reflective of the general
classification patterns exhibited by the data samples. This is referred to as
"overfitting"5. There are numerous techniques that try to overcome this
problem, such as validating classification accuracy on a separate set of data
not included in the training, called "cross-validation". For the classification
problem, the function f assigns the input vector to one of a number of
disjoint sets or classes. In the case of binary classification, the sampled data points
are assigned one of two classes, A or B. The classification function has the
following form:
Linear Classifiers
5 Overfitting indicates that the model fits the data samples too well!
With a set of data points x1, x2, and so on, this results in a set of linear
equations that can be solved to estimate the values of the weighting
coefficients.
Regression
The sampled data points x1, x2, x3, ..., xM are used to estimate the coefficients
w, and the problem can be posed as the linear system:
Aw = b
and the weights can be chosen as the minimum-norm solution:
min_w ||w||_2  subject to  Aw = b
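The minimum-norm solution of an underdetermined system can be computed with the Moore-Penrose pseudo-inverse; a small numpy sketch (the matrix and right-hand side here are made-up illustrative values):

```python
import numpy as np

# An underdetermined system: 2 equations, 3 unknowns.
A = np.array([[1.0, 1.0, 0.0],
              [0.0, 1.0, 1.0]])
b = np.array([1.0, 1.0])

# Minimum-norm solution of Aw = b via the pseudo-inverse.
w = np.linalg.pinv(A) @ b

# np.linalg.lstsq returns the same minimum-norm least-squares solution.
w2, *_ = np.linalg.lstsq(A, b, rcond=None)

print(w)                      # the solution with the smallest ||w||_2
print(np.allclose(A @ w, b))  # the constraints Aw = b are satisfied
```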
Neural Networks
In 1943, McCulloch and Pitts proposed a simple model of the nerve cell
as a threshold unit which forms the basis of many neural network models
today. Each neuron has a set of inputs, which are individually weighted,
combined linearly, and thresholded in order to decide whether to "fire" or
not. A "firing" neuron essentially sends a signal of "1" to the neurons it is
connected to, and a "non-firing" neuron sends a signal of "0" to the neurons
its output is connected to. This process is repeated over all the neurons in
the neural network, and a final classification result is determined at the
"output" neurons. Thus the input neurons take the input, the neural network
transforms this input, and the output neurons determine what the resulting
output from the network is. Each single neuron is similar to the linear
classifier described earlier.
[Figure: a multi-layer neural network mapping an input vector through hidden layers to an output layer.]
The error is estimated by using the output that's expected (based on the
training data), and the output that's generated by the network. Thus, in a
single layer network, the error of the output from corresponding inputs is
used to modify the weights of the linear network. In the case of multi-layer
network, there are internal "hidden layers" that do not have any output
specified in the training data. How do we then modify the weights in the
hidden layers based on the final output? One popular algorithm for
determining the weights in a multi-layered neural network is the "back
propagation" algorithm [Rumelhart et al 1987]. The key concept is the
propagation of the error at the output layer back to the internal layers,
where it serves as an "error function" used to adapt the parameters of the
hidden layers.
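The back-propagation update can be sketched with a small numpy network. This is an illustrative sketch only: the XOR data, layer sizes, learning rate, and squared-error loss are choices made here, not taken from the text.

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# Toy training data: XOR, a problem a single linear unit cannot solve.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

# One hidden layer of 3 sigmoid units, one sigmoid output unit.
W1, b1 = rng.normal(size=(2, 3)), np.zeros(3)
W2, b2 = rng.normal(size=(3, 1)), np.zeros(1)

def loss():
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)
    return ((out - y) ** 2).mean()

initial = loss()
lr = 0.5
for _ in range(5000):
    # Forward pass.
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)
    # Backward pass: propagate the output error to the hidden layer.
    d_out = (out - y) * out * (1 - out)    # output-layer delta
    d_h = (d_out @ W2.T) * h * (1 - h)     # hidden-layer delta
    W2 -= lr * h.T @ d_out
    b2 -= lr * d_out.sum(axis=0)
    W1 -= lr * X.T @ d_h
    b1 -= lr * d_h.sum(axis=0)

print(initial, loss())  # training error decreases
```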
Decision Trees
Decision trees partition the training data recursively until each partition
entirely (or within a threshold) belongs to a particular class. Thus decision
trees are tools for classification and prediction. The decision tree consists of a
tree model where each node is either an internal "decision node" (split-point)
or a leaf "classification node". Classification proceeds by walking down the
decision tree with the unclassified data till a leaf node is reached, at which
point the predicted class of the data sample is known. Decisions are often
rule-based and can be arbitrary functions of the data sample. Figure 25 shows
an example of a decision tree.
[Figure 25: a decision tree whose root decision node tests the condition A > 1.]
gini(S) = 1 - Σ_j p_j^2
Now if a split divides S into two subsets S1 and S2, then the gini index of
the divided data can be computed using:
gini_split(S) = (|S1|/|S|) gini(S1) + (|S2|/|S|) gini(S2)
The attribute that minimizes the gini index is chosen as the split point.
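The gini computation and split evaluation can be sketched as follows (the class frequencies p_j are estimated from label counts; the sample labels are illustrative):

```python
from collections import Counter

def gini(labels):
    """gini(S) = 1 - sum_j p_j^2 over the class frequencies p_j in S."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def split_gini(left, right):
    """Weighted gini index of a split of S into S1 (left) and S2 (right)."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

labels = ["Summer", "Summer", "Winter", "Winter"]
print(gini(labels))                        # maximally impure: 0.5
print(split_gini(labels[:2], labels[2:]))  # a pure split: 0.0
```

A split-point search would evaluate `split_gini` for each candidate attribute test and keep the one with the smallest value.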
from the root node to a leaf node. However, for large datasets, construction
of the tree must be efficient, since multiple passes over the dataset are
needed.
Formally, if we have 'J' judges, and 'I' items, then the ratings can be
viewed as a J x I matrix of rating values, where some of the ratings are
missing. An example of an algorithm to solve this problem is the SM
algorithm [Shardanand & Maes 1995]. The SM algorithm uses a linear
combination of observed ratings, weighting similar judges higher. An
example of such a rating matrix is shown in Table 33. The '.' entries in the table
indicate that the article has not been rated by the user. The ratings in the table
range from 0 to 5, 5 being the highest rating and 0 the poorest.
The predicted rating of item i for judge j takes the form:
R(j, i) = N + Σ_{j' ≠ j} W(j, j') (R(j', i) - N)
where W(j, j') is the similarity weight between judges j and j', and
N is the neutral rating value, usually the midpoint of the rating scale.
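One common form of such a weighted prediction can be sketched as follows. This is a sketch only: the normalization by the sum of absolute weights is an assumption here, and the ratings and weights are hypothetical values.

```python
def predict_rating(ratings, weights, judge, item, neutral):
    """Predict a missing rating as the neutral value plus a weighted
    combination of other judges' deviations from neutral.

    ratings: dict mapping (judge, item) -> rating; missing pairs are absent.
    weights: dict mapping other judges -> similarity weight to `judge`.
    """
    num, den = 0.0, 0.0
    for other, w in weights.items():
        if other != judge and (other, item) in ratings:
            num += w * (ratings[(other, item)] - neutral)
            den += abs(w)
    return neutral if den == 0 else neutral + num / den

ratings = {("j1", "article1"): 5, ("j2", "article1"): 4}  # hypothetical data
weights = {"j1": 1.0, "j2": 1.0}                          # similarity to "j3"
print(predict_rating(ratings, weights, "j3", "article1", neutral=2.5))  # -> 4.5
```

Missing ratings are simply skipped, so the prediction uses only the judges who actually rated the item; if none did, the neutral value is returned.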
For instance, if f(x) > 0.5 then the class is "Summer", and if f(x) <= 0.5, then
it is "Winter". Such a function is called a "step function" and is non-linear.
Let us visualize the linear classifier that we modelled using the sample
data points. The linear classifier in our example is a plane that slices the
three-dimensional plot of "Sunny (S)" versus "Heat (H)" versus "f(x)" on the
third axis. Figure 26 shows a graph of "Sunny" versus "Heat"; the line
shown is the intersection of the linear classifier at f(x) equal to 0.5. It can be
seen that the top left side of the line represents values where f(x) is greater
than 0.5, and the lower right part represents values of f(x) less than 0.5. This
classifier maps the top left part to the class "Summer" and the bottom right
part to "Winter". The three data points are represented as stars and belong to
the appropriately labelled classes. An important aspect to note in this
example is that even though the sample data points map to the right classes,
the actual linear classifier is not necessarily "general". It may be noted that
the classifier essentially uses "Sunny" in a positive sense to weight
towards class "Summer", while the contribution of "Heat" is weighted
negatively. Intuitively, both should be weighted positively, such that lower
values of "Sunny" and "Heat" map to class "Winter" and vice-versa.
[Figure 26: plot of "Sunny" versus "Heat", with the linear classifier's decision line at f(x) = 0.5.]
5. CLUSTERING
5.2 Techniques
• K-Means Method
A problem with the k-means algorithm is that it often finds only a local
optimum of the cumulative error. Furthermore, k-means is sensitive
to noise and to data points that lie at the boundaries of clusters (called outliers).
• k-medoids
• BFR
The BFR algorithm is based on k-means and tries to estimate the mean and
deviation along each dimension of a normally distributed cluster model. In
the BFR model, a cluster consists of a Discard Set, which is the core set of
data points that belong to that cluster and are used to determine the centroid
and standard deviation model for that cluster. The next part is the
"Compression Set", which consists of data points that are close to each
other, but far from any cluster's centroid. These Compression Sets are also
modelled by a centroid and a variance. Finally, data points that do not belong
to either the Discard Set or a Compression Set are called the Retained Set and
are kept in main memory.
• GRGPF
• CHAMELEON
The clustering algorithms that we have discussed so far use some form of
distance metric in order to partition the data into clusters. While this is powerful,
other interesting approaches to clustering are based on "density". The set of
data points can be viewed as a space of "dense regions" and "sparse regions".
Algorithms such as DBSCAN [Ester et al 1996] count the number of data
points that reside within a particular region, and the region is classified as
"dense" or "sparse" based on this count. OPTICS [Ankerst et al 1999]
introduces the notion of cluster ordering for automatic cluster analysis. Both
methods require a spatial index structure like the R*-tree [Beckmann et al 1990].
Density-based approaches are more suitable for low-dimensional data
clustering.
• Grid-Based Methods
In grid-based approaches, the data vector space is quantized into a finite
number of cells which form a grid data structure. This reduces the
complexity of clustering over density-based methods. Some examples of
grid-based approaches are STING [Wang et al 1997], WaveCluster
[Sheikholeslami et al 1998], and CLIQUE [Agarwal et al 1998]. STING is a
hierarchical grid-based approach which computes statistical information
within each grid cell that is collectively used to determine good clusters.
WaveCluster uses the wavelet transform to identify dense regions in the
transformed space. CLIQUE integrates density-based and grid-based
clustering, and applies the A priori property from association mining: if a region
is dense in k dimensions, then it must be dense in its (k-1)-dimensional
projections.
• FastMap
5.6 Example
The first step is to assign the data points as belonging to cluster 1 (mean
m1) or cluster 2 (mean m2). For instance, the squared Euclidean distance of
data point 1 to centroid m1 = 5*5 + 2*2 = 29. The squared Euclidean
distance from data point 1 to centroid m2 = 7*7 + 7*7 = 98. Since 29 < 98,
data point 1 is assigned to cluster 1. After the cluster assignments are made,
the means of the data points in each cluster become the new centroids. In our
example, after the first iteration, the assignments of the data are shown in
Table 36.
Using the information in Table 36, the new centroid values are computed
to be:
m1* = (0.3, 0.7)
m2* = (7.5, 7.5)
This is shown in Figure 28. It can be seen that the centroids have
"moved" towards the cluster of points in the data, and separated the two sets
of clusters. The "+" in the figure are data samples, and the stars are the
cluster centroids.
[Figure 28: scatter plot of the data samples ("+") and the cluster centroids (stars m1 and m2), with Cluster 1 and Cluster 2 separated by the moved centroids.]
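One iteration of the update just described can be sketched in Python. The data points below are hypothetical, since the original table of points is not reproduced here; the structure of the computation (assign to the nearest centroid by squared Euclidean distance, then recompute means) matches the example.

```python
def kmeans_step(points, centroids):
    """One k-means iteration: assign each point to the nearest centroid
    (squared Euclidean distance), then recompute centroids as the means
    of the assigned points (assumes no cluster goes empty)."""
    def sq_dist(p, c):
        return sum((pi - ci) ** 2 for pi, ci in zip(p, c))

    assignments = [min(range(len(centroids)),
                       key=lambda k: sq_dist(p, centroids[k]))
                   for p in points]
    new_centroids = []
    for k in range(len(centroids)):
        members = [p for p, a in zip(points, assignments) if a == k]
        new_centroids.append(tuple(sum(x) / len(members)
                                   for x in zip(*members)))
    return assignments, new_centroids

# Hypothetical data: two groups, near (0, 0) and near (7, 7).
points = [(0, 0), (0, 1), (1, 1), (7, 7), (8, 8), (7, 8)]
assignments, centroids = kmeans_step(points, [(0, 0), (7, 7)])
print(assignments)  # [0, 0, 0, 1, 1, 1]
print(centroids)    # the centroids move toward the two groups
```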
Furthermore, such techniques can be used for agent-assisted browsing (e.g.
[Cheung et al 1997], [Shahabi et al 1997]). The system suggests links that the
user can follow during the process of browsing. The second approach is that
of tour generation (e.g. [Joachims et al 1997]), wherein the system generates a
tour which takes the user from one link to another. [Wexelblat & Maes 1999]
describe the Footprints system, which provides a metaphor of travellers
creating footpaths which other travellers can use.
and only documents that match that topic profile are presented (or
highlighted) to the user. [Soboroff & Nicholas 1999] discuss such an
approach by combining LSI with collaborative filtering with limited success.
WebWatcher[Joachims et al 1997] uses words from the document the user is
browsing to detect the topics of interest to the user, and estimates link
probabilities using TF-IDF heuristic using the extracted keywords. A second
approach used by WebWatcher is based on reinforcement learning where
each link is represented as a state in the reinforcement learning state space,
and the rewards correspond to the TF-IDF measures. This illustrates a
combination of filtering and navigation.
Analysis of Web sites, how important they are, how relevant they are, and
how they influence each other, is an important area of study. We have
discussed methods such as PageRank™ and the Hub-Authority analysis
algorithms in an earlier chapter. An important notion is that the Web can be
viewed as a huge interconnected graph, with various patterns that can be
extracted and mined from this graph. Hub/Authority is just one such
example.
aspect is the need for multiple traversals of the data. Often, the data is distributed
and the algorithms are applied on subsets of the data. However, sub-division of
the data often results in loss of information, which needs to be corrected using
heuristics.
8. CONCLUSION
The ability to analyse Web users and usage patterns is one of the most important
advantages of the Web framework. An inherent aspect of Web technology is
the ability to capture information about user accesses, and to mine this
information to derive a number of useful statistics, trends, and user models.
These models can be applied to various applications such as shopping
recommendation, popular-sites lists, click-through analysis, and detailed
information on the user population (ranging from demographics and gender
to interests and preferences). While the potential benefit of Web mining is
apparent in the sales cycle, Return On Investment improvement on sites,
cross-sell in commerce, and improved navigational interfaces, privacy and
protection of any sensitive data collected is of paramount importance.
In this chapter, the various techniques of data mining were first presented.
Association mining techniques provide the ability to perform "market
basket" analysis and generate association rules with support and confidence.
Various algorithms for association mining were discussed. Classification and
regression are also an important part of data mining and have numerous Web
mining applications. Methods of classification such as linear classifiers,
neural networks and classification trees have been discussed. Clustering is an
important area of data mining and is applied on unlabelled data. Techniques for
clustering have also been presented with illustrations. Lastly, sequence and
event mining was briefly discussed, followed by a discussion of some Web
mining applications.
FURTHER READING
EXERCISES
3) Table 6 below shows data to be used for exercises (3) and (4).
Determine a linear classifier that fits the data shown as well as possible.
4) Generate a decision tree classifier that fits the data shown in Table 6.
7) Using the Web server log, sort the sequence of URLs requested by
client IP address and time. Extract three-URL sequences and
threshold with a certain minimum count. See if these sequential
patterns of URL traversal hold for new requests.
8) The matrix below shows a set of items that are rated by a set of
experts.
The best does not come alone. It comes with the company of the all
1. INTRODUCTION
2. MESSAGING APPLICATIONS
Electronic mail existed before the Web. The ability to transfer electronic
messages from one computer in a network to another has existed for a few
decades now. What the Web did was make applications through which electronic
mail (or e-mail) is accessible to a large number of users. E-mail Web
applications allow users to go to a Web site and register for their
personalized e-mail address. These sites also provide management tools to
read, delete and manage the list of messages. With such simple and
flexible access to registering and using e-mail, the Web fuelled the large-scale
adoption of e-mail.
[Figure: Web-based e-mail architecture, with the e-mail Web server connected to a message server and a message store.]
3.2 SMTP
command. The next steps are the actual mail transactions. A mail transaction
starts with the MAIL command that identifies the sender. The RCPT
command identifies the recipients, and multiple RCPT commands can be
specified in the case of multiple recipients. Lastly, the DATA command initiates
the transfer of mail data. The data is usually terminated by an "end of mail"
indicator, which also confirms the transaction. The QUIT command may be
sent by the client to complete the session and close the connection. An example
of a sequence of SMTP commands is shown in Table 41.
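In Python's standard library, the MAIL/RCPT/DATA exchange is handled by smtplib; a sketch follows. The addresses and host are placeholders, and the actual send is factored into a function so that it only runs against a reachable SMTP server.

```python
import smtplib
from email.message import EmailMessage

def build_message(sender, recipient, subject, body):
    msg = EmailMessage()
    msg["From"] = sender          # becomes the MAIL FROM sender
    msg["To"] = recipient         # each recipient becomes an RCPT TO
    msg["Subject"] = subject
    msg.set_content(body)         # transferred after the DATA command
    return msg

def send(msg, host="localhost", port=25):
    # HELO/EHLO, MAIL, RCPT, DATA and QUIT are issued by the library.
    with smtplib.SMTP(host, port) as server:
        server.send_message(msg)

msg = build_message("alice@example.com", "bob@example.com",
                    "Hello", "A test message.")
print(msg["Subject"])
# send(msg)  # uncomment to deliver via a reachable SMTP server
```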
3.3 POP3
The Post Office Protocol (POP) is a simple protocol that allows a system
to retrieve mail stored on mail servers (such as SMTP servers that support
POP). POP limits functionality to the retrieval and deletion of messages.
The POP server listens for requests on port 110. A client that wishes to
make use of the service makes a TCP connection to the server. A greeting
message is sent by the POP server to the client to start the session. The next
step in the session is authorization, wherein the client sends identification
information to the server. Once the client is successfully identified, the server
acquires the resources associated with the client's mailbox. The client then
proceeds by requesting actions, and the server performs the actions and
returns responses.
Once the required series of actions is completed, the client can send a "quit"
command and the connection is closed. In some cases such as auto logout
timeout, the connection can be aborted.
• Authorization
Different mechanisms of authorization are allowed in POP3, including
the USER/PASS commands and the APOP command. In the USER/PASS
scenario, the client must first issue the USER command. If the mailbox
exists and allows plain-text password authentication, then the server returns a
positive response. The PASS command is next issued by the client to
complete the authorization process. In the APOP command, the argument
includes a string identifying the mailbox, and an encrypted (MD5) digest
string. The digest string is computed using a timestamp issued to the client
by the server and a secret string shared by the client and the server. The
reason for the APOP method is to avoid sending the password in the clear on
the network.
• Transactions
Let us now discuss the transactions allowed in POP3. The basic POP3
response is either "+OK" or "-ERR". The commands that the client can send
include QUIT, STAT, LIST, RETR, NOOP, and RSET.
The STAT command is used to retrieve information about the mail drop
including the number of messages and the size of the mail drop. The LIST
command takes the message number as an option, and the server retrieves
the information for that message. If no message number is specified,
information about all the messages for that mail drop is returned. The RETR
command takes a message number as an argument, and the server retrieves
the message corresponding to that message number. The NOOP is a no-operation
command. The RSET command resets the session and unmarks all messages
marked as deleted. The QUIT command is used by the client to
indicate that the client has completed the session. The server then removes all
messages marked as deleted from the mail drop, unlocks the resources
associated with the mail drop, and closes the TCP connection.
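Python's poplib exposes these transactions directly; a sketch of a retrieval session follows. The host and credentials are placeholders, and the retrieval logic is factored into a function so it can be exercised against any object exposing the same stat/retr/quit interface.

```python
import poplib

def fetch_all(conn):
    """Retrieve every message via STAT/RETR, then QUIT."""
    count, size = conn.stat()    # STAT: message count and maildrop size
    messages = []
    for i in range(1, count + 1):
        resp, lines, octets = conn.retr(i)  # RETR: fetch message i
        messages.append(b"\r\n".join(lines))
    conn.quit()  # QUIT: commit deletions, unlock the maildrop
    return messages

# Against a real server (placeholders):
# conn = poplib.POP3("pop.example.com")  # server listens on port 110
# conn.user("alice")                     # USER/PASS authorization
# conn.pass_("secret")
# print(fetch_all(conn))
```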
3.4 IMAP4
The Internet Message Access Protocol (IMAP) allows the client to access
and manipulate electronic mail messages on a server. In this section, we
discuss IMAP version 4, rev. 1 [IMAP4v1]. IMAP allows clients to access
remote mail folders (called mailboxes), and make them seem functionally
equivalent to a local mailbox. IMAP allows creation, deletion and renaming
of mailboxes. IMAP supports MIME parsing, searching, selective fetching of
message attributes, and so on. Messages are accessed in IMAP using unique
identifiers or message sequence numbers. Unique message identifiers persist
across sessions, permitting a client to resynchronize its state
from a previous connection with the server.
The IMAP server listens on port 143 when TCP is used for transport. The first
step is the establishment of a connection between client and server. Next, the
server greeting is sent, followed by client-server interactions consisting of a
client command, server responses, and a server completion result response.
The client command that starts an operation is prefixed with a short
alphanumeric string, called an identifier tag. This identifier is unique for
each client command. Data transmitted by the server to the client is prefixed
with "*" or "+". "*" indicates that the response does not indicate command
completion; such responses are called untagged responses. Three possible
completion results are OK (success), NO (failure) and BAD (error). Server
responses are generally of three types: status responses, server data, and
command continuation requests.
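Python's imaplib manages the identifier tags and untagged responses for the caller; a sketch follows. The host, mailbox and credentials are placeholders, and the helper takes any connection-like object so it can be exercised without a live server.

```python
import imaplib

def message_count(conn, mailbox="INBOX"):
    """SELECT a mailbox read-only and return its message count.
    The untagged EXISTS data is returned by select() as e.g. [b'3']."""
    status, data = conn.select(mailbox, readonly=True)
    if status != "OK":
        raise RuntimeError("SELECT failed: %r" % (data,))
    return int(data[0])

# Against a real server (placeholders):
# conn = imaplib.IMAP4("imap.example.com")  # server listens on port 143
# conn.login("alice", "secret")
# print(message_count(conn))
# conn.logout()
```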
4. IM ARCHITECTURE
[Figure: IM architecture, with clients A and B connecting through an authentication server and an online/offline user status server, backed by friends storage, a socket descriptor/server information server, and an offline message store.]
On the flip side, the sequence of steps that occur when an IM client
logs out or disconnects is:
- Client "A" disconnects from the server.
- The server detects this, and releases the appropriate connection after informing
the connection manager.
- The status of user "A" is set to offline by sending a message to the status server.
- All online friends of "A" are notified of the new status.
- Future messages to "A" are stored in the offline message store.
186 Foundations of Web Technology
Now that we have a general picture, let us reiterate the design choices in
an IM system:
- Connection type: UDP, TCP or a combination?
- Do clients connect directly to each other?
- How to ensure privacy and security?
5. COMMERCE APPLICATIONS
Before a consumer purchases on the Web, the consumer must trust both
the medium and the Web site in order to disclose private information such as
credit card information. The consumer must be convinced that this
information will not be misused by the company, and that no other entity will
get unauthorized access to this information. Furthermore, the commerce
activity of the user must be strictly confidential, and the privacy terms
agreed to by the user must not be violated in any manner. Consumers need a
trustworthy brand that provides goods of value and quality. In certain
applications such as airline and hotel reservations, the Web provides an easy
and convenient mechanism of shopping. For instance, the consumer can
search through different itineraries, and prices, before choosing a ticket
purchase.
• Biztalk
• EbXML
The second view is the Functional Service View (FSV), which is the
framework to discover and convey the Business Object information. At the
core of the FSV are "distributed repositories" which hold the Business
Process and Information Models, the ebXML MetaModel, Trading Partner
Profiles, and the ebXML specification. ebXML Registries provide access to
the repositories. The Trading Partner Profile (TPP) defines a
partner's capabilities and the security mechanisms supported, while the Trading
Partner Agreement (TPA) defines the agreements between the trading
partners.
registry in each of the layers. Furthermore, queries and the
response documents are defined in the registry's public interface.
-.NET
Microsoft has developed the .NET platform for rapid adoption and
development of web services8. Web services enable service registration,
lookup and easy integration. An important aspect is that the .NET platform is
bundled with a base set of services, including personalized user database
management (called Passport), which includes e-commerce related
information such as the user's electronic wallet, and can be integrated by
vendors to carry out transactions.
-OMG Commerce
-IOTP
- OAG Framework
-OBI
The Open Buying on the Internet (OBI) protocol provides a secure and
interoperable framework for business-to-business commerce development.
HTTP is used as the transport, and security is achieved by using
SSL. The OBI architecture consists of the requisitioner, the buying party, and the
selling party.
• RosettaNet
There are several technologies for electronic payment. Some of the early
technology embedded microchips in credit-card-like plastic cards. The
embedded microchip permits flexible options, such as a choice of payment
mechanisms (such as Visa or MasterCard), and the ability to store a large
amount of information. Various flavours of such "smart cards" include the
ability to recharge, transfer payments, and use in conjunction with credit
cards. Some examples of efforts on the development of smart cards include the
Integrated Circuit Card (ICC) Specification for payments, and VisaPay.
Another form of electronic payments is the "token". Tokens may not
necessarily be associated with real money although they frequently are.
Tokens can be associated with a certain value and be used for electronic
transaction between parties. While electronic token systems are very
promising, there are many difficult issues such as ensuring atomicity of
transactions involving tokens, and network issues when transporting tokens.
• SET
Every IFX request document must contain a SignonRq, called the signon
message. The services are then listed, and the actual message/command is
embedded within each service. Some of the services that are supported by the
IFX standard include base services such as enrolment and maintenance,
banking services, payment services, and billing and notification services.
Messages can be in the form of a request message or a response message.
Several types of common messages used in IFX include:
</SignonRq>
<PaySvcRq>
<RqUID>some UUID</RqUID>
<SPName> example.com </SPName>
<PmtAddRq>
<RqUID>some UUID</RqUID>
</PmtAddRq>
</PaySvcRq>
</IFX>
7. EXAMPLE ARCHITECTURE
module that determines the goods being sold to the
consumer, and often comprises its own set of components (e.g. servers
and databases). These application servers may also generate the front-end Web
pages for the services or goods being sold, and may interact with a wide
variety of other external modules in order to accomplish the end goal: enabling
the consumer to purchase a service or goods through the system.
[Figure: example commerce architecture, with a Commerce Application Server connected to an Inventory Manager, an Electronic Wallet, a Billing/Shipping Module, a Fraud Controller, and a Transaction Manager.]
Upgrades
Downgrades
Cancellations
Scenario 1: Pay-per-Use
In this first scenario, the consumer selects an item or a service to
purchase. For example, the service may be access to a live broadcast event
between certain times. The user must be charged only after they are successfully
able to watch the broadcast program. In such a scenario, it is meaningful to
charge the user for this single event, and deduct that amount after the user
has had access to the event. Generally, pay-per-use is a simple form of purchase,
since the user can be charged immediately after approving the purchase. In
some cases, complications may arise. For instance, what happens if the user
is charged for the broadcast service, and starts watching the event, but is
disconnected by the server during the event? In such cases, the payment
account information of the user is verified at the time of purchase, and the
actual charge to the account is made after the user has successfully watched the
event.
Subscriptions also require a module (see figure) that monitors the billing
cycles of all the subscribers, and charges the appropriate users on the
appropriate dates for the respective amounts. Thus, unlike the pay-per-use
model, subscriptions require recurring charges to the user on a periodic
basis.
[Figure: subscription billing architecture, with a Service/Products Manager, Pricing Models, a Subscription Process backed by a Subscription Database, and a Billing Module.]
Scenario 4: Promotions/Discounts
The other variable in pricing models is the notion of promotions and
discounts. Different pricing models for the same service or product can be
established based on factors such as date of purchase, or discounts during
certain holidays. Management of such promotions makes the billing task
more complicated.
Scenario 4: Packaging
Let us assume that service "A" has a monthly subscription rate of "S1".
Service "B" has a monthly subscription rate of "S2". Services "A" and "B"
may be compelling as a packaged single product, perhaps sold at a
discounted price "S3". Such flexibility allows a wide variety of
combinations in packaging products and selling bundled pricing to users.
The subscription database contains a list of pricing IDs for each user.
Periodically, the subscription monitor queries the subscription database for
the list of charges to be made at a particular date, and the users to be
charged. This list is translated into a list of charges and the users'
electronic payment account information. These charge sheets can either be
routinely pushed to a third party financial billing vendor, or pulled by the
vendor, and the accounts charged accordingly.
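The charge-sheet flow just described can be sketched as follows; the data layout, names, and amounts are illustrative assumptions, not the book's implementation.

```python
from datetime import date

# Illustrative subscription database: user -> (pricing ID, monthly rate in
# cents, day of month on which the user is billed).
SUBSCRIPTION_DB = {
    "alice": ("P1", 999, 1),
    "bob": ("P2", 499, 15),
}

# Illustrative electronic-payment account references, one per user.
PAYMENT_ACCOUNTS = {"alice": "acct-111", "bob": "acct-222"}

def build_charge_sheet(today):
    """Query the subscription database for the users due to be charged on
    this date, and translate the result into (account, amount) entries."""
    sheet = []
    for user, (_pid, cents, billing_day) in sorted(SUBSCRIPTION_DB.items()):
        if billing_day == today.day:
            sheet.append((PAYMENT_ACCOUNTS[user], cents))
    return sheet

# The resulting sheet would then be pushed to (or pulled by) the billing vendor.
print(build_charge_sheet(date(2002, 5, 15)))
```

In a real system the database query and the hand-off to the billing vendor would be separate, audited steps; the sketch only shows the translation from subscriptions to charges.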
Let us consider the sequence of steps that actually occur during a charge
cycle. The simplest case is when a user purchases a single item and charges
immediately. The sequence of steps is:
There are charges incurred by the billing party during these steps. Step
(c), which is the authorization step, requires that the system verify the
validity of the user's electronic payment account. Such verification may
include validation of a credit card, or verification of the current billing
address, and other security checks. Such validation is generally done with a
billing partner or a financial institution such as a credit card company, and
generally incurs some cost to the billing agency. In step (d), which is the
"settlement step", the account is charged for the appropriate amount, and this
action also generally incurs some cost. It is obvious that the profit on the sale
of the item should take into consideration such validation costs incurred.
What happens when a user buys multiple items? Would the billing
system repeatedly incur charges during operations in steps (c) and (d)? An
alternative is the "fund reservation" method, which is best illustrated with an
example. Assume that a user wants to purchase 5 items "A", "B", "C", "D",
and "E". Each item costs a dollar. The default approach is validation and
charging for each purchase. The "fund reservation" approach is as follows:
Since the verification and the charging of the account are done only once
in the above scenario, this approach saves cost. This requires the additional
step of "fund reservation". Reserved funds are normally released if the
charge is not made within a specified duration of time. We can see that this
is similar to the concept of micropayments that we discussed earlier.
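The saving can be made concrete with a small sketch; the fee values are hypothetical, chosen only to show the comparison between the two approaches.

```python
# Hypothetical per-operation fees charged to the billing party, in cents.
AUTH_FEE_CENTS = 10    # cost of one authorization/validation (step c)
SETTLE_FEE_CENTS = 10  # cost of one settlement (step d)

def per_item_fees(num_items):
    """Default approach: authorize and settle each purchase separately,
    so the fees are paid once per item."""
    return num_items * (AUTH_FEE_CENTS + SETTLE_FEE_CENTS)

def fund_reservation_fees(num_items):
    """Fund reservation: one authorization reserves the total amount up
    front, and one settlement charges it, however many items are bought."""
    return AUTH_FEE_CENTS + SETTLE_FEE_CENTS

# Five one-dollar items "A".."E": 100 cents of fees versus 20 cents.
print(per_item_fees(5), fund_reservation_fees(5))
```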
being free for the first month (and a charge after the first month), then
fraudulent users can sign up with different new accounts and get the service
for free by cancelling after the free month. This kind of fraud can be traced
in certain cases, and abuse prevented by blocking such malicious account
users. Fraud is not special to e-commerce, but rather is a side-effect of any
commerce activity. A credit card may be stolen, and used for purchases. In
such cases, accounts with the stolen credit cards need to be identified, and
access to the service blocked or terminated. Another issue is one of bad
credit. What if a user signs up with a valid credit card, but then the card
expires or is cancelled? The billing system, in conjunction with the fraud
controller should be able to address such cases effectively in order to ensure
that the service is being delivered to the right customers, and not to
fraudulent users. Fraud control is an essential part of any profitable e-
commerce business.
The components that have been discussed so far are common to any
commerce activity. The core module that is unique to the actual commerce
application is the "Commerce Application" module. A paid email service or
a paid broadcast service would use the same common components such as
the wallet and the billing infrastructure. But each of these services will have
its own application infrastructure, such as management of the email store,
and authentication of user accounts. For instance, the shopping module would
maintain a database of items available for sale, the vendor the items are
available from, list price and so on. A store hosting application will need to
have components that allow store owners to edit and manage their store
online, in addition to being able to upload inventories and fulfil orders. The
commerce application module may have another block that allows the user
to search through a list of items. Such e-commerce application specific
functionality is included in the commerce application module.
• Distributed
• Redundant resources
• Fault tolerant design
• Diverse integration mechanisms
• Tight coupling with components such as transaction manager,
billing.
• Real-time processing
• Pipelined framework
• Atomicity and Recovery
• Inherent Security and Privacy features
• Application specific features
8. CONCLUSION
FURTHER READING
Protocols such as SMTP, POP, and IMAP are covered in RFCs [RFC
2060], [RFC 1939], [RFC 2683] and [RFC 2821]. A good review of e-
EXERCISES
5) Build a shopping site that can list up to 50 items. Any user visiting that
site can add items to his/her shopping cart. Implement a simulated wallet and
inventory that will allow users to "purchase" those items.
The sky remains infinitely vacant for earth there to build its heaven with dreams
Abstract: The rapid development of mobile technology has fuelled growth in the area of
wireless Web. Wireless Web technology enables access to information from
the World Wide Web on any mobile device: anywhere, anytime. In this
chapter, the technologies that make "Wireless Web" a reality are discussed,
ranging from mobile communication infrastructure to delivery of Wireless
Markup data. The Wireless Application Protocol (WAP) and the
corresponding Wireless Markup Language (WML) are presented with
examples. Methods of generating wireless markup including HTML
transcoding and XSLT are summarized. The important area of Short
Messaging Service (SMS) for mobile originated/terminated message delivery
and emergent trends in mobile access to information from the Internet are
covered.
1. INTRODUCTION
The technologies behind Wireless Web are diverse, ranging from cell
towers to XML generating web servers. In this chapter, the general
architecture and description of the Global System for Mobile communication
is presented. Next, the architecture of a popular protocol suite used for
implementing wireless applications called the Wireless Application Protocol
(WAP) is discussed. This leads to the description of the markup languages
used commonly in Wireless Web including Wireless Markup Language
(WML), HDML, and cHTML. Limitations of the current technology, and a
discussion of the current emergent standards such as the Third Generation
(3G) initiative are summarized.
[Figure: GSM network architecture — external networks (PSTN, PLMN, PSPDN) connect to the GSM network through the GIWU.]
processed in a visitor location register (VLR). The MSC that the VLR is
connected to gets the information from the appropriate HLR.
[Figure: Wireless devices communicate with a WAP gateway, which relays requests to Web servers.]
At the heart of the WAP protocol is the WAP stack. The WAP protocol
stack is built using various protocols at different layers of the stack. This is
shown in Table 44.
• HTTP Interface
Ramesh R. Sarukkai 213
So why is there a new suite of protocols that have been defined in WAP?
Why were protocols for the Internet such as TCP not simply used? The
overhead of Internet-oriented protocols such as TCP can be avoided using
wireless specific protocols. For instance, reliable transport means different
things in TCP/IP and the wireless datagrams. TCP requires an overhead to
handle out-of-sequence packets. However, this is redundant in wireless
datagram transmission since there is only one possible route between the
gateway and the handset. In practice, the WAP gateway maps incoming
requests to a number of WAP proxies. Most of the Internet related work is
done by the gateway including tasks such as DNS lookup. Another
advantage is the eschewing of a TCP stack at the handset/wireless device
that enables the WAP client to be very "thin" (i.e. low processing power and
memory requirements). Furthermore, since a lot of the work is done at the
gateway, the amount of information transmitted is smaller than a
corresponding HTTP request, thus leading to savings in wireless bandwidth.
Furthermore, Wireless Telephony Application (WTA) allows call control
functionality. WAP also provides means for pushing information from a web
server to a specific device.
4.1 WML
4.1.1 Cards
<?xml version="1.0"?>
<!DOCTYPE wml PUBLIC "-//WAPFORUM//DTD WML 1.1//EN"
"http://www.wapforum.org/DTD/wml_1.1.xml">
All text is enclosed in the <p> tag. The <p> tag represents a paragraph,
and all wml documents should have at least one block of text enclosed in the
<p> tags. A simple "hello world" example is shown below in Table 45.
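A minimal deck of the kind described (a sketch; the card id and title are arbitrary choices, not the book's Table 45) would be:

```xml
<?xml version="1.0"?>
<!DOCTYPE wml PUBLIC "-//WAPFORUM//DTD WML 1.1//EN"
  "http://www.wapforum.org/DTD/wml_1.1.xml">
<wml>
  <card id="hello" title="Hello">
    <p>
      Hello world
    </p>
  </card>
</wml>
```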
4.1.2 Transitions
Jumping from one card to the next is done using the 'do' tag. The 'do'
tag has attributes 'type' and 'label'. The 'type' attribute allows mapping of
keys or actions on the device to certain actions. Upon receiving a key
action specified in the type, the browser goes to the card whose 'id' matches
the 'label' attribute. The different values for the 'type' attribute are accept,
prev, help, and options. It should be noted that the implementation of these
actions can vary depending on the device. Table 46 illustrates a transition
from one card to the next when the user clicks on the accept key on the
device.
Table 46. WML Example illustrating transitions from one card to the next.
<?xml version="1.0"?>
<!DOCTYPE wml PUBLIC "-//WAPFORUM//DTD WML 1.1//EN"
"http://www.wapforum.org/DTD/wml_1.1.xml">
<wml>
<card id="firstcard" title="Welcome">
<do type="accept" label="Ok">
<go href="#nextcard"/>
</do>
<p>
Hello world
</p>
</card>
<card id="nextcard">
<do type="prev" label="Prev">
<go href="#firstcard"/>
</do>
<p>
In the next card.
</p>
</card>
</wml>
The example shown in Table 46 also illustrates the use of the 'go' tag.
The do specifies the action, and the embedded go element indicates which
card the control should transition to. Thus, the first card is displayed on the
device. The text "Hello world" appears on the screen. When the user presses
an accept key (for instance the "ok" button on a Sprint phone), the next card
is loaded and the text "In the next card" is displayed. The card "nextcard"
has a 'go' tag within a 'do' tag that allows transition to the previous card. If
the user presses the 'previous' key (or equivalent), then the card "firstcard"
is loaded and the text "Hello world" displayed.
4.1.3 Anchors
The anchor tag is the same as the anchor tag in HTML. The anchor tag is
a child of the paragraph (<p>) element. An example of the anchor tag is
shown below in Table 47:
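A sketch of such an anchor (the target URL is a hypothetical placeholder):

```xml
<p>
  Next page:
  <anchor>
    Click this
    <go href="next.wml"/>
  </anchor>
</p>
```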
When the user clicks on the text "Click this", then the next page specified
by the href attribute of the go tag is fetched and displayed to the user.
4.1.4 Input
Input from the user can be collected using the 'input' tag in WML. The
'input' tag has attributes 'type' and 'name'. The 'type' attribute indicates the
type of input collected (for instance 'text'). The 'name' attribute associates a
variable name with the input collected, and can be referenced later in another
card as shown below in Table 48. An input variable called 'secret' is
collected in card "firstcard". When the user presses the accept key, the card
"nextcard" is accessed and displayed to the user. The value of the variable is
referenced using "$(secret)" as shown in the card "nextcard". The 'do' tag in
this card contains a 'go' element which specifies the method of transition
(i.e. POST versus GET), and the URL to transition to. The 'postfield'
element specifies the name and value of the variables that will be posted to
the specified URI. It can be noted that the value of the parameter called
'mysecret' is set to the value of the input collected in the card "firstcard",
namely "$(secret)".
<wml>
<card id="firstcard">
<do type="accept">
<go href="#nextcard"/>
</do>
<p>
Enter your secret:
<input type="text" name="secret"l>
</p>
</card>
<card id="nextcard">
<do type="accept">
<go method="post" href="www.wapserver.org/process.cgi">
<postfield name="mysecret" value="$(secret)"/>
</go>
</do>
<p>
Your secret is: $(secret)
Click ok if you want to continue.
</p>
</card>
</wml>
Another way of getting input from the user is selection from a list of
values. This is achieved using the 'select' and 'option' elements as shown
below:
<p>
Age:
<select name="age">
<option value="child">0-13</option>
<option value="teen">14-19</option>
<option value="adult">20-60</option>
<option value="senior">61-100</option>
</select>
</p>
4.1.5 Images
WAP supports the Wireless Bit Map Picture format (WBMP). The 'img'
element allows the inclusion of images in this format in a paragraph element:
<p>
<img src="home.wbmp"/>
</p>
The value of the 'src' attribute indicates the image location. Similar to
the HTML 'img' tag, other attributes include alt, width and height.
4.2 WML2.0
The WML version 2.0 was released by the WAP Forum, and provides
extensions to the WML 1.1 discussed in the preceding sections. The
motivation of WML2.0 is to extend the syntax and semantics of other
standards, namely XHTML Basic and the CSS Mobile Profile, while retaining
backward compatibility with WML1. In that regard, the WML2 specification
includes all the elements, attributes, and attribute values of XHTML and
CSS, while retaining only those elements from WML1 that cannot be
expressed with XHTML and CSS. These WML1-compatible elements are
prefixed with "wml:", and include the following:
• wml:access
• wml:anchor
• wml:card
• wml:do
• wml:getvar
• wml:go
• wml:noop
• wml:onevent
• wml:postfield
• wml:prev
• wml:refresh
• wml:setvar
• wml:timer
5.1 Introduction
[Figure 37: Building blocks of the transcoding approach — a transcoding proxy sits between the Web server and the mobile device; an XSLT processor applies style sheets to XML content to produce the wireless markup.]
Figure 37 shows the building blocks of the transcoding approach. The first
step is the processing of the HTML page. HTML pages contain a number of
tags that are pertinent to devices with large resolution and high graphics
capability displays. In many cases, tags are used inappropriately for visual
rendering effects rather than specification of document structure. For
instance, <H1> tags may be used instead of <FONT> to highlight text.
Although the adoption of XHTML and separation of content and style using
stylesheets alleviates this issue, it is still difficult to standardize the
utilization of HTML tags for text markup. Cleanup includes removal of
frames and framesets, removal of meta tags, and elimination of image and
audio elements.
The next step is parsing of the pre-processed HTML page in order to
identify the document structure. The result of parsing is a document tree
with each internal node in the tree representing a tag and each leaf
corresponding to text data. Transcoding rules determine how this tree
document structure extracted from the HTML page maps to a similar WML
document tree. For instance, the transcoding proxy may generate a WML
page listing all the header (H1) level text as a menu. When the user chooses
an item from this menu, the page transitions to another WML card with the
text that follows this header line in the original HTML page. The
transformation process may be enhanced by including meta tag information
in the HTML page in order to aid the transcoding process [Hori et al 2000].
The next issue is division of the original document into cards. Since the
display size and device memory of many mobile devices are smaller than
desktop monitors, the text that is displayed needs to be reduced to smaller
chunks. Furthermore, the sequence between these chunks needs to be
maintained, whether between cards or WML pages. The last step of the
transcoding proxy is the caching of the WML documents returned to the
clients.
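The heading-to-menu mapping described above can be sketched with Python's standard HTML parser; the transcoding rule (every H1 becomes one menu entry) and the card naming are illustrative assumptions, not the proxy's actual rule set.

```python
from html.parser import HTMLParser

class H1Extractor(HTMLParser):
    """Collect the text of every <h1> in a (pre-cleaned) HTML page."""
    def __init__(self):
        super().__init__()
        self.in_h1 = False
        self.headers = []

    def handle_starttag(self, tag, attrs):
        if tag == "h1":
            self.in_h1 = True
            self.headers.append("")

    def handle_endtag(self, tag):
        if tag == "h1":
            self.in_h1 = False

    def handle_data(self, data):
        if self.in_h1:
            self.headers[-1] += data.strip()

def transcode_to_menu(html_page):
    """Map each H1 heading to one WML menu entry linking to a card."""
    parser = H1Extractor()
    parser.feed(html_page)
    items = "".join(
        '<a href="#card%d">%s</a><br/>' % (i, h)
        for i, h in enumerate(parser.headers)
    )
    return '<wml><card id="menu"><p>%s</p></card></wml>' % items

page = "<html><body><h1>News</h1><p>...</p><h1>Sports</h1></body></html>"
print(transcode_to_menu(page))
```

A full proxy would also emit one card per heading with the text that follows it; the sketch stops at the menu card.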
1-"-"-"-"-"-"-"-"-"
I. DxumtJ\Snmre I~
c:=>l i
I
I
......
WML WAP
StyleSheet Device
Common ~
HDML HDML
XML ~ StyleSheet ... Device
DATA
~ cHTML
StyleSheet ...
PDAs
Let us consider the XML document shown in Table 49. The root element
is "EmployeeRecord" which contains the element "Name". Table 50 shows
a style sheet (wml.xsl) that is used to transform the XML document in Table
49 to the following WML:
<wml>
<card>
<p>
Albert Tan
</p>
</card>
</wml>
<EmployeeRecord>
<Name> Albert Tan </Name>
</EmployeeRecord>
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:template match="/EmployeeRecord">
<wml>
<card>
<p>
<xsl:for-each select="Name">
<xsl:value-of select="."/>
</xsl:for-each>
</p>
</card>
</wml>
</xsl:template>
</xsl:stylesheet>
<!-- Ending stylesheet tag -->
[Figure: Web short-message entities submitting messages for delivery.]
Table 51. Steps involved in transmission of a SMS message to a mobile device (GSM).
1. Short message submitted from Web Short Message Entity to SMSC.
2. SMSC processes message, then requests routing information for this mobile subscriber
from the HLR.
3. HLR sends routing information: MSC handling the mobile subscriber.
4. SMSC sends the short message to the MSC. (forward Short Message Operation)
5. MSC receives message, and gets subscriber information from the VLR.
6. MSC forwards short message to appropriate mobile station. (forward Message Operation)
7. Result of the forward message operation is transmitted from the MSC back to the SMSC.
8. Delivery status is sent by the SMSC to the Web Short Message Entity (if requested).
destination address. The key difficulty in this process is ensuring that SMS
(also called alerts) are sent on time. If 10,000 subscribers request notification
when a particular stock hits a particular high, then ideally all the alerts need
to be sent at around the same time (within a very short time window). The
number of processes or threads that must handle such notifications needs to
be tuned based on the average and maximum delivery loads on each system.
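Such fan-out under a bounded worker pool can be sketched as follows; the delivery function and pool size are illustrative stand-ins for the SMSC submission step.

```python
from concurrent.futures import ThreadPoolExecutor

def deliver(subscriber, message, sent):
    # Stand-in for submitting the short message to the SMSC; here we
    # simply record the delivery. (list.append is safe under the GIL.)
    sent.append((subscriber, message))

def send_alerts(subscribers, message, workers=8):
    """Fan the same alert out to all subscribers within a short window,
    bounded by a tunable number of worker threads."""
    sent = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for s in subscribers:
            pool.submit(deliver, s, message, sent)
    # Leaving the 'with' block waits for all submissions to finish.
    return sent

alerts = send_alerts(["+15550100", "+15550101"], "STOCK X hit target price")
print(len(alerts))
```

Tuning `workers` against the average and maximum delivery loads is exactly the sizing problem described above.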
7. EMERGING TRENDS
access standards for data and voice are not yet established and endorsed as
standards.
With 3G, bandwidths in the range of 144 Kbps (for faster moving
mobile users) to 384 Kbps (for slower moving users) are expected. High
bandwidths open up a multitude of wireless applications: the phone can be a
music player with streaming audio downloaded from the Web, or it can be
your "always-on" connection to the Internet. Such grand visions, which are
driving 3G, are becoming near-future technology. Carriers such as
NTT DoCoMo in Japan are already building (i-mode) services around high-
bandwidth technology, with specialized devices that offer a wide range of
information access and services.
8. CONCLUSION
FURTHER READING
EXERCISES
1. Answer the following as "True" or "False":
a) WML and HDML are both XML based.
b) HTML and cHTML are not related.
c) WML is a markup language supported in WAP.
d) Communication between the mobile device and the WAP
gateway is using HTTP.
e) SMS guarantees delivery of the message to the mobile device.
2. Write a WML page that allows you to browse through a list of courses
that you have taken this semester. Upon selection of a particular course, the
location of that course must be displayed on the screen. This assignment will
require the download of a WML phone simulator (e.g. from Unwired Planet
or Phone.com).
<Computer>
11
</Computer>
<Fax>
2
</Fax>
<Printer>
3
</Printer>
<Copier>
1
</Copier>
</Department>
</Inventory>
c) Write a simple transcoding program that takes the HTML generated in (a)
and converts it to WML.
4. Write a program that simulates a SMS alert generator. This alert generator
periodically refreshes values of stock tickers from a file, and generates an
alert to the set of users when the value of the stock increases by 20%.
Simulate the SMS by using email to deliver the message.
6. For each simulated request in (5), map the device request randomly to a
location in the 25x25 integer grid. Based on this location, generate a WML
page that lists the restaurants located only at that location.
Web Services
Abstract: Can the notion of web service be abstracted further? Can there be a dynamic
notion of service registration and discovery in order to use and combine
different services from different parties? What are the emergent standards that
allow such a degree of interoperability in Web services? Transport-layer
protocols such as SOAP and service description languages such as WSDL are
discussed in this chapter.
Keywords: Simple Object Access Protocol (SOAP), Universal Description Discovery and
Integration (UDDI), Web Services Description Language (WSDL), .NET
2. OVERVIEW OF ARCHITECTURE
The above aspects are not new, especially in light of the huge growth and
interest in the Web over the last ten years. Remote method invocation,
distributed computing using DCOM10, and CORBA11 have been around for a
while now. However, some of these protocols cannot be supported over
HTTP. Sun presented Java as a means of achieving "write once, run
anywhere" by introducing the concepts of the bounded JVM12, just-in-time Java
compilers, and platform independent executable byte codes. DCOM objects
exposed their methods, and can be integrated through IDL (Interface
Definition Language). In light of all the earlier efforts on remote execution,
protocols such as HTTP, integration between multiple application servers
that are distributed all over the world, and platform independent execution,
what is new in the notion around "web services"? How can web services
add value to existing frameworks, and what areas of opportunity exist in
this area?
• Discovery
Discovery is the mechanism by which services make
themselves known. This is typically achieved by having the
service registered using UDDI (Universal Description,
Discovery, and Integration). UDDI defines a way to
publish and discover information about Web Services (in
some sense analogous to DNS for mapping domain names
on the Web).
• Description
10 DCOM stands for Distributed Component Object Model defined by Microsoft for the
invocation of applications in a distributed, networked environment.
11 Common Object Request Broker Architecture (CORBA) is an architecture and
specification for creating, managing and distributing objects in a networked environment.
12 JVM is an acronym for the Java Virtual Machine, which enables the execution of
platform-independent Java bytecode.
• Transport
The communication method between the user of the
service and the service itself is called transport. Simple
Object Access Protocol (SOAP) is an emergent standard
defining the protocol for the exchange of information in a
de-centralized, distributed environment.
• Environment
The runtime in which the web services execute is
referred to as the environment. Typically, Just-In-Time (JIT)
compilers convert machine independent code to machine
dependent code in order to generate executable application
code.
[Figure: Web services protocol stack — UDDI on top of SOAP, on top of XML, on top of HTTP and TCP/IP.]
3. UDDI
Thus, any application that wants to use a web service for a specific
functionality will have to go to the UDDI Business Registry, and locate
information about such services. Conversely, companies that want to make
available their services will need to register with the UDDI registry.
The core component of the UDDI entry is an XML schema that defines
four kinds of information: business information, service information, binding
information, and service specs. These are identified by the businessEntity,
businessService, bindingTemplate, and tModel elements respectively. The
bindingTemplate generally specifies the information required to actually
invoke the service, while the technical details are accessible via the tModel
which is metadata about a specification including name, publishing
organization and URLs to the specifications.
[Figure: Locating businessEntity information in the UDDI registry.]
4. SOAP
a) Envelope
b) Encoding rules
c) RPC spec.
d) Binding convention
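The structure of such a message can be sketched as follows; the "env:" prefix matches the fault example later in this section, while the GetInventory payload and its namespace URI are illustrative assumptions:

```xml
<env:Envelope xmlns:env="http://www.w3.org/2003/05/soap-envelope">
  <env:Header>
    <!-- optional additional information about the message -->
  </env:Header>
  <env:Body>
    <m:GetInventory xmlns:m="http://example.com/inventory">
      <m:ItemId>A-100</m:ItemId>
    </m:GetInventory>
  </env:Body>
</env:Envelope>
```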
The first thing to note is that SOAP is in XML. The Envelope is the
root element of a SOAP message. The SOAP header can contain additional
information about the message. This is followed by a SOAP body which
carries the actual SOAP message. In the above example, we can see that
application specific elements (i.e. GetInventory) are embedded in the SOAP
body. SOAP also provides a mechanism of handling fault situations that can
arise in the handling or processing of messages. This is returned in response
messages using the fault elements:
<env:Fault>
<faultcode> env:Receiver </faultcode>
<faultstring>Processing error</faultstring>
</env:Fault>
Since SOAP headers are not part of the SOAP body, they may be ignored
by the SOAP processor. However, SOAP processors can be notified that the
SOAP headers must be processed by setting the "mustUnderstand" attribute
in the header elements to true.
5. PLATFORMS
Microsoft has released versions of its own web services platform called
.NET as a part of the Windows XP initiative. .NET includes a modified C
language called C#, which allows programmers to tie into the .NET features.
The .NET platform is bundled with a Just-In-Time compiler and integrates
with MSIL (Microsoft Intermediary Language). This is not a new concept
and is similar to the Java bytecode "compile once/run anywhere" effort. C#
is strikingly similar to Java in features such as XML support, implicit
garbage collection, and type-safe variables.
6. EXAMPLE OF A SERVICE
What are the steps involved in creating and using web services? We will
illustrate the development of a web service using the Java platform as an
example. The components we need are a web server (e.g. the Apache web
server) with support for the SOAP protocol. An XML parser is also necessary.
[Figure: Web service interaction — the Service Provider publishes the service to the UDDI registry (SOAP); the Service Requester looks up the service, fetches the WSDL service description, and invokes the Web Service (SOAP).]
The steps that a web service provider should perform in order to make a
web service available are:
<types>
<schema targetNamespace="http://example.com/example.xsd"
xmlns="http://www.w3.org/2000/10/XMLSchema">
<element name="SearchInputQuery" type="string"/>
<element name="SearchOutput" type="string"/>
</schema>
</types>
<message name="ExampleWebSearchInput">
<part name="body" element="xsd1:SearchInputQuery"/>
</message>
<message name="ExampleWebSearchOutput">
<part name="body" element="xsd1:SearchOutput"/>
</message>
<portType name="ExampleWebSearchType">
<operation name="GetSearchResults">
<input message="tns:ExampleWebSearchInput"/>
<output message="tns:ExampleWebSearchOutput"/>
</operation>
</portType>
<service name="ExampleSearchService">
<documentation>Example Search Service</documentation>
<port name="SearchServicePort" binding="tns:SearchServiceBinding">
<soap:address location="http://example.com/examplesearch"/>
</port>
</service>
</definitions>
For the above service, the client needs to first locate this service, retrieve
the WSDL document (shown in the table), and build SOAP messages with
the appropriate input values. Then the client invokes the service (for
example by sending the SOAP message over HTTP to the location specified
in the service location), and evaluates the SOAP response (if any) from the
service provider.
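That client-side sequence can be sketched as follows; the envelope construction and endpoint handling are illustrative assumptions, and the sketch builds and checks the request locally rather than contacting a real service.

```python
import urllib.request
import xml.etree.ElementTree as ET

ENV = "http://schemas.xmlsoap.org/soap/envelope/"

def build_search_request(query):
    """Wrap a SearchInputQuery element (from the service description
    above) in a SOAP envelope, ready to be sent over HTTP."""
    return (
        '<?xml version="1.0"?>'
        '<soap:Envelope xmlns:soap="%s"><soap:Body>'
        '<SearchInputQuery xmlns="http://example.com/example.xsd">%s'
        '</SearchInputQuery>'
        '</soap:Body></soap:Envelope>' % (ENV, query)
    )

def invoke(endpoint, payload):
    """POST the SOAP message to the soap:address location and return
    the raw response body (not called here; no request is sent)."""
    req = urllib.request.Request(
        endpoint, data=payload.encode("utf-8"),
        headers={"Content-Type": "text/xml; charset=utf-8"})
    with urllib.request.urlopen(req) as resp:
        return resp.read()

msg = build_search_request("web services")
# Parse our own request to confirm it is well-formed XML before sending.
root = ET.fromstring(msg)
print(root.tag)
```

The response, if any, would be parsed the same way to extract the SearchOutput element.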
7. LIMITATIONS
8. CONCLUSION
What is the future of web services and how will it affect the Web?
Since web services are still at an inception stage, this is difficult to predict.
But it is not hard to imagine a set of specialized function providers who
define services that are clearly specified, and a large set of application
developers who consume the services and integrate them seamlessly with the
applications developed. It also potentially brings together a large
collection of services with a large impact on usability and redundancy of
applications. Thus, there can be a number of applications or web sites, but
they may share user databases from key providers, rather than every site
having its own infrastructure (e.g. for user identification). Another area of
impact would be the mobility of services over a wide variety of platforms.
FURTHER READING
Most of the efforts on Web services are standardization efforts. In that
context, standards published by organizations such as the W3C Web Service
group ( http://www.w3c.org/2002/ws ) or the Web Services Interoperability
Organization ( http://ws-i.org/ ) are good resources for keeping up to date
with the standards such as SOAP, and WSDL. Platform specific
implementation guides and books for the leading platforms such as .NET,
SunONE, and IBM's WebSphere are abundantly available, and are good
sources for detailed descriptions of specific programmatic ways to integrate
web services with applications.
EXERCISES
1. Define an example service using WSDL.
2. Build a web server SOAP over HTTP interface to the web service
described in question 1. (You may want to use available software such as
apache SOAP/Java software to achieve this).
3. Implement a simple client that invokes the service, and displays the
result in the form of a web page.
I leave no trace of wings in the air, but I am glad I have had my flight
1. REVIEW
With the vast amount of data available on the Web, millions of users
search for and access such information. Furthermore, the Web drives a lot of
other activities such as commerce, listings, and media services. Every Web
application collects a large amount of information about users' access
patterns and interests. Tools and algorithms for mining such huge data sets
in order to extract useful patterns and trends are vital. Such extracted
patterns can be applied in a variety of ways, ranging from improving the user
interface, presenting items of more interest to the users, and enhancing the
There has been a lot of interest in the integration of mobility with Web
access. In particular, access to Web content such as e-mail, finance stock
quotes, and news are available on a variety of wireless devices. This enables
access to information anywhere, anytime. As an illustration, we presented
how the mobile infrastructure interoperates with the Web, and discussed the
WAP protocol suite. Wireless markup standards were also presented with an
overview of the Wireless Markup Language (WML). Short messaging
service is also an important aspect of wireless communication, and can be
integrated closely with Web applications. Methods of generating the wireless
content from Web servers were also illustrated with examples. Emerging
trends, such as utilization of user location information to enhance the mobile
web service, increased bandwidth, and enhanced user interfaces, are key to
the widespread growth and adoption of this fertile area.
description, and access that could lead to a widespread adoption of the web
services infrastructure.
• Distributed
• Redundancy
The system should not have any single point of failure. This is a key aspect
of a live system that should be up 100%13 of the time. For instance, if we
have two servers going through a switch that distributes requests between
the two servers, then the two systems share the load. If one server goes
down, then all traffic is routed to the other server. Once the system that's
down is restored, then both continue servicing the end user. Note that, to the
end client who is being serviced, there is no perceived downtime or
unavailability of the service. Thus, it is important to have redundancy and
provide the ability for another component to service the client in case of a
failure.
• Ability to Scale
• Distributed databases
An important aspect of Web applications is the large amount of data that
needs to be stored, managed, and manipulated, whether it is user
profiles, shopping inventories, and so on. This data should be distributed
in a scalable and replicated manner. Furthermore, redundancy and recovery
in case of failures are a crucial aspect of the system design, in order to give a
no-downtime experience to end users. Replication, consistency, and
redundancy of databases ensure that data storage and access are not points of
failure in the system.
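A rough sketch of one way such distribution might work: hash each record's key to a primary node, and keep a replica on the next node so that a single node failure loses no data. The hash function, node count, and function names here are illustrative assumptions, not a prescription from the text.

```c
#define NUM_NODES 4

/* Simple string hash (djb2-style) over a record key. */
unsigned int hash_key(const char *key) {
    unsigned int h = 5381;
    while (*key)
        h = h * 33 + (unsigned char)*key++;
    return h;
}

/* Node that stores the primary copy of the record. */
int primary_node(const char *key) {
    return (int)(hash_key(key) % NUM_NODES);
}

/* The replica lives on the next node, so losing any one node
 * still leaves a full copy of every record somewhere. */
int replica_node(const char *key) {
    return (primary_node(key) + 1) % NUM_NODES;
}
```

On a node failure, reads for its primary records are redirected to the replica node, mirroring the redundancy argument above.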
3. LIMITATIONS
• Network
With XML, there has been enormous progress in the ability of businesses
to communicate with each other, of client applications to transfer data to
servers, and of servers to exchange information with one another. Yet, there is a
lot more that needs to be done. Standards are emerging, but different groups
adopt different standards, making it difficult to converge on a common set
of standards. In some cases, it may not even be meaningful to have a single set
of standards.
• Security
Security is the heart and soul of the Web. Without security, commerce
and personalization on the Web are meaningless. The most common approach
to security is the Public Key Infrastructure (PKI). But PKI is not without
limitations. To quote from [Ellison & Schneier 2000]: "security is a chain;
it's only as strong as the weakest link". This means that having a valid
certificate authority or a very long key does not, by itself, guarantee
security. How do you protect your own private key? How do you trust the
computer that verifies the certificate, which requires only public keys? A
name is generally associated with a certificate, but the security now lies in
knowing that the name corresponds to the person using the certificate. How
does the certificate authority identify the holder of the certificate? Another
important limitation is that the user is not a part of the security design.
• Information
How do you search, retrieve, and mine the information contained in over two
billion documents, and growing? Is it even reasonable to expect a set of
relevant Web documents in response to a small query? What's the next step
beyond just text processing of information on the Web? How can we classify
documents, associate some "usefulness" or "semantics" with them, and provide
the ability to make certain documents relevant for specific users with
specific queries? The Semantic Web effort addresses the issue of marking up
documents on the Web not only for human observation and analysis, but also for
machine agents. In order to expand the Web so that automated agents can
find and extract "meaningful" information from documents, two
frameworks are added on top of the XML framework. The first is the Resource
Description Framework (RDF), which allows encoding of "meaning". The
second is "ontologies"; an example of an ontology is the definition of a
taxonomy and a set of inference rules relating the objects in the taxonomy.
Many of these efforts are still in a nascent phase, and there are still many
unexplored issues in these areas.
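As a toy illustration of the ontology idea, the following hard-codes a small taxonomy and one inference rule (is-a is transitive up the hierarchy). The type names and the rule are invented purely for illustration; real Semantic Web ontologies expressed over RDF are far richer.

```c
#include <string.h>

#define MAX_TYPES 8

/* A tiny taxonomy: parent[i] is the index of type i's parent, -1 at the root.
 * Names and relations here are made up for illustration only. */
const char *type_name[MAX_TYPES] = { "Resource", "Document", "WebPage", "NewsArticle" };
int parent[MAX_TYPES] = { -1, 0, 1, 2 };

int type_index(const char *name) {
    for (int i = 0; i < MAX_TYPES && type_name[i]; i++)
        if (strcmp(type_name[i], name) == 0)
            return i;
    return -1;
}

/* Inference rule: x is-a y if y appears on x's chain of ancestors. */
int is_a(const char *x, const char *y) {
    int i = type_index(x), j = type_index(y);
    if (i < 0 || j < 0)
        return 0;
    for (; i >= 0; i = parent[i])
        if (i == j)
            return 1;
    return 0;
}
```

With such a rule, an agent can conclude that a news article is also a document (and a resource) even though that fact is never stated directly.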
4. THE FUTURE
With the adoption of Web services, we will see more centralized efforts to
maintain, store, and protect data such as user registration
information, wallet information, and personalization models. The notion of
every Web site maintaining its own user registration database will be over.
The need for centralized data storage will push the technology for
enhanced security and privacy measures. Perhaps the solution may also
lie outside cryptography, such as in biometric techniques for enhanced security.
On the privacy front, client applications will become more sensitive about the
type and content of information that is exchanged. It is likely that
much of the client or user tracking data will have to be managed on the
server side, and client-side information such as cookies may need to be
extensively enhanced in order to cope with privacy and security
requirements.
With the vast number of devices already connected to the Web in one
manner or another, and with the compelling need for, or convenience of, access
to Web information anywhere and anytime, there will definitely be a resurgence
of Web-centric mobile devices. They may not be tied to Web portals or
browsing per se, but rather built on general Web services available for a
• Create a socket
• Bind that socket to an address/port
• Listen for client requests
• Upon client request, spin off a thread or a process to handle that
client.
• This client process can read and write to the socket, as needed.
• Repeat above steps.
/*
Simple server program in C
Platform: UNIX
To compile, type:
gcc server.c -o server
To run the server, just type 'server'
*/
#include <sys/types.h>
#include <sys/socket.h>
#include <stdio.h>
#include <netinet/in.h>
#include <netdb.h>
#include <errno.h>
main() {
  int sock, clientsock;
  struct sockaddr_in serverAddress;
  int PortNo = 8000; /* port to listen on */
  int child;

  /* create socket */
  if( (sock = socket(AF_INET, SOCK_STREAM, 0)) < 0) {
    perror("Server cannot create socket.");
    exit(1);
  }

  /* bind the socket to the server address/port */
  bzero((char *) &serverAddress, sizeof(serverAddress));
  serverAddress.sin_family = AF_INET;
  serverAddress.sin_addr.s_addr = htonl(INADDR_ANY);
  serverAddress.sin_port = htons(PortNo);
  if( bind(sock, (struct sockaddr *) &serverAddress, sizeof(serverAddress)) < 0) {
    perror("Server cannot bind socket.");
    exit(1);
  }

  /* This sets the listen queue for this socket to a maximum of 5 clients */
  listen(sock, 5);

  for(;;) {
    /* accept the next client connection */
    if( (clientsock = accept(sock, (struct sockaddr *) 0, (int *) 0)) < 0) {
      perror("Error accepting client.");
      exit(1);
    }
    /* For each client request, fork to have another process service that client */
    /* Other options include spawning a separate thread */
    if( (child = fork()) < 0) {
      perror("Error creating child.");
      exit(1);
    }
    else if (child == 0) {
      /* if child is 0, then this is the process servicing the request */
      close(sock); /* Doesn't require parent socket */
      ServeRequest(clientsock);
      close(clientsock);
      exit(0);
    }
    close(clientsock); /* parent does not need the connected socket */
  }
  close(sock);
}
#define BUFFERLEN 40

int ServeRequest(int clientsock)
{
  char buff[BUFFERLEN];
  int msgLength;
  bzero(buff, BUFFERLEN);
  /** Receive message from client **/
  if( (msgLength = recv(clientsock, buff, BUFFERLEN, 0)) < 0) {
    perror("Error receiving message from client.");
    return -1;
  }
  printf("Received: %s\n", buff);
  return 0;
}
• Create a socket
• Connect to a server at a specific port
• If connection was successful, send data, or read data as required.
/*
Simple client program
To compile on UNIX, type:
gcc -lsocket -lnsl client.c -o client
To run, type:
client <hostname> <port>
The message "This is a test" will be sent to the server.
*/
#include <sys/types.h>
#include <sys/socket.h>
#include <stdio.h>
#include <netinet/in.h>
#include <netdb.h>
#include <errno.h>
main (int argc, char **argv) {
  int sock;
  struct sockaddr_in serverAddress;
  struct hostent *hstp, *gethostbyname();
  const char *message = "This is a test";

  if(argc < 3) {
    fprintf(stderr, "Usage: client <hostname> <port>\n");
    exit(1);
  }
  /* ... create the socket, resolve argv[1], and connect to the given port ... */
  send(sock, message, strlen(message), 0);
  close(sock);
}
c. Apache Software
One of the most commonly used Web servers15 is the one from the Apache Software
Foundation (http://www.apache.org). Some of the key software includes the
HTTP web server, Jakarta (server-side software for Java platforms),
mod_perl (which allows Apache modules to be written in Perl), PHP (a server-
side scripting language that can be embedded in HTML for dynamic content
generation), and the XML Project (Xerces, which contains the XML parser, the SOAP
parser, and Xalan). Instructions for downloading, compiling, and installing an
HTTP server can be found on the website.
Once compiled, the Apache server can be started on the machine. The
Apache UNIX HTTP server program is called "httpd". The server can be
started by simply executing this command. The httpd.conf file specifies
various directives to the Apache server. For instance, the 'Listen' directive
indicates which port the server should listen on. If successful, then requesting
the page "http://localhost/" with a web browser on the machine
you are running the server on should show a valid Apache page. Similarly,
the Apache server is configurable on Windows platforms.
15 Some surveys estimate that 54% of the servers on the Web use Apache.
This section describes how to create a simple CGI program in Perl with the
Apache server. In the httpd.conf file, we can add some directives to enable
CGI requests to the server. The directive:

ScriptAlias /cgi-bin/ /usr/local/apache/cgi-bin/

instructs the Apache server to map any request with the prefix /cgi-bin/ to
the path /usr/local/apache/cgi-bin/. Thus, the request http://localhost/cgi-
bin/test.pl will actually result in invoking the CGI Perl program that resides
in /usr/local/apache/cgi-bin/test.pl. The test.pl Perl program will then
process the input request and generate (perhaps dynamic HTML) pages for
display to the user.
The two basic things that your CGI program must do in order to be
properly processed are to return a MIME header, and then write out the response
in HTML. The MIME header indicates to the client browser what type of
data to expect. A common header is:
Content-type: text/html
In order to write your first Perl CGI program, edit a file called test.pl and
type the following contents into that file:
#!/usr/bin/perl
print "Content-type: text/html\r\n\r\n";
print "Hello, World.";
Make sure that this is saved in the path that /cgi-bin/ is aliased to (with
ScriptAlias). Thus, if the above program is saved as /usr/local/apache/cgi-
bin/test.pl, and the Apache config has the ScriptAlias that we mentioned
earlier, then a request to http://localhost/cgi-bin/test.pl should generate the
content "Hello, World." on the browser. There are many Perl modules with
specialized CGI functionality built in, and the reader can refer to
one of the many books on learning Perl and applying it to write CGI
programs. Some useful guides to Perl are [Schwartz & Pheonix 2001] and
[Guelich et al 2001].
The Web service development pack from Sun is the Java Web
Services Developer Pack Early Access 2 (JWSDP EA2) release. This
includes Java APIs for XML messaging, XML processing,
registries, XML Remote Procedure Calls (RPC), and other tools.
JAX-RPC is a set of Java APIs that enables the user to perform Remote
Procedure Calls (RPC) and implement the web service using Java.
The steps involved in defining a Java-based Web service are as
follows:
import javax.xml.rpc.server.ServiceLifecycle;
}
}
16 Stubs and ties are intermediaries that enable communication between a service endpoint
and a service client.
On the client side, it is easy to invoke the service using the client stub
for the service, such as:
ExampleSearchService_Stub stub =
    (ExampleSearchService_Stub)
    (new ExampleSearchService_Impl().getExampleSearchServicePort());
stub._setProperty(Stub.ENDPOINT_ADDRESS_PROPERTY,
    "http://testserver:8080/ExampleService/jaxrpc/ExampleServiceIF");
String SResults = stub.getSearchResults(queryString);
[Ballard 1999] Dana H. Ballard, An Introduction to Natural Computation, MIT Press, 1999.
[Barford & Crovella 1999] Paul Barford and Mark E. Crovella. Measuring Web performance
in the wide area. Performance Evaluation Review, Special Issue on Network Track
Measurement and Workload Characterization, August 1999.
[Ben-Shaul et al 1999] Israel Ben-Shaul, et al., Adding support for dynamic and focused
search with Fetuccino, in Proc. of the Eighth World Wide Web Conference (WWW'8),
1999.
[Bharat & Broder 1998] Krishna Bharat and Andrei Broder, A technique for measuring the
relative size and overlap of public Web search engines, in Proc. of the Seventh World
Wide Web Conference (WWW'7), 1998.
[Bharat & Henzinger 1998] K. Bharat and M. Henzinger, Improved Algorithms for Topic
Distillation in a Hyperlinked Environment, Proc. of the ACM SIGIR conference, 1998.
[Bharat et al 1998] Krishna Bharat, Andrei Broder, Monika Henzinger, Puneet Kumar, and
Suresh Venkatasubramanian, The Connectivity Server: fast access to linkage information
on the Web, in Proc. of 7th World Wide Web Conference, 1998.
[Bhuyan et al 1991] Jay N. Bhuyan, Jitender S. Deogun, and Vijay V. Raghavan, Cluster-
based adaptive information retrieval, in Proc. of the 24th Intl. Conference on System
Sciences, Architecture and Emerging Technology Tracks, vol. 1, pp: 307-316, Jan. 1991.
[Bowman et al 1994] Bowman M., Danzig P., Hardy D., Manber U., Schwartz M., and
Wessels D., The Harvest Information Discovery and Access System, in Proc. of Second
Intl. World Wide Web Conference, 1994.
[Breese et al 1998] John S. Breese, David Heckerman, and Carl Kadie, Empirical analysis of
predictive algorithms for collaborative filtering, in Proceedings of the Fourteenth Annual
Conference on Uncertainty in Artificial Intelligence, pages 43-52, July 1998.
[Breiman et al 1984] Breiman, L., J.H. Friedman, R. Olshen, and C.J. Stone, Classification and
Regression Trees, Wadsworth, Belmont, CA, 1984.
[Brin & Page 1998] Sergey Brin and Lawrence Page, The Anatomy of a Large-Scale
Hypertextual Web Search Engine, in the Proc. of the Seventh World Wide Web
Conference (WWW'7), 1998.
[Brin 1998] S. Brin, Extracting Patterns and Relations from the World Wide Web, in WebDB
workshop at 6th Intl. Conference on Extending Database Technology (EDBT'98), 1998.
[CARP 1998] V. Valloppillil and K.W. Ross, Cache Array Routing Protocol v1.0,
http://icp.ircache.net/carp.txt, Aug. 1998.
[CEN/ISSS 2001] Summaries of some Frameworks, Architectures, and Models for Electronic
Commerce, revision 1.a, Oct. 2001, CEN/ISSS Electronic Commerce Workshop, 2001.
[Chekuri et al 1997] Chandra Chekuri, Michael H. Goldwasser, Prabhakar Raghavan, and Eli
Upfal, Web search using automatic classification, in Proc. of the 6th World Wide Web
Conference (WWW'6), 1997.
[Chen & George 2000] Y.H. Chen and E. I. George, A Bayesian Model for Collaborative
Filtering, 2000 (Department of MSIS, University of Texas at Austin).
[Cheung et al 1997] D. W. Cheung, B. Kao, and J. W. Lee, Discovering User Access Patterns
on the World-Wide Web, Proc. First Pacific-Asia Conference on Knowledge Discovery
and Data Mining (PAKDD-97), 1997.
[Cho et al 1998] Junghoo Cho, Hector Garcia-Molina, and Lawrence Page, Efficient crawling
through URL ordering, in Proc. of 7th World Wide Web conference, 1998.
[Cho & Garcia-Molina 2000a] Junghoo Cho, Hector Garcia-Molina, The Evolution of the
Web and Implications for an Incremental Crawler, in Proceedings of the 26th International
Conference on Very Large Databases (VLDB), September 2000.
[Cho & Garcia-Molina 2000b] Junghoo Cho, Hector Garcia-Molina, Synchronizing a Database
to Improve Freshness, in Proceedings of the 2000 ACM International Conference on
Management of Data (SIGMOD), May 2000.
[Clark 1999] David Clark, Preparing for a New Generation of Wireless Data, IEEE
Computer, pp:8-11, Aug. 1999.
[Croft et al1995] Croft, W. B., Cook, R., and Wilder, D., Providing government information
on the Internet: Experiences with THOMAS, in Proc. of the Digital Libraries Conference
(DL'95), 1995.
[De Bra et al1994] P. De Bra, G. J. Houben, Y. Kornatzky, and R. Post, Information retrieval
in distributed hypertexts, in Proc. of RIAO'94, Intelligent Multimedia, Information
Retrieval Systems and Management, New York, NY, 1994.
[Dempster et al 1977] A.P. Dempster, N.M. Laird, and D.B. Rubin, Maximum Likelihood
from Incomplete Data via the EM Algorithm, Journal of the Royal Statistical Society,
Series B (Methodological), 39(1): 1-38, 1977.
[Dreilinger & Howe 1997] Daniel Dreilinger and Adele E. Howe, Experiences with Selecting
Search Engines using Meta-Search, ACM Trans. on Information Systems, vol. 15, no. 3,
pp: 195-222, July 1997.
[Duda & Hart 1973] R. O. Duda and P.E. Hart, Pattern Classification and Scene Analysis,
New York: Wiley, 1973.
[Ellison & Schneier 2000] C. Ellison and B. Schneier, Ten Risks of PKI: What You're Not
Being Told About Public Key Infrastructure, Computer Security Journal, vol. XVI, no. 1,
pp: 1-8, 2000.
[Ester et al 1996] M. Ester, H.P. Kriegel, J. Sander, and X. Xu, A Density Based Algorithm
for Discovering Clusters in Large Spatial Databases, Proc. 1996 Intl. Conf. Knowledge
Discovery and Data Mining (KDD'96), pp: 226-231.
[Fisher 1936] R. A. Fisher, The use of multiple measurements in taxonomic problems, Ann.
Eugen., vol. 7, pp. 178-188, 1936.
[Foo et al 2000] Soo Mee Foo (Ed.), Ted Wugofski, Wei Meng Lee, and Karli
Watson, Beginning WAP, WML, and WMLScript, Wrox Press, 2000.
[Frakes & Baeza-Yates 1993] Edited by William B. Frakes, and Ricardo Baeza-Yates,
Information Retrieval: Data Structures & Algorithms, Prentice Hall, 1993.
[Francis & Kucera 1982] Francis, W., and H. Kucera, Frequency Analysis of English Usage,
New York: Houghton Mifflin, 1982.
[Frost 2000] Martin Frost, Learning WML and WMLScript, O'Reilly, 2000.
[Garfinkel et al 2002] Simson Garfinkel, Gene Spafford, Debby Russell, Web Security,
Privacy and Commerce, O'Reilly & Associates, 2002.
[Garofalakis et al 1999] John Garofalakis, Panagiotis Kappos, and Dimitris Mourloukos, Web
Site Optimization Using Page Popularity, pp: 22-29, IEEE Internet Computing, July-
August 1999.
[Gauch et al 1999] Susan Gauch, Jianying Wang, and Satya Mahesh Rachakanda, A corpus
analysis approach for automatic query expansion and its extension to multiple databases,
ACM Trans. on Information Systems, vol. 17, no. 3, pp: 250-269, July 1999.
[Guelich et al 2001] Scott Guelich, Shishir Gundavaram, and Gunther Birznieks, CGI
Programming with Perl, O'Reilly, 2001.
[Guthery 2001] Scott Guthery, Mobile Application Development: Using SMS and the SIM
Toolkit, McGraw Hill, 2001.
[Hafer & Weiss 1974] Hafer, M., and S. Weiss, Word Segmentation by Letter Successor
Varieties, Information Storage and Retrieval, 10, 371-385, 1974.
[Han et al 2001] J. Han, M. Kamber, and A. K. H. Tung, Spatial Clustering Methods in Data
Mining: A Survey, in H. Miller and J. Han (eds.), Geographic Data Mining and Knowledge
Discovery, Taylor and Francis, 2001.
[Hawking et al 1999] David Hawking, Nick Craswell, and Paul Thistlewaite, Results and
challenges in Web search evaluation, in Proc. of the 8th World Wide Web Conference
(WWW'8), 1999.
[Henzinger et al 1999] Monika R. Henzinger, Allan Heydon, Michael Mitzenmacher, and Marc
Najork, Measuring index quality using random walks on the Web, in Proc. of the Eighth
World Wide Web Conference (WWW'8), 1999.
[Hori et al 2000] Masahiro Hori, Goh Kondoh, Kouichi Ono, Shin-ichi Hirose, and Sandeep
Singhal, Annotation-Based Web Content Transcoding, Proc. of the 9th Intl. World Wide
Web conference, Amsterdam, May 2000.
[IETF IOTP Req. 2001] D. E. Eastlake, Internet Open Trading Protocol: Version 2
Requirements, IETF Internet Draft, 2001.
[IETF Pay IOTP 2001] W. Hans, Y. Kawatsura, and M. Hiroya, Payment API for v1.0
Internet Open Trading Protocol, IETF Internet Working Draft, 2001.
[Kaasinen et al 2000] Eija Kaasinen, Matti Aaltonen, Juha Kolari, Suvi Melakoski, and Timo
Laakko, Two approaches to bringing Internet services to WAP devices, Proc. of the 9th
Intl. World Wide Web Conference, Amsterdam, May 2000.
[Kay 2001] M. H. Kay, XSLT Programmer's Reference 2nd Edition, Wrox Press Inc., 2001.
[Kobielus 2001] J. Kobielus, XML Based Specifications for Security Interoperability, The
Burton Group, Network Strategy Overview, June 2001.
[Krishnamurthy & Rexford 2001] B. Krishnamurthy and J. Rexford, Web Protocols and
Practice: HTTP/1.1, Networking Protocols, Caching, and Traffic Measurement, Addison-
Wesley, 2001.
[Lawrence & Giles 1999] Steve Lawrence and Lee Giles, Accessibility and distribution of
information on the web, Nature, 400, pp 107-109, 1999.
[Manber et al 1997] Udi Manber, Mike Smith, and Burra Gopal, WebGlimpse: Combining
Browsing and Searching, in Proceedings of the 1997 Usenix Technical Conference, 1997.
[Mandelbrot 1983] B.B. Mandelbrot, The Fractal Geometry of Nature, Freeman, New York,
1983.
[McBryan 1994] Oliver McBryan, GENVL and WWWW: Tools for Taming the Web, First
World Wide Web Conference, 1994.
[Mercer 2001] D. Mercer, XML: A Beginner's Guide, Osborne/McGraw-Hill, 2001.
[Meyer 2000] E. A. Meyer, Cascading Style Sheets: The Definitive Guide, O'Reilly and
Associates, 2000.
[Miller & Bharat 1998] Robert C. Miller and Krishna Bharat, SPHINX: A framework for
creating personal, site-specific Web crawlers, in Proc. of the 7th World Wide Web
Conference, 1998.
[Najork & Wiener 2001] M. Najork and J. L. Wiener, Breadth-first search crawling yields
high-quality pages, Proc. of the 10th Intl. World Wide Web Conference (WWW10), May 2-5,
Hong Kong, 2001.
[Park et al 1995] J. S. Park, M.S. Chen, and P. S. Yu, An Effective Hash-Based Algorithm for
Mining Association Rules, 1995 SIGMOD, pp. 175-186.
[Perkowitz & Etzioni 1999] Mike Perkowitz and Oren Etzioni, Towards Adaptive Web Sites:
Conceptual Framework and Case Study, in Proc. of the 8th World Wide Web Conference
(WWW8), Toronto, 1999.
[Porter 1980] Porter, M.F., An Algorithm for Suffix Stripping, Program, 14(3), 130-137,
1980.
[Quinlan 1993] J. R. Quinlan, C4.5: Programs for Machine Learning, Morgan Kaufmann, San
Mateo, CA, 1993.
[Raggett et al 1998] Dave Raggett, Jenny Lam, Ian Alexander, and Michael Kmiec, Raggett
on HTML 4, Addison-Wesley, 1998.
[RFC 1939] J. Myers and M. Rose, Post Office Protocol (POP) Version 3, RFC 1939, 1996.
[RFC 2060] M. Crispin, Internet Message Access Protocol, Version 4rev1, RFC 2060, 1996.
[RFC 2821] J. Klensin (Ed.), Simple Mail Transfer Protocol, RFC 2821, 2001.
[Rumelhart et al 1987] D.E. Rumelhart, G.E. Hinton, and R.J. Williams, Learning internal
representations by error propagation, in Parallel Distributed Processing, D. E.
Rumelhart and J. L. McClelland, Eds., vol. 1, Cambridge, MA: MIT Press, 1987, pp. 318-
362.
[Sarukkai 2000] R. R. Sarukkai, Link prediction and path analysis using Markov chains, in
Computer Networks, pages 1-6, June 2000; also presented at the 9th World Wide Web
Conference (WWW9), Amsterdam.
[Sarukkai 2002] R. R. Sarukkai, The Synaptic World Wide Web, manuscript in preparation.
[Schwartz & Pheonix 2001] Randal L. Schwartz, & Tom Phoenix, Learning Perl (3rd
Edition), O'Reilly, 2001.
[SciAmer 1999] Members of the IBM Clever Project, Hypersearching the Web, in Scientific
American, June 1999.
[Shahabi et al 1997] Cyrus Shahabi, Amir M. Zarkesh, Jafar Adibi, and Vishal Shah,
Knowledge Discovery from Users' Web-Page Navigation, IEEE RIDE 1997.
[Shardanand & Maes 1995] Upendra Shardanand and Pattie Maes, Social Information
Filtering: Algorithms for Automating "Word of Mouth", Proc. of CHI'95, Denver, Colorado,
USA, May 7-11, 1995.
[Soboroff & Nicholas 1999] Ian M. Soboroff and Charles K. Nicholas. Combining content
and collaboration in text filtering, in Thorsten Joachims, editor, Proceedings of the
IJCAI'99 Workshop on Machine Learning in Information Filtering, pages 86-91,
Stockholm, Sweden, August 1999.
[Srikant & Agrawal 1995] R. Srikant and R. Agrawal, Mining Generalized Association
Rules, in Proc. of the 21st Conference on Very Large Databases, 1995.
[Srikant & Yang 2001] R. Srikant and Y. Yang, Mining Web Logs to Improve Website
Organization, in Proc. of the 10th Intl. World Wide Web Conference (WWW10), May 1-5,
2001.
[Srikant et al 1997] R. Srikant, Q. Vu, and R. Agrawal, Mining Association Rules with Item
Constraints, in Proc. of the 3rd Intl. Conference on Knowledge Discovery in Databases and
Data Mining, Newport Beach, California, Aug. 1997.
[Stevens 1994] W. R. Stevens, TCP/IP Illustrated, Volume 1: The Protocols, Addison-Wesley,
1994.
[Tanenbaum 1996] A. S. Tanenbaum, Computer Networks, 3rd Edition, Prentice Hall PTR,
1996.
[Wang et al 1997] W. Wang, J. Yang, and R. Muntz, STING: A Statistical Information Grid
Approach to Spatial Data Mining, Proc. 1997 Intl. Conf. Very Large Databases
(VLDB'97), pp: 186-195.
[Wexelblat & Maes 1999] Alan Wexelblat and Pattie Maes, Footprints: History-Rich Tools
for Information Foraging, CHI'99.
[Wu et al 2001] Z. Wu, W. Meng, C. Yu, and Z. Li, Towards a Highly-Scalable and Effective
Metasearch Engine, in Proc. of the 10th Intl. World Wide Web Conference (WWW10),
May 2-5, 2001.
[WWW Google] http://www.google.com/
[WWW .NET] http://www.microsoft.com/net
[WWW SearchEngineWatch] http://searchenginewatch.internet.com/reports/sizes.html
[WWW SearchDirectory Sites] http://www.yahoo.com/, http://google.yahoo.com/,
http://www.msn.com/
[WWW WebSphere] http://www.ibm.com/websphere
[Zamir & Etzioni 1998] Oren Zamir and Oren Etzioni, Grouper: A Dynamic Clustering
Interface to Web Search Results, in Proc. of the 7th World Wide Web Conference
(WWW'7), 1998.
[Zipf 1949] G. Zipf, Human Behavior and the Principle of Least Effort, Reading, MA:
Addison-Wesley, 1949.