
Facilitating Effective User Navigation

2. LITERATURE SURVEY

2.1. A Survey of Web Metrics

The unabated growth and increasing significance of the World Wide Web have
resulted in a flurry of research activity to improve its capacity for serving information
more effectively. At the heart of these efforts lie implicit assumptions about the
“quality” and “usefulness” of Web resources and services. This observation points
towards measurements and models that quantify various attributes of web sites. The
science of measuring all aspects of information, especially its storage and retrieval,
or informatics, has interested information scientists for decades before the existence of
the Web. Is Web informatics any different, or is it just an application of classical
informatics to a new medium? In this paper, we examine this issue by classifying and
discussing a wide-ranging set of Web metrics. We present the origins, measurement
functions, formulations, and comparisons of well-known Web metrics for quantifying
Web graph properties, web page significance, web page similarity, search and
retrieval, usage characterization, and information-theoretic properties. We also discuss
how these metrics can be applied for improving Web information access and use.

In this paper we have reviewed and classified some well-known Web metrics.
Our approach has been to consider these metrics in the context of improving Web
content while intuitively explaining their origins and formulations. This analysis is
fundamental to modeling the phenomena that give rise to the measurements. To our
knowledge, this is the first survey that incorporates an extensive treatment of a wide
range of metrics and measurement functions. Nevertheless, we do not claim this
survey is complete, and we acknowledge any omissions. We hope that this initiative
will serve as a reference point for the further evolution of new metrics for
characterizing and quantifying information on the Web and for developing the
explanatory models associated with them.
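As a concrete illustration of the “web page significance” metrics surveyed here, the sketch below implements a basic power-iteration PageRank in Python. The four-page link graph is invented for illustration and is not data from the survey.

```python
# A minimal power-iteration sketch of PageRank, one of the classic
# "web page significance" metrics. The link graph below is invented.

def pagerank(links, damping=0.85, iters=50):
    """links: dict mapping page -> list of pages it links to."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iters):
        new = {p: (1.0 - damping) / n for p in pages}
        for p, outs in links.items():
            if outs:                      # distribute rank over out-links
                share = damping * rank[p] / len(outs)
                for q in outs:
                    new[q] += share
            else:                         # dangling page: spread uniformly
                for q in pages:
                    new[q] += damping * rank[p] / n
        rank = new
    return rank

toy_graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(toy_graph)
# "C" receives links from every other page, so it should rank highest
assert max(ranks, key=ranks.get) == "C"
assert abs(sum(ranks.values()) - 1.0) < 1e-6
```

Note that the total rank mass stays at 1.0 on every iteration, which is a useful sanity check for any implementation.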

AVANTHI Institute Of Engineering & Technology,
Gunthapally, RR Dist.
Department of Computer Science Engineering

2.2. Towards Adaptive Web Sites: Conceptual Framework and Case Study

The creation of a complex web site is a thorny problem in user interface
design. In this paper we explore the notion of adaptive web sites: sites that
semi-automatically improve their organization and presentation by learning from visitor
access patterns. It is easy to imagine and implement websites that offer shortcuts to
popular pages. Are more sophisticated adaptive web sites feasible? What degree of
automation can we achieve?

To address the questions above, we describe the design space of adaptive web sites
and consider a case study: the problem of synthesizing new index pages that facilitate
navigation of a web site. We present the PageGather algorithm, which automatically
identifies candidate link sets to include in index pages based on user access logs. We
demonstrate experimentally that PageGather outperforms the Apriori data mining
algorithm on this task. In addition, we compare PageGather's link sets to pre-existing,
human-authored index pages. The work reported in this paper is part of our ongoing
research effort to develop adaptive web sites and increase their degree of automation.
We list our main contributions below:

1. We motivated the notion of adaptive web sites and analyzed the design space for
such sites, locating previous work in that space.

2. To demonstrate the feasibility of non-trivial adaptations, we presented a case study
in the domain of synthesizing new index pages. We identified a key subproblem that is
amenable to an automatic solution.

3. Next, we presented the fully-implemented PageGather algorithm for discovering
candidate index page contents based on visitor access patterns extracted from web
server logs.

4. We experimentally compared PageGather's output with the frequent sets discovered
by the Apriori data mining algorithm and with human-authored index pages. Finally,
we identified the generation of complete and pure index pages as a key next step in the
automation of index page synthesis. Index page synthesis itself is a step towards the
long-term goal of change in view: adaptive sites that automatically suggest alternative
organizations of their contents based on visitor access patterns.
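The PageGather idea described above can be sketched roughly as follows. The actual algorithm mines cliques over a conditional-probability co-occurrence matrix; this simplified Python version uses raw co-occurrence counts, a threshold, and connected components, and the visit sessions are invented.

```python
# A hedged sketch of the PageGather idea: from per-visitor page sets,
# build a co-occurrence graph and take its connected components as
# candidate index-page link sets. (The real algorithm uses cliques and
# conditional-probability weights; this is a simplification.)

from itertools import combinations
from collections import Counter

def candidate_link_sets(sessions, min_cooccur=2):
    cooccur = Counter()
    for pages in sessions:
        for a, b in combinations(sorted(set(pages)), 2):
            cooccur[(a, b)] += 1
    # keep only page pairs that co-occur often enough
    graph = {}
    for (a, b), n in cooccur.items():
        if n >= min_cooccur:
            graph.setdefault(a, set()).add(b)
            graph.setdefault(b, set()).add(a)
    # connected components of the thresholded graph
    seen, components = set(), []
    for start in graph:
        if start in seen:
            continue
        comp, stack = set(), [start]
        while stack:
            node = stack.pop()
            if node not in comp:
                comp.add(node)
                stack.extend(graph[node] - comp)
        seen |= comp
        components.append(comp)
    return components

# invented sessions: pages visited together become one candidate link set
sessions = [["a", "b", "c"], ["a", "b"], ["a", "b", "c"], ["x", "y"], ["x", "y"]]
sets_found = candidate_link_sets(sessions)
assert {"a", "b", "c"} in sets_found and {"x", "y"} in sets_found
```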

2.3. Closing the Loop in Webpage Understanding

The two most important tasks in information extraction from the Web are
webpage structure understanding and natural language sentence processing.
However, little work has been done towards an integrated statistical model for
understanding webpage structures and processing natural language sentences within
the HTML elements. Our recent work on webpage understanding introduces a joint
model of Hierarchical Conditional Random Fields (i.e. HCRF) and extended Semi-
Markov Conditional Random Fields (i.e. Semi-CRF) to leverage the page structure
understanding results in free text segmentation and labeling. In this top-down
integration model, the decision of the HCRF model could guide the decision-making
of the Semi-CRF model. However, the drawback of the top-down integration strategy
is also apparent, i.e., the decision of the Semi-CRF model could not be used by the
HCRF model to guide its decision-making. This paper proposes a novel framework
called WebNLP, which enables bidirectional integration of page structure
understanding and text understanding in an iterative manner. We have applied the
proposed framework to local business entity extraction and Chinese person and
organization name extraction. Experiments show that the WebNLP framework
achieved significantly better performance than existing methods.

Webpage understanding plays an important role in web search and mining. It
contains two main tasks, i.e., page structure understanding and natural language
understanding. However, little work has been done towards an integrated statistical
model for understanding webpage structures and processing natural language
sentences within the HTML elements. In this paper, we introduced the WebNLP
framework for webpage understanding. It enables bidirectional integration of page
structure understanding and natural language understanding. Specifically, the
WebNLP framework is composed of two models, i.e., the extended HCRF model for
structure understanding and the extended Semi-CRF model for text understanding.
The performance of both models can be boosted in the iterative optimization
procedure. An auxiliary corpus is introduced to train the statistical language features
in the extended Semi-CRF model for text understanding, and multiple-occurrence
features are also used in the extended Semi-CRF model by adding the model's
decisions from the last iteration. The extended Semi-CRF model is therefore improved
by using both the labels of the vision nodes assigned by the HCRF model and the text
segmentation and labeling results given by the extended Semi-CRF model itself in the
last iteration as additional inputs to some feature functions; the extended HCRF model
benefits from the extended Semi-CRF model by using the segmentation and labeling
results of the text strings explicitly in its feature functions. The WebNLP framework
closes the loop in webpage understanding for the first time. The experimental results
show that the WebNLP framework performs significantly better than state-of-the-art
algorithms on English local entity extraction and Chinese named entity extraction on
Web pages.
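The bidirectional iteration can be pictured as a simple fixed-point loop in which the two models repeatedly exchange their latest decisions. The sketch below uses trivial rule-based stand-ins for the HCRF and Semi-CRF models on an invented toy page; it illustrates only the control flow, not actual CRF inference.

```python
# A schematic of WebNLP's bidirectional iteration: a structure model and a
# text model exchange decisions until the labels stop changing. The two
# "models" are invented stand-in rules, not real CRFs.

def structure_model(nodes, text_labels):
    # label a node ADDRESS if its text was segmented as street + city
    return {i: ("ADDRESS" if text_labels.get(i) == ("STREET", "CITY") else "OTHER")
            for i in nodes}

def text_model(nodes, struct_labels):
    # inside an ADDRESS node (or a node containing a digit), split street/city
    labels = {}
    for i, text in nodes.items():
        if struct_labels.get(i) == "ADDRESS" or any(ch.isdigit() for ch in text):
            labels[i] = ("STREET", "CITY")
    return labels

def webnlp_iterate(nodes, max_iters=10):
    struct, text = {}, {}
    for _ in range(max_iters):
        new_struct = structure_model(nodes, text)
        new_text = text_model(nodes, new_struct)
        if new_struct == struct and new_text == text:   # fixed point reached
            break
        struct, text = new_struct, new_text
    return struct, text

nodes = {0: "10 Main St, Springfield", 1: "About us"}   # invented toy page
struct, text = webnlp_iterate(nodes)
assert struct[0] == "ADDRESS" and struct[1] == "OTHER"
```

The point of the loop is that the text model's first pass (triggered here by the digit) lets the structure model relabel the node, which would in turn license further text decisions on a richer page.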

2.4. Mining Web Informative Structures and Contents Based on Entropy Analysis

In this paper, we study the problem of mining the informative structure of a
news Web site that consists of thousands of hyperlinked documents. We define the
informative structure of a news Web site as a set of index pages (or referred to as
TOC, i.e., table of contents, pages) and a set of article pages linked by these TOC
pages. Based on the Hyperlink-Induced Topic Search (HITS) algorithm, we propose
an entropy-based analysis mechanism, LAMIS, for analyzing the entropy of anchor
texts and links to eliminate the redundancy of the hyperlinked structure, so that the
complex structure of a Web site can be distilled. However, to increase the value and
the accessibility of pages, most content sites tend to publish their pages with
intra-site redundant information, such as navigation panels, advertisements, copyright
announcements, etc. To further eliminate such redundancy, we propose another
mechanism, called InfoDiscoverer, which applies the distilled structure to identify sets
of article pages. InfoDiscoverer also employs the entropy information to analyze the
information measures of article sets and to extract informative content blocks from
these sets. Our result is useful for search engines, information agents, and crawlers to
index, extract, and navigate significant information from a Web site. Experiments on
several real news Web sites show that the precision and the recall of our approaches
are much superior to those obtained by conventional methods in mining the
informative structures of news Web sites. On average, the augmented LAMIS
leads to prominent performance improvement and increases the precision by a factor
ranging from 122 to 257 percent when the desired recall falls between 0.5 and 1. In
comparison with manual heuristics, the precision and the recall of InfoDiscoverer are
greater than 0.956.
In the paper, we propose a system, composed of LAMIS and InfoDiscoverer,
to mine informative structures and contents from Web sites. Given an entrance URL
of a Web site, our system is able to crawl the site, parse its pages, analyze entropies of
features, links and contents, and mine the informative structures and contents of the
site. With a fully automatic flow, the system is useful to serve as a preprocessor of
search engines and Web miners (information extraction systems). The system can also
be applied to various Web applications with its capability of reducing a complex Web
site structure to a concise one with informative contents. While performing the
LAMIS experiments, we found that HITS-related algorithms are not good enough for
mining informative structures, even when link entropy is considered. We therefore
developed and investigated several enhancement techniques; the results show that
LAMIS-LN-CB-HR-TW was able to achieve the optimal solution in most cases. An
R-Precision of 0.82 indicates that the enhanced LAMIS performs very well in mining
the informative structure of a Web site. The result of InfoDiscoverer
also shows that both recall and precision rates are larger than 0.956, which is very
close to the hand-coding result. In the future, we are interested in the further
enhancement of our system. For example, the concept of generalization/specialization
can be applied to find the optimal granularity of blocks to be utilized in LAMIS and
InfoDiscoverer. Also, our proposed mechanisms are significant for and are worthy of
further deployment in several Web domain-specific studies, including those for Web
miners and intelligent agents. These are matters of future research.
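The entropy intuition behind InfoDiscoverer can be illustrated with a short sketch: a term spread evenly across all pages of a site (navigation boilerplate such as "Home" or "Contact") has high normalized entropy, while a term concentrated in one article is informative. The page texts and the exact formula below are illustrative assumptions, not the paper's implementation.

```python
# A hedged sketch of the entropy idea: terms distributed uniformly across
# pages are likely template noise; concentrated terms carry information.
# The page texts are invented.

import math

def term_entropy(term, pages):
    counts = [page.count(term) for page in pages]
    total = sum(counts)
    if total == 0 or len(pages) < 2:
        return 0.0
    h = 0.0
    for c in counts:
        if c:
            p = c / total
            h -= p * math.log2(p)
    return h / math.log2(len(pages))   # normalize to [0, 1]

pages = [
    "Home Contact quarterly earnings report",
    "Home Contact new stadium opens downtown",
    "Home Contact election results announced",
]
# navigation terms appear on every page -> entropy near 1 (boilerplate)
assert term_entropy("Home", pages) > 0.9
# a content word appears on one page only -> entropy 0 (informative)
assert term_entropy("earnings", pages) == 0.0
```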


2.5. Automatic Template Extraction from Heterogeneous Web Pages

Automatically extracting structured information from unstructured and/or
semi-structured machine-readable documents plays a major role nowadays, since most
websites use common templates populated with contents to achieve good publishing
productivity, and the WWW is the major resource for extracting such information. In
recent days, template detection techniques have received a lot of attention for
improving different aspects of Web applications, such as search engine performance
and the clustering and classification of web documents, because irrelevant template
terms degrade the performance and accuracy of Web applications for machines.
In this paper, we present novel algorithms for extracting templates from a large
number of web documents which are generated from heterogeneous templates. Using
the similarity of the underlying template structures in the documents, we cluster the
web documents so that the template for each cluster is extracted simultaneously. On
real-life data sets, the efficiency of our algorithms can be considered among the best
of template detection algorithms.

We introduced a novel approach to template detection from heterogeneous
web documents. We employed the MDL principle to manage the unknown number of
clusters and to select a good partitioning from all possible partitions of documents,
and then introduced our extended MinHash technique to speed up the clustering
process. Experimental results with real-life data sets confirmed the effectiveness of our
algorithms.
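The MinHash technique the authors extend can be sketched in a few lines: documents are represented as sets of (invented) HTML path strings, and agreement between min-hash signatures estimates Jaccard similarity, so documents sharing a template produce similar signatures. This is a generic MinHash sketch, not the authors' extended version.

```python
# A minimal MinHash sketch for clustering documents by template paths.
# The documents' HTML path sets are invented.

import hashlib

def minhash_signature(items, num_hashes=32):
    sig = []
    for seed in range(num_hashes):
        sig.append(min(
            int(hashlib.md5(f"{seed}:{item}".encode()).hexdigest(), 16)
            for item in items
        ))
    return sig

def estimated_jaccard(sig_a, sig_b):
    # fraction of hash slots where the two signatures agree
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)

doc1 = {"html/body/div.nav", "html/body/div.content", "html/body/div.footer"}
doc2 = {"html/body/div.nav", "html/body/div.content", "html/body/div.footer",
        "html/body/div.ads"}
doc3 = {"html/table/tr", "html/table/td"}

sig1, sig2, sig3 = (minhash_signature(d) for d in (doc1, doc2, doc3))
# documents sharing a template look far more similar than unrelated ones
assert estimated_jaccard(sig1, sig2) > estimated_jaccard(sig1, sig3)
```

In a clustering pass, documents whose estimated similarity exceeds a threshold would be merged into one template cluster, which is much cheaper than comparing full path sets pairwise.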

2.6. Toward an Adaptive Web: The State of the Art and Science
As the World Wide Web matures, it makes leaps forward in both size and
complexity. In this expanding environment, the needs and interests of individual users
become buried under the sheer weight of possible viewing choices. To counter this,
there has been a rise in research on adaptive websites: a combination of data mining,
machine learning, user modeling, Human-Computer Interaction (HCI), optimization
theory, and graph theory that seeks to sift through the tides of possible pages to
provide users with a high-quality stream of information. This paper provides a
description of adaptive website research, including the goals aimed at, the challenges
discovered and the approaches to solutions.

This paper has presented an overview of the goals, challenges, approaches and
implementations that surround adaptive website research. This work is meant to
provide an introduction to many of the most important difficulties, characteristics and
solutions that have occurred to date, but is not intended to be an exhaustive overview.
Readers are directed to for additional good overviews of the topic. The title of this
paper suggests the nature of the problem as both an art and a science. While
considerable research has been performed into studying how users behave in a web
environment, the relationships between pages, how to rate and rank suggestions to
users, etc., there is still considerable art involved in producing effective adaptive web
systems, from the choice of particular parameters of clustering algorithms to the
measuring of the effectiveness of a particular adaptation, for example. Nonetheless,
substantial work has been done to explore the problem from three basic directions:
understanding users, understanding websites, and understanding information. Most
approaches seem to examine the problem from one of these directions; some examine
it from two of these directions; but few (if any) consider all three directions. We are in
the midst of the early stages of the problem, where there is primarily analysis being
performed, with the problem not yet sufficiently explored to allow broader synthesis
to occur. It is expected, however, that for a substantial portion of time there will
remain a large proportion of the problem which can only be solved via the sound and
steady application of considerable art, backed by the driving solidity of science.

2.7. Web Mining for Web Personalization

Web personalization is the process of customizing a Web site to the needs of
specific users, taking advantage of the knowledge acquired from the analysis of the
user’s navigational behavior (usage data) in correlation with other information
collected in the Web context, namely structure, content and user profile data. Due to
the explosive growth of the Web, the domain of Web personalization has gained great
momentum both in the research and the commercial area. In this paper we present a
survey of the use of Web mining for Web personalization. More specifically, we
introduce the modules that comprise a Web personalization system, emphasizing
the Web usage mining module. A review of the most common methods that are used
as well as technical issues that occur is given, along with a brief overview of the most
popular tools and applications available from software vendors. Moreover, the most
important research initiatives in the Web usage mining and personalization area are
presented.

Web personalization is the process of customizing the content and the
structure of a Web site to the specific and individual needs of each user, without
requiring them to ask for it explicitly. This can be achieved by taking advantage
of the user's navigational behavior, as revealed through the processing of the
Web usage logs, as well as the user's characteristics and interests. Such information
can be further analyzed in association with the content of a Web site, resulting in
improved system performance, user retention, and/or site modification.

The overall process of Web personalization consists of five modules, namely:
user profiling, log analysis and Web usage mining, information acquisition, content
management, and Web site publishing. User profiling is the process of gathering
information specific to each visitor of a Web site, either implicitly, using the
information hidden in the Web logs or technologies such as cookies, or explicitly,
using registration forms, questionnaires, etc. Such information can be demographic,
personal, or even information concerning the user's navigational behavior. However,
many of the methods used in user profiling raise privacy issues concerning the
disclosure of the user's personal data and are therefore not recommended. Since user
profiling seems essential in the process of Web personalization, a legal and more
accurate way of acquiring such information is needed. P3P is an emerging standard
recommended by W3C that provides a technical mechanism that enables users to be
informed about privacy policies before they release personal information and gives
them control over the disclosure of their personal data.

The main component of a Web personalization system is the usage miner. Log
analysis and Web usage mining is the procedure where the information stored in the
Web server logs is processed by applying statistical and data mining techniques, such
as clustering, association rules discovery, classification and sequential pattern
discovery, in order to reveal useful patterns that can be further analyzed. Such
patterns differ according to the method and the input data used, and can be user and
page clusters, usage patterns and correlations between user groups and Web pages.
Those patterns can then be stored in a database or a data cube and query mechanisms
or OLAP operations can be performed in combination with visualization techniques.
The most important phase of Web usage mining is data filtering and pre-processing.
In that phase, Web log data should be cleaned or enhanced, and user, session and page
view identification should be performed. Web personalization is a domain that has
been recently gaining great momentum not only in the research area, where many
research teams have addressed this problem from different perspectives, but also in
the industrial area, where there exists a variety of tools and applications addressing
one or more modules of the personalization process. Enterprises expect that by
exploiting the information hidden in their Web server logs they could discover the
interactions between their Web site visitors and the products offered through their
Web site. Using such information, they can optimize their site in order to increase
sales and ensure customer retention. Apart from Web usage mining, user profiling
techniques are also employed in order to form a complete customer profile. Lately,
there has been an effort to incorporate Web content in the recommendation process in
order to enhance the effectiveness of personalization. However, a solution that
efficiently combines the techniques used in user profiling, Web usage mining, content
acquisition and management, as well as Web publishing, has not yet been proposed.

2.8. From User Access Patterns to Dynamic Hypertext Linking

This paper describes an approach for automatically classifying visitors of a
web site according to their access patterns. User access logs are examined to discover
clusters of users that exhibit similar information needs; e.g., users that access similar
pages. This may result in a better understanding of how users visit the site, and lead to
an improved organization of the hypertext documents for navigational convenience.
More interestingly, based on what categories an individual user falls into, we can
dynamically suggest links for them to navigate. In this paper, we describe the overall
design of a system that implements these ideas, and elaborate on the preprocessing,
clustering, and dynamic link suggestion tasks. We present some experimental results
generated by analyzing the access log of a web site.

We have presented a system design that facilitates the analysis of past user
access patterns to discover common user access behavior. This information can then
be used to improve the static hypertext structure, or to dynamically insert links to web
pages. We have implemented the offline module and the session-logging web server,
and started work on the online module. We are distributing the offline module as
public domain software:
ftp://www-db.stanford.edu/pub/analog/analog.0.1.tar.Z

Web administrators may find the tool useful for analyzing user access logs
generated by an NCSA HTTP server. Our experimental results obtained by analyzing real
user access logs show that indeed clusters of user access patterns exist. Further, some
of these clusters are not apparent from the physical linkage of the pages, and thus
would not be identified without looking at the logs. For future work, we will look
into how to capture the order of accesses to better represent user interests, the use of
semantic information to model user interests, the impact of different clustering
algorithms on the quality of the cluster information, and the effectiveness of the
suggestions given to the users (i.e., we need to evaluate whether the users find the
suggestions useful).
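The clustering of visitors by access pattern can be sketched with a simple leader-clustering pass over the Jaccard similarity of visited-page sets. The paper's actual clustering method may differ; the access data below is invented.

```python
# A hedged sketch: represent each user as the set of pages visited and
# group users whose page sets overlap strongly. The data is invented.

def jaccard(a, b):
    return len(a & b) / len(a | b)

def leader_cluster(user_pages, threshold=0.5):
    clusters = []   # list of (leader page set, member users)
    for user, pages in user_pages.items():
        for leader, members in clusters:
            if jaccard(pages, leader) >= threshold:
                members.append(user)    # join the first close-enough cluster
                break
        else:
            clusters.append((set(pages), [user]))   # become a new leader
    return [members for _, members in clusters]

user_pages = {
    "u1": {"/sports", "/scores"},
    "u2": {"/sports", "/scores", "/teams"},
    "u3": {"/finance", "/markets"},
}
clusters = leader_cluster(user_pages)
assert ["u1", "u2"] in clusters and ["u3"] in clusters
```

A cluster like `["u1", "u2"]` is exactly the kind of group to which the system could dynamically suggest links that other members of the cluster visited.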

2.9. A Hybrid Web Personalization Model Based on Site Connectivity

Web usage mining has been used effectively as an underlying mechanism for
Web personalization and recommender systems. A variety of recommendation
frameworks have been proposed, including some based on non-sequential models,
such as association rules and clusters, and some based on sequential models, such as
sequential or navigational patterns. Our recent studies have suggested that the
structural characteristics of Web sites, such as the site topology and the degree of
connectivity, have a significant impact on the relative performance of
recommendation models based on association rules, contiguous and non-contiguous
sequential patterns. In this paper, we present a framework for a hybrid Web
personalization system that can intelligently switch among different recommendation
models, based on the degree of connectivity and the current location of the user within
the site. We have conducted a detailed evaluation based on real Web usage data from
three sites with different structural characteristics.
Our results show that the hybrid system selects less constrained models such
as frequent item sets when the user is navigating portions of the site with a higher
degree of connectivity, while sequential recommendation models are chosen for
deeper navigational depths and lower degrees of connectivity. The comparative
evaluation also indicates that the overall performance of the hybrid system in terms of
precision and coverage is better than that of recommendation systems based on any of the
individual models.

Generally speaking, sequential recommendation models (such as those based
on sequential navigational patterns) produce fairly accurate recommendations, but
such models do not generate enough recommendations and often result in
unacceptably low coverage. In contrast, recommendation models based on less
constrained patterns, such as clustering and association rules, can capture a broader
range of recommendations, but they often lack accuracy when compared to sequential
models. Our previous studies have shown that the performance of each
recommendation model depends, in part, on the structural characteristics of the Web
site and the degree of hyperlink connectivity, in particular. In this paper, we have
presented a framework for a hybrid Web personalization system that can
intelligently switch among different recommendation models based on a localized
connectivity measure. Our studies show that the hybrid system selects less
constrained models such as frequent item sets when the user is navigating portions of
the site with a higher degree of connectivity, and it selects sequential recommendation
models for deeper navigational paths and lower degrees of connectivity. Overall, the
results presented here show that the hybrid model can be used to develop a more
effective and intelligent personalization framework when compared with any of the
individual sequential or non-sequential models. In particular, our hybrid recommender
system can generate recommendations that are not only accurate but also cover a
wider range.
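The switching rule at the heart of the hybrid framework can be pictured as a dispatch on the connectivity of the user's current location. The two recommenders, the site topology, and the threshold below are invented placeholders, not the paper's actual components.

```python
# A schematic of the hybrid switching rule: highly connected pages get the
# broad-coverage itemset recommender; sparse, deep pages get the more
# precise sequential recommender. All values below are invented.

def recommend(current_page, out_degree, itemset_model, sequential_model,
              connectivity_threshold=5):
    if out_degree(current_page) >= connectivity_threshold:
        return itemset_model(current_page)     # broad coverage
    return sequential_model(current_page)      # higher precision

# toy stand-ins for the site topology and the two models
links = {"/hub": ["/a", "/b", "/c", "/d", "/e", "/f"], "/deep/article": ["/next"]}

def out_degree(page):
    return len(links.get(page, []))

def itemset_model(page):
    return ["itemset-rec"]

def sequential_model(page):
    return ["sequential-rec"]

assert recommend("/hub", out_degree, itemset_model, sequential_model) == ["itemset-rec"]
assert recommend("/deep/article", out_degree, itemset_model, sequential_model) == ["sequential-rec"]
```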

2.10. Data Mining for Web Personalization

In this chapter we present an overview of the Web personalization process viewed
as an application of data mining requiring support for all the phases of a typical data
mining cycle. These phases include data collection and preprocessing, pattern
discovery and evaluation, and finally applying the discovered knowledge in real-time
to mediate between the user and the Web. This view of the personalization process
provides added flexibility in leveraging multiple data sources and in effectively using
the discovered models in an automatic personalization system. The chapter provides a
detailed discussion of a host of activities and techniques used at different stages of
this cycle, including the preprocessing and integration of data from multiple sources,
as well as pattern discovery techniques that are typically applied to this data. We
consider a number of classes of data mining algorithms used particularly for Web
personalization, including techniques based on clustering, association rule discovery,
sequential pattern mining, Markov models, and probabilistic mixture and hidden
(latent) variable models. Finally, we discuss hybrid data mining frameworks that
leverage data from a variety of channels to provide more effective personalization
solutions.

In this chapter we have presented a comprehensive discussion of the Web
personalization process, viewed as an application of data mining which must therefore
be supported during the various phases of a typical data mining cycle. We have
discussed a host of activities and techniques used at different stages of this cycle,
including the preprocessing and integration of data from multiple sources, and pattern
discovery techniques that are applied to this data. We have also presented a number of
specific recommendation algorithms for combining the discovered knowledge with
the current status of a user’s activity in a Web site to provide personalized content to a
user. The approaches we have detailed show how pattern discovery techniques such
as clustering, association rule mining, and sequential pattern discovery, and
probabilistic models performed on Web usage and collaborative data, can be leveraged
effectively as an integrated part of a Web personalization system. While research
into personalization has led to a number of effective algorithms and commercial
success stories, a number of challenges and open questions still remain.

A key part of the personalization process is the generation of user models. The
most commonly used user models are still rather simplistic, representing the user as a
vector of ratings or using a set of keywords. Even where more multi-dimensional or
ontological information has been available, the data is generally mapped onto a single
user-item table which is more amenable for most data mining and machine learning
techniques. To provide the most useful and effective recommendations,
personalization systems need to incorporate more expressive models. Some of the
discussion on the integration of semantic knowledge and technologies in the mining
process suggests that some strides have been made in this direction. However, most of
this work has not, as of yet, resulted in true and tested approaches that can become the
basis of the next generation of personalization systems. Another important and difficult
challenge is the modeling of user context. In particular, the profiles commonly used
today lack the ability to model user context and dynamics. Users access different
items for different reasons and under different contexts. The modeling of context and
its use within recommendation generation needs to be explored further. Also, user
interests and needs change with time. Identifying these changes and adapting to them
is a key goal of personalization. However, very little research effort has been
expended on the evolution of user patterns over time and their impact on
recommendations. This is in part due to the trade-offs between expressiveness of the
profiles and scalability with respect to the number of active users. Solutions to these
important challenges are likely to lead to the creation of the next generation of more
effective and useful Web personalization and recommender systems that can be
deployed in increasingly more complex Web-based environments.
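One of the model classes listed above, a first-order Markov model, can be sketched in miniature: estimate page-to-page transition counts from session sequences, then recommend the most frequent successor of the user's current page. The sessions below are invented.

```python
# A minimal first-order Markov recommender trained on (invented) sessions.

from collections import defaultdict

def train_markov(sessions):
    counts = defaultdict(lambda: defaultdict(int))
    for session in sessions:
        for cur, nxt in zip(session, session[1:]):
            counts[cur][nxt] += 1       # count observed page transitions
    return counts

def recommend_next(counts, page):
    followers = counts.get(page)
    if not followers:
        return None                     # page never seen as a predecessor
    return max(followers, key=followers.get)

sessions = [
    ["/home", "/products", "/cart"],
    ["/home", "/products", "/specs"],
    ["/home", "/about"],
    ["/search", "/products", "/cart"],
]
model = train_markov(sessions)
assert recommend_next(model, "/home") == "/products"   # 2 of 3 transitions
assert recommend_next(model, "/products") == "/cart"   # 2 of 3 transitions
```

Higher-order Markov models condition on the last few pages instead of one, trading coverage for precision, which mirrors the sequential-versus-itemset trade-off discussed in section 2.9.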
