
Web mining

Abstract:

Two important and active areas of current research are data mining and the World Wide Web. A natural
combination of the two areas, sometimes referred to as Web mining, has been the focus of several recent
research projects and papers. As with any emerging research area there is no established vocabulary,
leading to confusion when comparing research efforts. Different terms for the same concept or different
definitions being attached to the same word are commonplace. The term Web mining has been used in
two distinct ways. The first, which is referred to as Web content mining in this paper, describes the
process of information or resource discovery from millions of sources across the World Wide Web. The
second, which we call Web usage mining, is the process of mining Web access logs or other user
information to discover user browsing and access patterns on one or more Web localities. In this paper we define
Web mining and, in particular, present an overview of the various research issues, techniques, and
development efforts in Web content mining and Web usage mining. We focus mainly on the problems
and proposed techniques associated with Web usage mining as an emerging research area. We also
present a general architecture for Web usage mining and briefly describe the WEBMINER, a system
based on the proposed architecture. We conclude this paper by listing issues that need the attention of the
research community.
Introduction

Web mining is the application of data mining techniques to discover patterns from the Web.
According to the analysis target, web mining can be divided into three different types: Web
usage mining, Web content mining and Web structure mining.

Web mining is a technique that applies data mining techniques to analyse different sources of
data on the web (such as web usage data, web content data, and web structural data). With the
rapid growth of the World Wide Web, it has become a very active and popular topic in web
research. E-commerce and E-services have been claimed to be killer applications for web
mining, and web mining now plays an important role in helping E-commerce websites and E-
services understand how their sites and services are used and in providing better services
for their customers and users.

Recently, much research has focused on developing new web mining techniques and
algorithms, or on improving traditional mining techniques. However, these techniques and
algorithms are of little value if they are never applied in real application environments.
It is therefore an appropriate time to shift the research focus to application areas such as
E-commerce and E-services.

With the explosive growth of information sources available on the World Wide Web, it has become
increasingly necessary for users to utilize automated tools to find the desired information resources, and
to track and analyze their usage patterns. These factors give rise to the necessity of creating server side
and client side intelligent systems that can effectively mine for knowledge. Web mining can be broadly
defined as the discovery and analysis of useful information from the World Wide Web. This describes
the automatic search of information resources available on line, i.e. Web content mining, and the
discovery of user access patterns from Web servers, i.e., Web usage mining.

What is Web Mining?

Web Mining is the extraction of interesting and potentially useful patterns and implicit information from
artifacts or activity related to the World Wide Web. There are roughly three knowledge discovery
domains that pertain to web mining: Web Content Mining, Web Structure Mining, and Web Usage
Mining. Web content mining is the process of extracting knowledge from the content of documents or
their descriptions. Web document text mining, and resource discovery based on concept indexing or
agent-based technology, may also fall into this category. Web structure mining is the process of inferring
knowledge from the organization of the World Wide Web and the links between references and referents in the
Web. Finally, web usage mining, also known as Web Log Mining, is the process of extracting interesting
patterns from web access logs.

Web Content Mining

Web content mining is an automatic process that goes beyond keyword extraction.
Since the content of a text document carries no machine-readable semantics, some approaches have
suggested restructuring document content into a representation that can be exploited by machines.
The usual approach to exploiting known structure in documents is to use wrappers to map documents to
some data model. Techniques using lexicons for content interpretation are yet to come. There are two
groups of web content mining strategies: those that directly mine the content of documents and those
that improve on the content search of other tools such as search engines.
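To make the wrapper idea concrete, here is a minimal sketch in Python, assuming a hypothetical page layout in which the product name sits in an h1 element and the price in a span with class "price"; real wrappers are built per site and per target data model.

    # A minimal "wrapper" sketch that maps an HTML document to a simple
    # data model, assuming a hypothetical page layout (h1 = name,
    # span class="price" = price).
    from html.parser import HTMLParser

    class ProductWrapper(HTMLParser):
        def __init__(self):
            super().__init__()
            self.record = {"name": None, "price": None}
            self._field = None          # field currently being captured

        def handle_starttag(self, tag, attrs):
            attrs = dict(attrs)
            if tag == "h1":
                self._field = "name"
            elif tag == "span" and attrs.get("class") == "price":
                self._field = "price"

        def handle_data(self, data):
            if self._field and data.strip():
                self.record[self._field] = data.strip()
                self._field = None

    page = '<html><body><h1>Blue Widget</h1><span class="price">9.99</span></body></html>'
    wrapper = ProductWrapper()
    wrapper.feed(page)
    print(wrapper.record)   # {'name': 'Blue Widget', 'price': '9.99'}

Once a whole site is mapped to such records, ordinary data mining techniques can be applied to the structured result.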

Web Structure Mining

The World Wide Web can reveal more information than just the information
contained in documents. For example, links pointing to a document indicate the popularity of the
document, while links coming out of a document indicate the richness or perhaps the variety of topics
covered in the document. This can be compared to bibliographical citations: when a paper is cited often,
it is likely to be important. The PageRank and CLEVER methods take advantage of the information
conveyed by the links to find pertinent web pages. By means of counters, higher levels accumulate the
number of artifacts subsumed by the concepts they hold; counters of hyperlinks into and out of documents
retrace the structure of the web artifacts summarized.
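As an illustration of how link counts translate into importance, the following is a minimal sketch of the PageRank iteration on a toy graph; the damping factor of 0.85 and the fixed iteration count are conventional textbook choices, not values taken from the methods cited above.

    # A minimal PageRank sketch on a toy link graph given as adjacency
    # lists of outgoing links.
    def pagerank(graph, damping=0.85, iterations=50):
        n = len(graph)
        rank = {page: 1.0 / n for page in graph}
        for _ in range(iterations):
            new_rank = {page: (1.0 - damping) / n for page in graph}
            for page, links in graph.items():
                if links:                   # distribute rank over out-links
                    share = damping * rank[page] / len(links)
                    for target in links:
                        new_rank[target] += share
                else:                       # dangling page: spread evenly
                    for target in new_rank:
                        new_rank[target] += damping * rank[page] / n
            rank = new_rank
        return rank

    toy_web = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
    print(pagerank(toy_web))   # C accumulates the most rank: it is "cited" most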

Web Usage Mining

Web servers record and accumulate data about user interactions whenever requests
for resources are received. Analyzing the web access logs of different web sites can help in understanding
user behaviour and the web structure, thereby improving the design of this colossal collection of
resources. There are two main tendencies in Web Usage Mining, driven by the applications of the
discoveries: general access pattern tracking and customized usage tracking. General access pattern
tracking analyzes the web logs to understand access patterns and trends. These analyses can shed light
on better structure and grouping of resource providers. Many web analysis tools exist, but they are
limited and usually unsatisfactory. We have designed a web log data mining tool, WebLogMiner, and
proposed techniques for using data mining and OnLine Analytical Processing (OLAP) on treated and
transformed web access files. Applying data mining techniques to access logs unveils interesting access
patterns that can be used to restructure sites into more efficient groupings, pinpoint effective advertising
locations, and target specific users with specific ads. Customized usage tracking analyzes individual
trends. Its purpose is to customize web sites to users. The information displayed, the depth of the site
structure and the format of the resources can all be dynamically customized for each user over time,
based on their access patterns.

While it is encouraging and exciting to see the various potential applications of web log file analysis, it is
important to know that the success of such applications depends on what and how much valid and
reliable knowledge one can discover from the large raw log data. Current web servers store limited
information about accesses. Some scripts custom-tailored for particular sites may store additional
information. However, for effective web usage mining, an important cleaning and data
transformation step may be needed before analysis.
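As a sketch of what such a cleaning step can look like, the following Python fragment parses Common Log Format entries and discards requests for embedded resources and failed accesses; the ignored extensions and the format assumption are illustrative choices, not requirements from the text.

    # A minimal cleaning sketch: parse Common Log Format lines and drop
    # requests for embedded resources (images, stylesheets) and failed
    # accesses, which do not reflect deliberate page views.
    import re

    CLF = re.compile(r'(\S+) \S+ \S+ \[([^\]]+)\] "(\S+) (\S+) [^"]*" (\d{3}) \S+')
    IGNORED = (".gif", ".jpg", ".png", ".css", ".js")

    def clean_log(lines):
        for line in lines:
            m = CLF.match(line)
            if not m:
                continue                  # malformed entry
            host, timestamp, method, url, status = m.groups()
            if url.lower().endswith(IGNORED) or status != "200":
                continue                  # resource request or failed access
            yield {"host": host, "time": timestamp, "url": url}

    sample = ['127.0.0.1 - - [10/Oct/2000:13:55:36 -0700] "GET /index.html HTTP/1.0" 200 2326']
    print(list(clean_log(sample)))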

Web usage mining


Web usage mining is the process of finding out what users are looking for on the Internet. Some users might
be looking only at textual data, whereas others might want multimedia data. Web usage
mining also helps find the search patterns of a particular group of people belonging to a particular
region.

Web structure mining

Web structure mining is the process of using graph theory to analyse the node and connection structure
of a web site. According to the type of web structural data, web structure mining can be divided into
two kinds.

The first kind of web structure mining is extracting patterns from hyperlinks in the web. A hyperlink is a
structural component that connects a web page to a different location. The other kind of web
structure mining is mining the document structure: it uses the tree-like structure of the
HTML (Hyper Text Markup Language) or XML (eXtensible Markup Language) tags within the
web page to analyse and describe it.
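A minimal sketch of this second kind of structure mining follows, using Python's standard-library parser: it walks the tag tree of an HTML page (assumed well-formed) and records the nesting depth of each tag, a crude approximation of the tree-like structure described above.

    # Walk the HTML tag tree and record each tag's nesting depth.
    from html.parser import HTMLParser

    class TreeProfiler(HTMLParser):
        VOID = {"br", "img", "hr", "meta", "link", "input"}  # no closing tag

        def __init__(self):
            super().__init__()
            self.depth = 0
            self.profile = []            # (depth, tag) pairs in document order

        def handle_starttag(self, tag, attrs):
            self.profile.append((self.depth, tag))
            if tag not in self.VOID:
                self.depth += 1

        def handle_endtag(self, tag):
            if tag not in self.VOID:
                self.depth -= 1

    page = "<html><body><div><p>Hello</p><img src='x.png'></div></body></html>"
    profiler = TreeProfiler()
    profiler.feed(page)
    for depth, tag in profiler.profile:
        print("  " * depth + tag)       # indentation mirrors the tree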

Web mining Pros and Cons

Pros

Web mining has many advantages that make this technology attractive to corporations and
government agencies. It has enabled e-commerce to do personalized
marketing, which eventually results in higher trade volumes. Government agencies are using this
technology to classify threats and fight against terrorism. The predictive capability of mining
applications can benefit society by identifying criminal activities. Companies can establish better
customer relationships by giving customers exactly what they need; they can understand the needs of
the customer better and react to customer needs faster. Companies can find, attract and
retain customers; they can save on production costs by utilizing the acquired insight into customer
requirements; and they can increase profitability through target pricing based on the profiles created. They can
even identify a customer who might defect to a competitor, and try to retain that customer
by providing promotional offers, thus reducing the risk of losing the customer.

Cons
Web mining, the technology itself, does not create issues, but when it is used on data of a
personal nature it can cause concerns. The most criticized ethical issue involving web mining is the
invasion of privacy. Privacy is considered lost when information concerning an individual is obtained,
used, or disseminated, especially if this occurs without their knowledge or consent [1]. The obtained data
is analyzed and clustered to form profiles; the data is made anonymous before clustering so
that no individual can be linked directly to a profile. But usually the group profiles are used as if they
were personal profiles [1]. Thus these applications de-individualize users by judging them by their mouse
clicks. De-individualization can be defined as a tendency to judge and treat people on the basis of
group characteristics instead of on their own individual characteristics and merits. Another important
concern is that companies collecting data for a specific purpose might use the data for a totally
different purpose, which essentially violates the user's interests. The growing trend of selling personal
data as a commodity encourages website owners to trade personal data obtained from their sites. This
trend has increased the amount of data being captured and traded, increasing the likelihood of one's
privacy being invaded. The companies that buy the data are obliged to make it anonymous, and these
companies are considered authors of any specific release of mining patterns. They are legally
responsible for the contents of the release; any inaccuracies in the release can result in serious lawsuits,
but there is no law preventing them from trading the data. Some mining algorithms might use
controversial attributes like sex, race, religion, or sexual orientation to categorize individuals. These
practices might be against anti-discrimination legislation [2]. The applications make it hard to identify
the use of such controversial attributes, and there is no strong rule against the usage of such algorithms
with such attributes. This process could result in the denial of a service or a privilege to an individual based on
race, religion or sexual orientation; right now this situation can only be avoided by the high ethical
standards maintained by the data mining company. The collected data is made anonymous so
that the obtained data and patterns cannot be traced back to an individual. It might look
as if this poses no threat to one's privacy, but in fact much additional information can be inferred by the
application by combining separate pieces of data from the user.

Web Mining Tools

ANGOSS KnowledgeWebminer, finds patterns in web log data.

Anthracite, web mining desktop toolkit (for MacOS X).

Blossom Software, provides data gathering, search engine and Web-site enhancement services,
including bots for retrieving data from the Web

ECONDA, German data mining and web analytics software company, spin-off from U. of Karlsruhe,
specializing in analytics tools for online shops and web portals.

IBM SurfAid, technology and services for analyzing and interpreting the behavior of Web site visitors.

ICEtracks  provides visitor identification, segmentation and reporting capabilities, with advanced data
filtering and over 300 reporting variations.

ISYS, a fast search suite that finds information across multiple file formats and languages.

KnowleSys, providing Web Data Extraction service on BlueWhale software system.


Metronome, secure web analytics - Better data than page tags without the privacy issues.

net.Genesis, providing net.Analysis for ad analysis and the CartSmarts "visitor-centric" package for
segmented analysis of Internet browsers and buyers.

NETMINING Reporter, an enterprise-level tool for analysis of e-commerce sites.

Redwood, an open source ASP based web log mining tool which uses Java and EJB technology.

SAS e-Discovery, e-business analytics package.

SearchExtract!, offers visual techniques and scripting for extracting and reusing web data.

Site Catalyst from omniture.com, powerful hosted web analytics solution.

STATISTICA Data Miner

Vignette V/5  E-business platform and applications.

VisiStat 2.0, a hosted analytics solution.

Visual Web Task  parses Web pages and extracts data and files using rich parsing patterns.

Web Trends, personal Web server log analysis tools, and enterprise-level CommerceTrends (TM)
platform

Web Usage Mining Consulting, E-commerce Data Analysis and Data Mining Experts

Web Analytics Demystified site, portal for web analytics.

WebQL, for creating turnkey web extraction applications, such as price collector, patent information
aggregator, etc.

WebSideStory, providing real-time analysis of Web-site traffic and e-commerce activity

WhiteCross Systems, provides ASP services in customer and clickstream analysis in UK and USA,
including clustering, profiling and segmentation.
Web mining aims to discover interesting patterns in the structure, the contents and the
usage of web sites. An indispensable tool for the webmaster, it has, nevertheless, a
long road ahead in which visualisation plays an important role.

In this issue

E-commerce is already an established and vigorous reality that suffers from intense
competition. Nevertheless, many websites face the double challenge of delivering
sales-critical information to their customers while gathering information about customer
preferences in order to optimise the business. In this short series of articles we'll
see how information visualisation helps with those challenges.

The CRM cycle. The arrows show the three basic stages of the cycle whereas the outer ring indicates
the operations that can benefit from Information Visualisation.  
Source: Adapted from Ganapathy et al. by the author.

Recently I found out that electronic commerce doubled its transaction volume in Spain during
2004, with a total amount of 890 M€. Although it is only a fraction of global commerce, its relevance is
increasing in many countries.
In order to favour its development it is necessary to find tools that enhance the capabilities
of customer relationship management (also known as CRM).

Ganapathy and others propose a framework and a model for the CRM cycle that is of interest
when considering how information visualisation can help in this field.

The model is made up of a cycle with three main stages: 

Customer Attraction. This stage intends to attract the customer towards the website in order to
expose the client to our online offers. In this part of the cycle the customer must be able to find out
about our products or services in an easy and simple way, so that he/she can

find easily what he/she is looking for in our site.

browse the available product information.

Here information visualisation can be very useful in order to

visualise search results

show product-related information as well as the product itself.

Customer Acquisition. This stage has the important mission of converting the visitor into a buyer of
our products. In order for this to happen the potential customer has to be able to:

evaluate the product, where possible visualising it as closely as he/she would in a traditional shop

compare the product with other similar options.

select the product that best fits his/her needs, eventually purchasing it.

In this case information visualisation could help us by enabling online evaluation, comparison and
selection by means of visualising suitable comparisons between products of similar features and/or
prices.

Customer Analysis. In this stage the data gathered about the different actions and transactions
performed by the customers are analysed in different ways, with the goal of understanding and making
better use of the customers'

buying patterns

navigation patterns through the website

problems they suffer when trying to find information, products or both.

In the words of the authors of the article mentioned above, for this analysis the visualisation of
customer "clicks" is extremely useful: understood as the analysis of online transactions, it enables the
comprehension of the customer, their buying preferences, the "hottest" pages and those that mean
nothing to customers.

2) Visualisation of search results

This is one of the areas where the interest is most universal, not only because of its
importance regarding CRM but also in the majority of other fields. Quickly finding what interests us
amongst the overwhelming amount of data and information available is one of the most
important challenges we face in this digital era.

The usual search techniques consist of the introduction of a query in the form of keywords or free
language. Search algorithms typically return a long list of results, with a higher or lower level of success
depending on the search engine in question. As we have already seen in other issues of this
magazine, there are different attempts to show the results in a visual
way, like Grokker, KartOO or Flamenco.

Among those that we haven't yet reviewed at Infovis.net we find Girafa, a visual tool that uses the lists
that other search engines (like Google or Yahoo, among many others) produce, in order
to show the results as screenshots of the URLs returned by the query. Girafa provides a paid
web service that generates the thumbnails and can be integrated with
your web search engine. You can download a free demonstrator in the form of a toolbar for Internet
Explorer that shows the images on the left-hand side of the browser.

An especially interesting case is the 3D Model Search Engine created by the Princeton Shape
Retrieval and Analysis Group of Princeton University. This search engine has a database with
36,000 3D models. In order to find a particular shape you can operate in two mutually non-exclusive
ways.

On one hand we have the typical textual keyword query, for example entering the word "umbrella";
on the other hand we can draw with the mouse three different approximate views (sketches)
of the shape we are looking for. With them this powerful search engine returns a series of matching
shapes corresponding to all the models that match, to one degree or another, the drawings that
served as specifications.

There are more initiatives regarding presentation of information retrieval that we will review in future
issues. The important thing is that  it appears to be clear that visualisation can add another
dimension when we talk about finding the results  that are relevant for us. In my opinion it's
also clear that we are still a long way from having a widely accepted system with a powerful impact in
the way we see the results of our search. But everything will come in its due time.

Presentation of product-related information

When we visit a real-world shop we have the opportunity to browse the shelves, and to touch and look
at the products the way we want. We can even try out their functionality. On the contrary, in a virtual
shop one of the main problems is that we can't touch, and sometimes we can't even see, the
product we want to buy. The information is limited to some written data and maybe some
photographs.

Again visualisation has considerable potential in this field. Many e-commerce sites already incorporate
images of the products seen from different angles. The most innovative ones use 3D visualisation
techniques that allow you to see the product in 360º views covering all possible directions.
They typically use technologies like QTVR (QuickTime Virtual Reality), VRML (Virtual Reality
Modeling Language) or other similar tools that represent virtual worlds. We have seen them in use for
navigating virtual cities, but it is probably in the presentation of products where you can
get the best out of them.

These 3D technologies, when applied to the Web, are generically called Web3D. There is a consortium
to promote the use of these technologies called, in a flash of originality, the Web3D consortium.
There is also an ISO standard called X3D that defines a runtime system and a networked 3D
application distribution mechanism compatible with the XML specifications.

Virtue3D Room Designer: In the image you can see a chair selected from a chair catalogue and then
placed in the room. The white lines surrounding the chair indicate that it is selected, so that we can
move and/or rotate it in order to place it properly, according to our will.
Source: Screenshot as it can be seen on the Internet, by the author.

Virtue3D Room Designer: The result of selecting and placing several pieces of furniture in the
selected room. Mouse buttons allow the user to rotate and move the objects throughout the room.
Source: Screenshot as it can be seen on the Internet, by the author.

An example of this is Virtue3D Room Designer, a web-based application created by Virtue3D for
the furniture industry that allows the user to see a 3D furniture catalogue, changing the
perspective with which we see the selected item. We can zoom in or out as if it were very close or far
away. We can also select a room from several different ones and then place different pieces of
furniture, selecting them from the catalogue and placing them in the room in the location and
orientation that best suits our needs.
Mouse buttons allow you to move within the room (left button) or look around you and
zoom the view in or out (right button). The same scheme applies to a particular item once it is selected
by clicking on top of it: an outline made of white lines appears, and then the left button allows you to
rotate it while the right one moves it around the room.

Another, more pragmatic example, since it is a website already in use, is Itacabox, oriented mainly to
interior decoration professionals wishing to have access to existing catalogues of the
furniture market in order to design environments and decoration online that can then be
shown in 3D to their customers. Interesting. Itacabox uses the Outline3D technology, which enables the
user to define the parameters of a house very easily from a 2D scheme of the floor plan being
constructed. It also allows us to generate the rooms in 3D from the previously entered parameters, so that you
can add furniture from existing catalogue items currently found on the market. Outline3D
uses the Cortona VRML client by ParallelGraphics.

Another application of this type of technology, mixed with traditional 2D
photographs, is the visualisation of homes online by real estate firms like Bostad Uppsala, which
sells houses online with excellent graphic work that allows the potential buyer to get a very realistic
idea of the house he or she is looking for.

Bostad Uppsala: Interior of one of the houses for sale at this Swedish real estate agency. The textual
description is complemented with photographs of all the rooms and 3D diagrams, and also has a 3D
furniture selection system.
Source: Screenshot as it can be seen on the Internet, by the author.

Deer Lodge Centre: Virtual tour through the facilities of this health care centre. In the lower-right
window you can see a 3D view of the rehabilitation room that shows it in 360º, even looking at the
ceiling or the floor by using the arrows that appear on the screen. Source: Screenshot as it can be
seen on the Internet, by the author.

Virtual tours give rise to the possibility of getting an idea of how the places you want to
visit during a trip look before booking it. Systems where you can see the interior
of a hotel, choose a restaurant or decide whether it is worth visiting an old people's
residence are beginning to become quite common in those types of entities. In this sense we can
consider the Deer Lodge Centre of Winnipeg, Canada a clear example.

It is quite easy to find hotels on the Internet offering virtual tours. Showhotel shows you what
can be achieved by using streaming video, 360º panoramas or simple photo carousels in order to
choose and book a hotel room with the sensation of "having been there", but you can find a lot
more on the Internet.

3) Online evaluation, comparison and selection

Once we have solved the problem of visualising the product in its whole range we, as buyers, face
another difficulty: how can we compare the price and performance of the different products we
are interested in? Typically we would have to go to the web pages of each one of them, get the data and
maybe write it down or print the pages to make the comparison. This can be especially tedious when
the formats of the pages are very different or when the products appear grouped instead of being
addressed individually.

One answer to this problem is the shopping bot. PricingCentral classifies a lot of them into
different categories, along with short descriptions of their capabilities. Basically, a shopping bot is a piece
of software that, provided with a precise specification of the articles we wish to buy, searches the Internet for
online shops, finds the prices of the article and returns a list of the results found.
Shopping bots typically specialise in finding the best price for the same product, but they can also be
used to compare products or to find products with certain features, depending on the
degree of versatility or "intelligence" of the bot.
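The comparison step a shopping bot performs once offers are gathered can be sketched very simply; the shops and prices below are made up for illustration, and a real bot would fetch them from each shop's site.

    # A minimal sketch of a shopping bot's comparison step over gathered
    # offers; shop names and prices are invented for illustration.
    offers = [
        {"shop": "ShopA", "product": "Pentax Optio", "price": 199.0},
        {"shop": "ShopB", "product": "Pentax Optio", "price": 184.5},
        {"shop": "ShopC", "product": "Pentax Optio", "price": 210.0},
    ]

    # Sort cheapest first, which is the typical bot output.
    for offer in sorted(offers, key=lambda o: o["price"]):
        print(f'{offer["shop"]}: {offer["price"]:.2f}')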

Other systems act as purchase finders, where we can enter a product, even if it is not very precisely
specified, and get a list of the products that fit the specified features, along with their price range in the
different shops. For example, entering "Pentax" in Shopzilla we get a listing of the digital cameras
of that manufacturer, with photos and a price range for each camera.

The problem, as with other topics already reviewed in this series, is that the results are presented
in textual form, and only in the best cases as a comparison table. This presentation, which is already a
great improvement over doing the work manually, is inefficient as a way of making decisions, especially
when the list has many entries. For this reason visualisation is increasingly being used as a way of
solving those decision tasks.

IBM has developed an application called VOPC that uses parallel coordinates à la CityO'Scope (we
recommend re-reading article number 54 and downloading the demo of CityO'Scope in order to
understand the power of this type of visualisation and how appropriate this coordinate system is when
selecting the elements of a set that fit certain restrictions).

In parallel coordinates each variable is represented on a linear axis parallel to the others. Each
product is represented by a jagged line that joins its values on every axis. Each
axis has two sliding bars that allow you to restrict the possible values of each variable to the range
we consider acceptable for that product on that axis, disregarding all the products whose
values lie outside that range. This way it is very easy to find the products that fit all our
requirements.

Among other properties, VOPC is capable of presenting categories of products in hierarchical form,
coded through the use of colours associated with the different categories and sub-categories.
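The selection logic behind the sliding bars can be sketched without any graphics: each axis carries an acceptable range, and a product is kept only if it falls inside every range. The products and ranges below are invented for illustration.

    # The filtering logic behind parallel-coordinate tools like VOPC:
    # a product survives only if its value on every axis lies within
    # that axis's acceptable range.
    products = {
        "camera A": {"price": 199, "megapixels": 5, "weight": 180},
        "camera B": {"price": 320, "megapixels": 8, "weight": 240},
        "camera C": {"price": 250, "megapixels": 7, "weight": 150},
    }
    ranges = {"price": (150, 300), "megapixels": (6, 10), "weight": (100, 200)}

    acceptable = [
        name for name, values in products.items()
        if all(lo <= values[axis] <= hi for axis, (lo, hi) in ranges.items())
    ]
    print(acceptable)   # only "camera C" fits every range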

In the end, new visualisation systems facilitate an intrinsically difficult activity: the selection of the
most interesting choice from an extensive catalogue whose options have a great number of features.

Visualisation of online transactions

We have already seen in issues number 65, 66 and 67 different aspects of online transactions that
are indispensable to know in order to obtain relevant information about our customers. That
information can help us find:

the utilisation of the pages of our website

what concepts our customers search for

which of those concepts exist in the website but are difficult to find

which ones are searched for but are not present in the website

which pages get the most attention and which ones are ignored

the effectiveness of marketing actions

customer reactions to online promotions

the optimum placement of advertisements and publicity

the patterns that show the behaviour of our customers

web navigation paths

patterns related to the acquisition of products

payment patterns
Although much of the information described above can be deduced from the intelligent retrieval of the
information stored in logfiles and other historic archives, when the volume of data is large other
techniques are required in order to facilitate its comprehension and, above all, the quick
detection of behaviour patterns, which can change in a matter of hours and which we need to identify in
order to take appropriate decisions.
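One common prerequisite for detecting such behaviour patterns is reconstructing user sessions from the logfile. The following sketch groups requests from the same host into sessions using a 30-minute inactivity timeout, a conventional heuristic rather than something prescribed here.

    # A minimal session-reconstruction sketch: consecutive requests from
    # the same host belong to one session unless more than 30 minutes
    # pass between them.
    from collections import defaultdict

    TIMEOUT = 30 * 60   # seconds

    def sessionize(requests):
        """requests: iterable of (host, unix_time, url), assumed time-sorted."""
        sessions = defaultdict(list)        # host -> list of sessions
        for host, t, url in requests:
            user_sessions = sessions[host]
            if user_sessions and t - user_sessions[-1][-1][0] <= TIMEOUT:
                user_sessions[-1].append((t, url))   # continue current session
            else:
                user_sessions.append([(t, url)])     # start a new session
        return sessions

    log = [("1.2.3.4", 0, "/"), ("1.2.3.4", 600, "/cart"), ("1.2.3.4", 5000, "/")]
    print(sessionize(log))   # two sessions: the third request starts a new one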

Information visualisation is beginning to provide solutions to those needs as well. In the articles
mentioned above we already saw tools like Ebizinsights (developed by Advizor Solutions, formerly Visual
Insights), VISVIP by John Cugini or Anemone by Ben Fry for traffic
analysis, and Analog or Nedstat for the treatment of logfiles, among others.

SPSS, a company devoted to statistical software, using the strategy described in the book "The Grammar
of Graphics" (see issue number 74, Graphical Grammar), has developed a software development kit in
the Java language called nVIZn (pronounced "envision") that allows the user to create, in a relatively simple
way, interactive visual applications for data analysis customised to the needs of the client.

In this regard, applications written in Java are increasingly popular, allowing
programmers to build graphics that cover a broad range of possibilities, from multipanel arrays of simple
graphics up to sophisticated hierarchical graphs, passing through treemaps and complex combinations
of elementary graphics.
CRM applications of visualisation are, as we have seen, increasingly present in the toolkits used in e-
commerce. It appears, nevertheless, that we are far from reaching levels of use at which we could
consider them widely adopted. Their introduction, as happens with most issues related to
visualisation, is taking place slowly.

The power of this type of tool is undeniable. Maybe the problem resides in a certain
dispersion of approaches, in some fields where applications are very scarce (logfile
visualisation, for example), and in the fact that we are not used to applying more than very
elementary graphics as a regular medium for expression and analysis (visual illiteracy is
still high).

In previous issues we spoke about Customer Relationship Management (CRM) and we also saw the importance of
detecting user behaviour patterns. We spoke as well about the relevance of information visualisation in
presenting the results. There we treated this from the perspective of the customers themselves
(how do I find and select what I want?) and from the perspective of the business manager
(what do my customers prefer, and how do they behave?).

Nevertheless, if we put ourselves in the shoes of the webmaster, understood as the person
responsible for the web site and its architecture, we'll see that it is crucial to know the real structure of
the web site, its contents and the usage the customers make of it. It can seem nonsensical to think
that a webmaster doesn't know the structure of his or her own web site, especially having contributed to its
creation. I can certify from my own experience that the website one has in mind, or even in the
documentation, is not usually exactly the same as the real thing, mainly due to errors and
misinterpretations, especially in large websites.

Web mining can be defined as the integration of the information gathered by traditional data
mining methods and techniques with information related to the web. In a simplified way we
could say that it is data mining adapted to the particularities of the web.
As Patricio Galeas explains in his web page about web mining, its scope covers mainly three areas
within the field of knowledge discovery:

Web Structure Mining (WSM)

This speciality intends to reveal the real structure of web sites through the gathering of structure-related
data, mainly about their connectivity. Typically it takes into account two types of links: static and
dynamic.

Web Content Mining  (WCM)

Its goal is gathering data and identifying patterns related to the contents of the web and the searches
performed on them. There are two main strategies:

Web page mining, extracting patterns directly from the contents existing in web pages. In this case
the data in use can be

Free text

HTML  pages

XML  pages

Multimedia  elements

Any other type of contents existing in the web site.


Search results mining, intending to identify patterns in the results of the search engines.

Web Usage Mining  (WUM)

Here the goal is to dive into the records of the servers (logfiles), which store the transactions
performed on the web, in order to find patterns revealing the usage the customers make of it:
for example the most visited pages, usual visiting paths, etc. We can also distinguish here:

General access pattern tracking. Here the interest does not lie in the access patterns of a
particular visitor but in integrating them into trends that allow us to restructure the web site in order
to facilitate our customers' access to and utilisation of it.

Customised access pattern tracking. Here what we look for is gathering data about the individual
visitor's behaviour and their interaction with the website. This way we can establish access/purchase
profiles so that we can offer a customised experience to every customer. An archetypal case of this is
amazon.com with its purchase advice and suggestions.
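The idea behind such purchase suggestions can be sketched with simple co-occurrence counting over past baskets; the baskets below are invented, and production systems like amazon.com's are far more elaborate.

    # "Customers who bought X also bought Y", sketched as co-occurrence
    # counting over purchase baskets.
    from collections import Counter
    from itertools import combinations

    baskets = [{"book", "lamp"}, {"book", "pen"}, {"book", "lamp", "pen"}]

    pair_counts = Counter()
    for basket in baskets:
        for a, b in combinations(sorted(basket), 2):
            pair_counts[(a, b)] += 1

    def recommend(item):
        scores = Counter()
        for (a, b), n in pair_counts.items():
            if a == item:
                scores[b] += n
            elif b == item:
                scores[a] += n
        return [other for other, _ in scores.most_common()]

    print(recommend("book"))   # items most often bought together with "book"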

Web mining is a discipline with important potential. Despite the increasing and huge volume of
existing web sites, the proportion of them using web mining tools to analyse their structure,
contents and usage in order to improve the service to the user and the profitability of the business is
still low.

On the other hand, web mining suffers from the same problems as the general excess of
information: we need visualisation tools to enable us to digest and interpret the many results
it provides.

Application

Web mining is moving the World Wide Web toward a more useful environment in
which users can quickly and easily find the information they need. It includes the
discovery and analysis of data, documents, and multimedia from the World Wide Web.
Web mining uses document content, hyperlink structure, and usage statistics to assist
users in meeting their information needs. The Web itself and search engines contain
relationship information about documents. Web mining is the discovery of these
relationships and is accomplished within three sometimes overlapping areas. Content
mining is first. Search engines define content by keywords. Finding contents’ keywords
and finding the relationship between a Web page’s content and a user’s query content
is content mining. Hyperlinks provide information about other documents on the Web
thought to be important to another document. These links add depth to the document,
providing the multi-dimensionality that characterizes the Web. Mining this link structure
is the second area of Web mining. Finally, there is a relationship to other documents on
the Web that are identified by previous searches. These relationships are recorded in
logs of searches and accesses. Mining these logs is the third area of Web mining.

Understanding the user is also an important part of Web mining. Analysis of the user’s
previous sessions, preferred display of information, and expressed preferences may
influence the Web pages returned in response to a query.

Web mining is interdisciplinary in nature, spanning such fields as information
retrieval, natural language processing, information extraction, machine learning,
databases, data mining, data warehousing, user interface design, and visualization.
Techniques for mining the Web have practical application in m-commerce, e-
commerce, e-government, e-learning, distance learning, organizational learning, virtual
organizations, knowledge management, and digital libraries.

Web Mining Application in E-commerce Customer Behaviour Analysis

Web Mining Application in E-commerce Transaction Analysis

Web Mining Application in E-commerce Website Design

Web Mining Application in E-banking

Web Mining Application in M-commerce

Web Mining Application in Web Advertisement

Web Mining Application in Search Engine

Web Mining Application in Online Auction

Web Mining Application in Online Knowledge Management

Web Mining Application in Online Social Networking


Web Mining Application in E-learning

Web Mining Application in Blog Analysis

Web Mining Application in Online Personalization and Recommendation Systems

Web Mining and Intelligent Web Services

Case and Empirical Studies

Conclusion

As the popularity of the World Wide Web continues to increase, there is a growing need to develop
tools and techniques that will help improve its overall usefulness. Since one of the principal goals of the
Web is to act as a world-wide distributed information resource, a number of efforts are underway to
develop techniques that will make it more useful in this regard. The term Web mining has been used to
refer to different kinds of techniques that encompass a broad range of issues. However, while
meaningful and attractive, this very broadness has caused Web mining to mean different things to
different people [HFW96, MJHS96], and there is a need to develop a common vocabulary for all these
efforts. Towards this goal, in this paper we proposed a definition of Web mining, and developed a
taxonomy of the various ongoing efforts related to it. Then we presented a brief survey of the research
in this area. Next, we concentrated on the aspect of Web mining which focuses on issues related to
understanding the behavior of Web users, called Web usage mining. We provided a detailed survey of
the efforts in this area, even though the survey is short because of the area's newness. We provided a
general architecture of a system to do Web usage mining, and identified some of the issues and
problems in this area that may require further research and development.
