
International Journal of Advanced Research in Engineering and Technology (IJARET)

Volume 11, Issue 4, April 2020, pp. 580-594, Article ID: IJARET_11_04_056
Available online at https://iaeme.com/Home/issue/IJARET?Volume=11&Issue=4
ISSN Print: 0976-6480 and ISSN Online: 0976-6499
DOI: https://doi.org/10.34218/IJARET.12.4.2021.056

© IAEME Publication Scopus Indexed

ANALYTICAL IMPLEMENTATION OF WEB
STRUCTURE MINING USING DATA ANALYSIS
IN ONLINE BOOKING DOMAIN
Dr. M. Xavier Rex
Director, Carmel Group of Institutions,
Mandamarri, Mancherial, Telangana, India

ABSTRACT
In today's global business, the web has become the most important means of
communication. Clients and customers can find products online, which is a benefit
of doing business online. Web mining is the process of using data mining tools to analyse
and extract information from Web pages and applications automatically. Many
firms use web structure mining, together with data mining business strategies, to generate
suitable predictions and decisions for business growth, productivity, manufacturing
techniques, and more. In the online booking domain, web data mining
analysis of web structure is a crucial component that provides a systematic way of applying
new applications to real-time data with various levels of implications. Web structure
mining focuses on the structure of the web's hyperlinks. Link management
that is done correctly can lead to future connections, which can in turn increase the
prediction performance of learnt models. With increased interest in Web mining, structural
analysis research has expanded, resulting in a new research area that sits at the
crossroads of work in network analysis, hyperlink and web mining, structural
learning, and inductive programming techniques, as well as graph mining. Web
structure mining is the process of discovering structure information from the web. The
proposed WSM approach is a system for finding the structure of data stored on the
Web. Web structure mining can help clients retrieve the relevant documents
by analysing the link-oriented structure of Web content. Web structure
mining has become one of the most important resources for information extraction and
knowledge discovery as the amount of data available online has increased.
Key words: Mining, Data Analysis, Online Booking, Web Structure.
Cite this Article: M. Xavier Rex, Analytical Implementation of Web Structure Mining
using Data Analysis in Online Booking Domain, International Journal of Advanced
Research in Engineering and Technology, 11(4), 2020, pp. 580-594.
https://iaeme.com/Home/issue/IJARET?Volume=11&Issue=4

http://www.iaeme.com/IJARET/index.asp 580 editor@iaeme.com


Analytical Implementation of Web Structure Mining using Data Analysis in Online Booking Domain

1. INTRODUCTION
The Internet is a global network that is continually changing and unstructured, and it is
the world's largest information source. Web mining aims to extract relevant information from
the Internet. It is an interdisciplinary field that draws on data mining, machine learning, natural
language processing, analytics, databases, information retrieval, multimedia, and other
techniques. The amount of information available on the Internet is vast and readily
available (Kumar 2010). Knowledge is gained not only from the contents of websites, but
also from the Web's distinctive features, such as its hyperlink structure and range of contents.
Study of these features frequently reveals interesting patterns and new knowledge that can be
useful in improving user efficiency; hence approaches for extracting data from the web are an
active topic of research. These approaches aid in the extraction of information from Web
data by using structural (hyperlink) information, usage (Weblog) information, or both in the
mining process.
The Web is a large, rapidly growing, diverse, and mostly unstructured data source that supplies
a tremendous amount of information while simultaneously increasing the complexity of dealing
with it from the different viewpoints of resource providers, Web service providers, and industry
experts (Victor 2016). The following are regarded as Web mining challenges: the Web is
vast; web pages are semi-structured; Web data has a wide range of meanings; the
quality of the extracted information is hard to measure; and knowledge must be inferred from the facts
gathered. Web structure mining is used to identify the model underlying the link
structures of Web pages, to catalogue them, and to produce knowledge such as comparisons between
pages, drawing on natural language processing (NLP), machine learning, etc. (Anurag Kumar and
Singh, n.d.). Web mining includes the use
of data mining tools to identify and retrieve data from the World Wide Web automatically. Web
structure mining assists people in retrieving the documents they need by analyzing the Web's
link structure.
Web mining applies standard data mining methods to the Web. However,
the inherent features of the Web necessitate significant tailoring and extension of existing
approaches. To begin with, even though the Web includes a massive amount of material, it is
distributed across the internet. Before mining can begin, the Web pages must first be obtained (Suvarn
Sharma and Bhagat 2016). Second, because web pages are semi-structured data,
information must be extracted and represented in some form for easier processing. Third,
because Web content has a wide range of meanings, the training or testing data collection should
be sufficiently large (Martínez‐Torres et al. 2011). Despite the problems mentioned above,
the Web also offers resources that assist mining; for instance, the links between
Web pages are an essential resource to be leveraged.
Aside from the difficulty of finding the
necessary details, users may encounter other challenges when interacting with the Web,
including judging the quality of the information found, creating new knowledge from the
content available on the web, personalizing the information found, and learning about
other users (Boddu et al. 2010). Web mining methods can be used to partially or fully overcome
the concerns listed above. Furthermore, web mining methods are not the only methods
available for resolving these issues. Other research communities, including database, machine
learning, and information retrieval, are also tackling the aforementioned challenges (Chopra,
n.d.). This overlap makes it difficult to delimit exactly what counts as Web mining.

2. RELATED WORK
(TasnimSiddiqui and Aljahdali 2013) The web is the most effective channel of communication in
modern business. Many businesses are rethinking their company strategy to
increase output. Customers and clients can find products and individual businesses on the
internet, which gives them the option to do business online. When compared to a traditional workplace,
online business eliminates the barriers of time and space. Large corporations all over the world
are discovering that e-commerce is more than just marketing on the Internet. Rather, it
improves efficiency so that they can compete with the other market giants. Data mining, also
known as knowledge discovery, is employed by researchers. Web mining is a data mining
technology applied to the WWW and to the wealth of information on the web.
(Zubi, n.d.) With the Web's rapid development, users can easily become lost in its complex
hyperstructure. The primary objective of website administrators is to deliver useful
data to users to meet their demands. Web mining is among the ways that can assist website
owners in this regard, and it relies heavily on web structure mining. In web structure
mining, two page ranking algorithms, PageRank and Hyperlink-Induced Topic Search
(HITS), are extensively used. When awarding rank scores, both methods treat all links
equally. The paper also includes a comparison of the two methods. Ranking Web pages is a
significant priority since it helps users find highly ranked pages that are relevant to their
query. Various metrics have been suggested for ranking web pages efficiently,
and the paper also includes a brief review of the two most notable ones.
(Babu, Sathish, and Ashok 2011) Web usage mining is a type of web mining that uses data
mining algorithms to extract useful data from the navigational activity of World Wide Web
users. Web usage mining normally consists of three tasks: pre-processing, pattern analysis,
and knowledge discovery. Pre-processing cleans the log file by removing
entries such as errors or failures and repeated requests for the same URL from the
same address, among other things. The primary goal of pattern analysis is to filter out uninteresting
data and to visualize and interpret interesting patterns for users. The statistics gathered from the
log file can aid in the discovery of knowledge. This knowledge can be used to
classify users as Excellent, Medium, or Weak, and websites as
Excellent, Moderate, or Poor, depending on the hit rates on the website. The website's
architecture is then restructured according to user behavior or hit counts, which gives internet users a faster
response, saves memory, and so reduces HTTP requests
and resource use. The study tackles issues in the three stages of Web usage mining as well as
in Web structure mining.
(Hosseinkhani, Chaprut, and Taherdoost, n.d.) Criminal web data regularly provides
previously unknown and useful information to security agencies. The digital data used in forensic
investigation includes data about offenders' social networks. However, evaluating those
pieces of data is difficult. The reason is that an investigator must manually
extract valuable knowledge from web page text, establish relationships between
different pieces of data, and classify them into a structured database; only then is the set
ready to be examined using large-scale terrorist network analysis techniques. This
manual method of organizing data is considered inefficient since errors are
likely to occur. Furthermore, because the quality of the resulting analyzed data depends on
the investigator's experience and expertise, its reliability is not consistent. The more
skilled an operator is, the better the outcome. The major goal of that research is to propose a
framework for investigating suspected criminals using forensic analysis of
data, which addresses the reliability gap.
(Awasthi and Gupta 2019) Since the World Wide Web is a massive repository that is growing
rapidly, users can gather data and navigate among different sites on the Net. When surfing the
web, users are frequently unable to reach the page they are looking for. Web customization is a strategy
proposed to relieve users of the burden of information overload on the internet and to provide
them with the necessary information based on their requirements. Web customization is at once a strategy,
a marketing tool, and an art. Personalization requires implicitly or explicitly
gathering visitor data and feeding that data into the content delivery system to determine
what data to give to visitors and how to present it.

3. PROPOSED METHODOLOGY
Web Mining
The WWW is massive and rapidly expanding. It holds a tremendous amount of material that is
constantly growing and being updated. Companies, institutions, public bodies, and
service centres update their information regularly. Web pages lack a uniform structure
and have a complicated style. Furthermore, web pages are more complex in their arrangement than
traditional text documents (Anurag Kumar and Singh, n.d.). The WWW offers its services to a wide
range of users, who may have a wide range of interests, needs, and
experiences. When a user looks for information on the internet, he or she is only
interested in a small part of the content (Boddu et al. 2010). The challenges listed above motivate
the search for ways to use internet resources effectively, which has led to web mining. The
majority of scholars refer to web mining as any strategy that applies data mining to web data.
Web mining has been described as the use of data mining techniques to extract knowledge
from web data. Web mining tasks are classified into three categories: web usage mining,
web content mining, and web structure mining.

Figure 1 Web Mining Process

Techniques of Web mining


Web mining is broadly classified into three categories depending on the kind of data to be
mined, as seen in Figure 2: web content mining, web structure mining, and web usage mining.

Web Content Mining


Web content mining is the process of obtaining meaningful information from the content of web
documents. Web pages can contain text, multimedia, images, and graphs (Dinucă and Ciobanu 2012).
Often, the content of web documents is semi-structured or unstructured, which makes it
complicated to extract relevant information or knowledge. Multimedia data mining and text
analytics are useful for mining the content of web pages.

Web Usage Mining


Web usage mining is the use of data mining methods to uncover interesting and regular access
patterns in web log data (Asadianfam, Kolivand, and Asadianfam 2020). The interesting usage
patterns and knowledge retrieved can be used for a wide range of applications such as system
improvement, website modification, caching and pre-fetching to improve user
navigation, and web personalization.

4. WEB STRUCTURE MINING (WSM)


Web structure mining is concerned with the structure of web data. It covers the HTML (Hypertext
Markup Language) links and tags used in web pages. HTML hyperlinks are commonly used to connect
different pages. By studying these hyperlink relationships, certain useful information, such
as the importance of a specific web page, can be discovered (Anurag Kumar and Singh, n.d.). If a web
page is linked to by many other pages, it can be regarded as important and placed in a
higher-level category. The best-known line of research in the field of web structure mining is
social network analysis.

Overview of WSM
Web mining entails the following activities:
• Resource discovery: locating and retrieving the desired Web documents.
• Information selection and pre-processing: automatically selecting and pre-processing
specific information from the retrieved Web resources.
• Generalization: automatically discovering general patterns at individual Web sites and
across multiple sites.
• Analysis: validating and/or interpreting the patterns that have
been mined.

Figure 2 Process of web structure mining

The three types of Web mining are related, and web structure mining in particular takes
advantage of the hyperlink structure. Web structure mining is connected to hyperlink analysis and the
techniques presented below. Although there are three types of Web mining, the distinctions
between them are becoming increasingly blurred as they are all linked.
The task of Web structure mining is to deal with the structure of the
hyperlinks within the Web itself. This is in itself an old field of research. However, as
interest in Web mining has grown, research on structural analysis has expanded, culminating in
the emergence of a new field of study known as Link Mining, located at the intersection of work
in network analysis, hyperlink and web mining, structural
learning, inductive programming techniques, and graph mining
(Mohan, Kurmi, and Kumar 2017). This new field of research has the potential to
serve a variety of application areas, such as the Internet.
The Web comprises a wide range of objects with little in common in terms of publishing
style and substance, with disparities in authoring content and style being greater than in
conventional collections of textual information. Web pages are the objects of the WWW, and
the links are in-links, out-links, and co-citations (two pages that are both linked to by the
same page). HTML tags, word occurrences, and anchor texts are
examples of attributes. Because conventional approaches
such as database management or information retrieval cannot be applied directly, this variety of
objects presents new opportunities and challenges. Link mining has reshaped some typical data
mining tasks. Following is a list of some of the potential link mining tasks that may be applied to
Web structure mining.

Classification based on links


Link-based classification is the most recent extension of a traditional data mining
task to linked domains. The aim is to predict the category of a web page based
on the words on the page, the links among pages, anchor text, HTML tags, and other relevant
attributes found on the page.

Cluster Analysis Using Links


Cluster analysis aims to find naturally occurring sub-classes. Unlike the previous task, link-based
cluster analysis is unsupervised and can be used to discover hidden patterns in the data.

Type of Link
Predicting the existence of links involves a wide variety of tasks, such as predicting the type
of link between two objects or predicting the purpose of a link.

Strength of Link
Links can be associated with weights.

Cardinality of Link
Predicting the number of links between objects is the major task here. There are numerous
ways to construct notions of authority using the Web's link structure. The main goal of
building link mining applications is to make full use of our understanding of the Web's intrinsic
social organization.

5. WEB DATA STRUCTURE


Traditional data retrieval systems are primarily concerned with the information included in
the content of Web documents. Web mining techniques provide additional information via
the hyperlinks that connect various pages. The Web can be regarded as a directed labelled graph,
with nodes representing documents or pages and edges representing the hyperlinks among them.
This directed graph representation of the Web is called the Web Graph (Kapusta, Munk,
and Drlik 2018). A graph G is made up of two sets, B and W. B is a finite, nonempty
set of vertices. W is a set of ordered pairs of vertices, which are referred to as edges. The
notations B(G) and W(G) denote the sets of vertices and edges of graph G, respectively. A
graph can also be written as G = (B, W). Figure 3 displays a directed graph with three vertices
and three edges.

Figure 3 Directed Graph G


The vertices of G are B(G) = {X, Y, Z}, and the edges of G are W(G) = {(X, Y), (Y, X), (Y, Z)}.
The greatest number of edges in a directed graph with n vertices is n(n − 1). With three vertices, the
greatest number of edges is 3(3 − 1) = 6. In the preceding example there are no edges
(X, Z), (Z, X), and (Z, Y). A directed graph is said to be
strongly connected if there is a path from a to b and also from b to a for
any pair of distinct vertices a and b in B(G). The graph in Fig. 3 above is not strongly connected
because there is no path from vertex Z to vertex Y. The Web can be visualized as a vast
graph with hundreds of millions or billions of nodes or vertices and billions of arcs or edges.
The section that follows describes hyperlink analysis as well as the techniques that are used
in hyperlink analysis for information extraction. Furthermore, the information on a Web page
can be organized in a tree structure based on the different HTML and XML tags on the page.
Mining work in this area has centred on automatically extracting document object model
(DOM) structures from documents.
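
The definitions above can be checked on the three-vertex graph of Figure 3. The following is a minimal Python sketch, not part of the original study: the function names are illustrative assumptions, and reachability is tested with a plain breadth-first search.

```python
from collections import deque

# Directed graph G from Figure 3: vertex set B(G) and edge set W(G).
vertices = {"X", "Y", "Z"}
edges = {("X", "Y"), ("Y", "X"), ("Y", "Z")}


def max_directed_edges(n):
    """Largest possible number of edges in a directed graph with n vertices (no self-loops)."""
    return n * (n - 1)


def missing_edges(vertices, edges):
    """Ordered pairs of distinct vertices that are not joined by an edge."""
    return {(a, b) for a in vertices for b in vertices if a != b} - set(edges)


def reachable(start, vertices, edges):
    """Set of vertices reachable from start by following directed edges."""
    successors = {v: set() for v in vertices}
    for source, target in edges:
        successors[source].add(target)
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in successors[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def is_strongly_connected(vertices, edges):
    """True if every vertex can reach every other vertex."""
    return all(reachable(v, vertices, edges) == set(vertices) for v in vertices)
```

For this graph, the search started at Z reaches only Z itself, which is why the graph is not strongly connected.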

Figure 4 Web structure mining in online sales domain

Hyperlinks
A hyperlink is a structural unit that links one point on a web page to another,
either on the same or a different web page. An intra-document hyperlink connects to a different
section of the same page, whereas an inter-document hyperlink links two separate pages. There
has been a substantial amount of work on hyperlink analysis, enough to support an up-to-date
survey.
Many Web pages never include words that describe their primary function, and some
Web pages contain very little text, which makes text-based search techniques difficult to apply. However,
the way others describe such a page may be helpful (Sunny Sharma and Rana 2017). This type of
"categorization" appears in the text that accompanies the hyperlink to the page. Many studies have been
conducted, and answers to the challenge of searching or indexing the Internet have
been proposed, taking into consideration its structure and also the meta-information
contained in hyperlinks and the text accompanying them.
Several algorithms based on network analysis have been presented. Using citation
analysis, the Co-citation method and the Extended Co-citation method were developed.
These methods are simplistic, and deeper relationships between pages cannot be detected.
Three major algorithms, Hypertext Induced Topic Search (HITS), Weighted PageRank
(WPR), and PageRank, are reviewed and contrasted in detail below.

HITS
HITS distinguishes two types of pages in the Web hyperlink structure:
authorities and hubs (Boddu et al. 2010). HITS discovers authorities and
hubs for a given query. Hubs and authorities have a mutually reinforcing relationship: a
good hub is a page that points to many good authorities, and a good authority is a
page that is pointed to by many good hubs. Figure 5 shows an example. HITS associates a
non-negative authority weight a<j> and a non-negative hub weight b<j> with each page j. Its
basic operations are shown in Figure 6.
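
The mutually reinforcing relationship between hubs and authorities can be sketched as an iterative update. The Python below is an illustrative assumption rather than the implementation used in this work: the page names in `example_links`, the fixed number of iterations, and the Euclidean normalization step are all choices made for the sketch.

```python
import math


def hits(links, iterations=50):
    """Compute hub and authority weights for a small link graph.

    links maps each page to the list of pages it points to.
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    hub = {p: 1.0 for p in pages}
    authority = {p: 1.0 for p in pages}

    for _ in range(iterations):
        # Authority weight a<j>: sum of the hub weights of the pages pointing to j.
        authority = {p: 0.0 for p in pages}
        for source, targets in links.items():
            for target in targets:
                authority[target] += hub[source]
        # Hub weight b<j>: sum of the authority weights of the pages j points to.
        hub = {p: sum(authority[t] for t in links.get(p, [])) for p in pages}
        # Normalize both weight vectors so the values stay bounded.
        for weights in (authority, hub):
            norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
            for p in weights:
                weights[p] /= norm
    return hub, authority


# Hypothetical pages: three hub-like pages pointing to two authority-like pages.
example_links = {"h1": ["a1", "a2"], "h2": ["a1", "a2"], "h3": ["a1"]}
hub_weights, authority_weights = hits(example_links)
```

In this toy graph, the page pointed to by all three hubs receives the larger authority weight, and the hubs that point to both authorities receive the larger hub weights.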

Figure 5 Densely linked authorities and hubs set

Figure 6 HITS basic operations

Although HITS gives good search results for a wide variety of queries, it does not
work well in all circumstances, for the three reasons listed below.
• Mutually reinforcing relationships between hosts. Several documents on one
host may point to a single document on another host, or one document on one host
may point to several documents on another host. These situations may lead to
incorrect definitions of what constitutes a good hub or a good authority.
• Automatically generated links. Web documents generated by tools often contain
links that were inserted by the tool.
• Non-relevant nodes. Occasionally pages link to other pages that have
nothing to do with the query topic.

Model of Page Rank


The PageRank algorithm uses the web's link structure to evaluate the significance of web pages.
Brin and Page's technique extends the idea of simply counting in-links equally by
normalizing by the number of links on a page. Following the PageRank methodology, suppose
page X has pages T1...Tn that link to it. The parameter d is a damping
factor that can be set between 0 and 1; d is commonly set to 0.85. L(T) is defined as the
number of links leading away from page T. A page's PageRank is given as follows:

\[
PR(X) = (1 - d) + d \left( \frac{PR(T_1)}{L(T_1)} + \dots + \frac{PR(T_n)}{L(T_n)} \right) \tag{1}
\]

Because the PageRanks are described as forming a probability distribution over web pages, the
sum of the PageRanks of all pages is stated to be one. The damping factor d is described as the
probability that, at each page, the "random surfer" will get bored and request a random new page. The rank
of a page is divided evenly among its out-links, contributing to the
ranks of the pages to which it points (Sherlin, n.d.). The formula is recursive; however, it
may be computed by starting with any set of ranks and iterating the
calculation until it converges. PageRank, which corresponds to the principal eigenvector of the
web's normalised link matrix, may be determined using a simple iterative approach. The PageRank
system reportedly takes about an hour to compute the ranking of millions of pages.
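
The iterative computation described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the production algorithm: the page names in `example_links` are hypothetical, every page is assumed to have at least one outgoing link, and all ranks are updated together from the values of the previous pass.

```python
def pagerank(links, d=0.85, tol=1e-9, max_iter=1000):
    """Iterate PR(X) = (1 - d) + d * sum(PR(T) / L(T)) over the pages T linking to X.

    links maps each page to the list of pages it links to. Every page is
    assumed to appear as a key and to have at least one outgoing link.
    """
    pages = list(links)
    rank = {p: 1.0 for p in pages}
    for _ in range(max_iter):
        new_rank = {p: 1.0 - d for p in pages}
        for source, targets in links.items():
            # The rank of a page is divided evenly among its out-links.
            share = d * rank[source] / len(targets)
            for target in targets:
                new_rank[target] += share
        change = max(abs(new_rank[p] - rank[p]) for p in pages)
        rank = new_rank
        if change < tol:
            break
    return rank


# Hypothetical four-page booking site used only for illustration.
example_links = {
    "home": ["search", "offers"],
    "search": ["home", "booking"],
    "offers": ["booking"],
    "booking": ["home"],
}
ranks = pagerank(example_links)
```

With the formula written as in equation (1) and no pages without out-links, the ranks computed this way add up to the number of pages rather than to one; dividing each rank by the number of pages gives values that sum to one.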

Table 1 Classification of Internet domain

Domain   Context
.int     Usually used by "international" sites, such as NATO sites.
.gov     Commonly seen on US Government websites.
.com     A well-known and widely used domain name, used for any kind of website.
.edu     Used by educational institutions, such as universities.
.mil     Used for military installations in the United States.
.org     Originally meant for non-profit "organizations"; currently used for a wide range of websites. For a time, it was run by the Internet Society.
.net     Originally meant for Internet-related sites; now used for a wide range of websites.
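
The domain classes of Table 1 lend themselves to a simple lookup on the last label of a URL's host name. The sketch below is an assumption about how such a classification step might look; the short context labels and the function name `classify_domain` are illustrative, and URLs are expected to include a scheme such as `https://`.

```python
from urllib.parse import urlparse

# Top-level domains and shortened contexts, following Table 1.
DOMAIN_CONTEXT = {
    "int": "international organizations",
    "gov": "US government",
    "com": "general purpose",
    "edu": "educational institutions",
    "mil": "US military",
    "org": "organizations (originally non-profit)",
    "net": "originally Internet-related sites",
}


def classify_domain(url):
    """Return the Table 1 context for the URL's top-level domain, or 'other'."""
    host = urlparse(url).hostname or ""
    tld = host.rsplit(".", 1)[-1].lower()
    return DOMAIN_CONTEXT.get(tld, "other")
```

Domains outside Table 1, and strings that cannot be parsed into a host name, fall into the catch-all class `other`.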

Knowledge Discovery
The data gathered through the website can aid in the discovery of knowledge. This information
can be used to make decisions on a variety of issues, such as:
1. The web pages with the most hits are the most popular.
2. What are the various user navigation patterns?
3. The amount of time spent on each web page indicates the value of the page.
4. If the amount of time spent on a particular web page is insignificant, this suggests that
the page has no vital information (Boddu et al. 2010).
5. The absence of user requests for a web page suggests that the page must be changed.
6. If a log file entry constantly states "redirect" for a certain web page, the website
designer must be contacted.
Excellent pages will be moved very close to the main page, while middle-ranked pages will be
moved to the next level. If the website owner and designer agree, the pages with the highest
visit counts can be prioritized for placement closer to the home page. A heap tree can be
constructed from the hit counts recorded in the log file during a specific session (Akshi Kumar,
Dabas, and Hooda 2020). The heap tree assists in making decisions regarding
the architecture of the website for the following intervals, so that the most popular pages
can be brought very close to the home/parent page. With this restructuring, web
users will have faster access to pages while the site also makes the best use of resources and
system memory.
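
The heap built from hit counts can be sketched with Python's standard `heapq` module. This is an illustrative assumption, not the procedure used in the study: the page paths in `session_log` are hypothetical, and the log is assumed to have already been reduced to one requested path per entry.

```python
import heapq
from collections import Counter


def hit_counts(requested_pages):
    """Count hits per page from an iterable of requested page paths."""
    return Counter(requested_pages)


def build_max_heap(counts):
    """Build a heap ordered by hit count; counts are negated because heapq is a min-heap."""
    heap = [(-hits, page) for page, hits in counts.items()]
    heapq.heapify(heap)
    return heap


def most_popular(counts, k):
    """Return the k pages with the highest hit counts, most popular first."""
    heap = build_max_heap(counts)
    return [heapq.heappop(heap)[1] for _ in range(min(k, len(heap)))]


# Hypothetical requests from one session of a booking site.
session_log = ["/search", "/booking", "/search", "/offers", "/search", "/booking"]
counts = hit_counts(session_log)
top_pages = most_popular(counts, 2)
```

The pages returned first are the candidates for placement closest to the home page.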

Applications
HITS was initially employed in IBM's Clever search engine, and PageRank is used by Google in
conjunction with other features such as anchor text, IR measures, and proximity. The concept of
authority stems from the idea that users want to find not just a list of relevant pages, but
the relevant pages of the highest quality.
The Web, on the other hand, is made up not only of pages but also of the links that connect
them (Kanathey, Thakur, and Jaloree 2018). This structure offers a great deal of information
that should be taken advantage of. PageRank and HITS are ranking algorithms in which the
scores can be calculated as a reference value in a linear model. HITS and PageRank are
used as starting points for new solutions, and both methodologies have several
extensions. There are a variety of additional link-based approaches that can be used on the
Web. Link information can be employed for clustering or classifying Web pages in addition to
ranking them. The approach is founded on the assumptions that (1) if page p1 links to page
p2, p1 should have content similar to that of p2, and (2) if p1 and p2 are co-cited by some
common pages, p1 and p2 should likewise have similar content. Based on the citation and
co-citation relationships among the pages, web pages can be grouped into a number of
connected page groups.
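
Assumption (2) above can be illustrated by counting, for every pair of pages, how many pages link to both of them. The Python sketch below is only an illustration; the page names in `example_links` are hypothetical, and a higher count is treated as a rough signal of similarity, not as a definitive measure.

```python
from collections import Counter
from itertools import combinations


def cocitation_counts(links):
    """Count, for each pair of pages, how many pages link to both of them.

    links maps each page to the collection of pages it links to.
    """
    counts = Counter()
    for targets in links.values():
        # Every pair of distinct targets of one page is co-cited once by that page.
        for p1, p2 in combinations(sorted(set(targets)), 2):
            counts[(p1, p2)] += 1
    return counts


# Hypothetical link structure: pages a, b, and c cite pages p1, p2, and p3.
example_links = {"a": ["p1", "p2"], "b": ["p1", "p2", "p3"], "c": ["p3"]}
cocited = cocitation_counts(example_links)
```

Pairs with high counts could then be placed in the same group by any standard clustering method.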
As a result, ranking based on the content of data can be enhanced. PageRank, HITS, and
other link-based algorithms can also be used to rank page sections (blocks). The fundamental ideas
are that: (1) important blocks have higher weighted links, (2) a block is conceptually
related to a page if it contains a link to that page, and (3) two pages are similar
if they are co-cited by a common block. The importance of a block is characterized
with respect to its size and position on the screen when browsing. The reported
findings show that block-level PageRank and HITS can greatly increase the recognition rate.

6. RESULT AND DISCUSSION


The main procedure for putting web structure mining into practice is as follows:
1. Manual or automatic extraction of page rank.
2. Obtaining hyperlinks from a web page.
3. Domain classification on the internet.
4. Computation of major domain influences.
5. Determine the attributes of the URL.
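
Step 2 of the list above, obtaining the hyperlinks from a web page, can be sketched with Python's standard `html.parser` module. This is a minimal illustration under stated assumptions, not the tool used in the study: the class and function names are hypothetical, the page is assumed to be available as an HTML string, and relative links are resolved against an assumed base URL.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collect the href targets of <a> tags, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    """Return the absolute URLs of all hyperlinks found in an HTML string."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links


# Hypothetical page fragment used only for illustration.
sample_html = (
    '<p><a href="/rooms">Rooms</a> '
    '<a href="https://example.org/about">About</a> '
    '<a name="top">no link</a></p>'
)
found = extract_links(sample_html, "https://example.com/index.html")
```

The extracted URLs form the out-links of the page and can be fed into the ranking computations.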
Consider Figure 7, an example hyperlink structure for three pages X, Y, and Z.
Equation (1) can be used to determine the PageRank of pages X, Y, and Z.

Figure 7 Structure of Hyperlink for 3 pages


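Start with an initial PageRank of 1.0 for every page and iterate. The damping factor d is set to 0.85.

\[
PR(X) = (1 - d) + d \left( \frac{PR(Z)}{L(Z)} \right) = (1 - 0.85) + 0.85 \left( \frac{1}{2} \right) = 0.15 + 0.425 = 0.575 \tag{2}
\]

\[
PR(Y) = (1 - d) + d \left( \frac{PR(X)}{L(X)} + \frac{PR(Z)}{L(Z)} \right) = 0.15 + 0.85 \left( \frac{0.575}{2} + \frac{1}{2} \right) = 0.819 \tag{3}
\]

\[
PR(Z) = (1 - d) + d \left( \frac{PR(X)}{L(X)} + \frac{PR(Y)}{L(Y)} \right) = 0.15 + 0.85 \left( \frac{0.575}{2} + \frac{0.819}{1} \right) = 1.091 \tag{4}
\]

The second iteration starts from the PageRank scores obtained in (2), (3), and (4):

\[
PR(X) = 0.15 + 0.85 \left( \frac{1.091}{2} \right) = 0.614 \tag{5}
\]

\[
PR(Y) = 0.15 + 0.85 \left( \frac{0.614}{2} + \frac{1.091}{2} \right) = 0.875 \tag{6}
\]

\[
PR(Z) = 0.15 + 0.85 \left( \frac{0.614}{2} + \frac{0.875}{1} \right) = 1.155 \tag{7}
\]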
The following PageRanks were obtained after several more iterations of the above
algorithm, as given in Table 2.

Table 2 The iterative calculation for Page rank


Iteration PR(X) PR(Y) PR(Z)
0 1 1 1
1 0.575 0.819 1.091
2 0.614 0.875 1.155
… … … …
15 0.702 0.998 1.296
16 0.702 0.998 1.296
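
The iteration behind Table 2 can be reproduced with a short Python sketch. The link structure below is an assumption inferred from equations (2) to (7): X links to Y and Z, Y links to Z, and Z links to X and Y. As in those equations, each page is updated in turn and the newly computed values are reused within the same pass. Because the sketch does not round intermediate values, later iterations can differ from the tabulated figures in the third decimal place.

```python
# Link structure inferred from equations (2)-(7): X -> Y, Z; Y -> Z; Z -> X, Y.
OUT_LINKS = {"X": ["Y", "Z"], "Y": ["Z"], "Z": ["X", "Y"]}
ORDER = ["X", "Y", "Z"]
D = 0.85  # damping factor d


def iterate_in_place(rank):
    """One pass that updates X, Y, Z in turn, reusing values already updated in this pass."""
    rank = dict(rank)
    for page in ORDER:
        incoming = [p for p in ORDER if page in OUT_LINKS[p]]
        rank[page] = (1 - D) + D * sum(rank[p] / len(OUT_LINKS[p]) for p in incoming)
    return rank


rank = {page: 1.0 for page in ORDER}
history = [rank]  # history[k] holds the ranks after k iterations
for _ in range(16):
    rank = iterate_in_place(rank)
    history.append(rank)

for step, values in enumerate(history[:3]):
    print(step, [round(values[p], 3) for p in ORDER])
```

The ordering PR(X) < PR(Y) < PR(Z) is established after the first pass and is preserved in the later passes of this sketch.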
It is simple to compute the PageRank scores for such a small group of pages, but
it is more difficult to do so for a Web with billions of pages. Table 2 shows that the PageRank of Z
is greater than the PageRank of Y and X. This is because, as illustrated in Fig. 7, page
Z has two incoming and two outgoing links. Page Y has two incoming links and one
outgoing link. Since page X has only one incoming link and two outgoing
links, it has the lowest PageRank. After iteration 15, the PageRank values in Table
2 stabilize. According to previous experiments, PageRank converges to within a reasonable
tolerance. Figure 8 depicts the convergence of the PageRank computation for Table 2 as a graph.
[Figure 8: line plot of the values of PageRank (vertical axis, 0 to 1.4) against the iteration number (horizontal axis, 1 to 8), with one curve each for PR(X), PR(Y), and PR(Z).]

Figure 8 PageRank Convergence Calculation

Weighted PageRank Algorithm


Rather than dividing the rank value of a page equally among its outgoing links, this approach
assigns larger rank values to the more significant pages. Each outgoing link receives a share of
the rank that reflects the significance of the page it points to. Both backlinks and forward links
are taken into account in this algorithm. An incoming link (backlink) is a link that points to a
given page, whereas an outgoing link (forward link) is a link that leads out of it. Because it uses
these two factors, the technique is considered more effective than the basic PageRank algorithm.
The numbers of inbound and outbound links are recorded, and the significance of incoming and
outgoing links is expressed as two weight values, W^in(x, y) and W^out(x, y). W^in(x, y) is the
weight of link (x, y) given by equation (8). It is calculated from the number of inbound links of
page y and the total number of inbound links of all reference pages of page x.
\[ W^{in}_{(x,y)} = \frac{I_y}{\sum_{p \in R(x)} I_p} \tag{8} \]


I_y denotes the number of inbound links of page y, I_p is the number of inbound links of
page p, and R(x) is the list of reference pages of page x, i.e., the pages that x links to.
W^out(x, y) is the weight of link (x, y) given by equation (9). It is calculated from the number of
outbound links of page y and the total number of outbound links of all reference pages of
page x.
\[ W^{out}_{(x,y)} = \frac{O_y}{\sum_{p \in R(x)} O_p} \tag{9} \]
O_y is the number of outbound links of page y, and O_p is the number of outbound links of page p.
The weighted PageRank is then computed as follows, where B(y) denotes the set of pages that link to page y:
\[ WPR(y) = (1 - d) + d \sum_{x \in B(y)} WPR(x) \, W^{in}_{(x,y)} \, W^{out}_{(x,y)} \tag{10} \]
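
As a hedged illustration of equations (8) to (10), the sketch below computes the two link weights and the weighted PageRank for the three-page example. The graph (X links to Y and Z, Y links to Z, Z links to X and Y), the starting value of 1.0, and the function names invert, link_weights, and weighted_pagerank are assumptions made for this sketch and are not taken from the paper. It also assumes that the reference pages of every page have at least one outgoing link between them, so that the denominator of (9) is never zero.

```python
# Assumed example graph: X -> Y, Z; Y -> Z; Z -> X, Y.
OUT_LINKS = {"X": ["Y", "Z"], "Y": ["Z"], "Z": ["X", "Y"]}


def invert(out_links):
    """Return, for each page, the list of pages that link to it."""
    in_links = {page: [] for page in out_links}
    for source, targets in out_links.items():
        for target in targets:
            in_links[target].append(source)
    return in_links


def link_weights(out_links):
    """Map each link (x, y) to the pair (W_in, W_out) of equations (8) and (9)."""
    in_links = invert(out_links)
    weights = {}
    for x, references in out_links.items():
        # Denominators: totals over R(x), the pages that x links to.
        total_in = sum(len(in_links[p]) for p in references)
        total_out = sum(len(out_links[p]) for p in references)
        for y in references:
            weights[(x, y)] = (len(in_links[y]) / total_in,
                               len(out_links[y]) / total_out)
    return weights


def weighted_pagerank(out_links, d=0.85, iterations=50):
    """Iterate equation (10), updating the pages one after another in place."""
    in_links = invert(out_links)
    weights = link_weights(out_links)
    ranks = {page: 1.0 for page in out_links}
    for _ in range(iterations):
        for y in out_links:
            total = sum(ranks[x] * weights[(x, y)][0] * weights[(x, y)][1]
                        for x in in_links[y])
            ranks[y] = (1 - d) + d * total
    return ranks


weights = link_weights(OUT_LINKS)
ranks = weighted_pagerank(OUT_LINKS)
print(weights[("X", "Y")])
print({page: round(value, 3) for page, value in ranks.items()})
```

For the link (X, Y), the reference pages of X are Y and Z, each with two inbound links, so W^in(X, Y) = 2/(2 + 2) = 0.5. Y has one outbound link and Z has two, so W^out(X, Y) = 1/(1 + 2) = 1/3.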

HITS Algorithm
Web pages are of two different types: hubs and authorities. Authorities are pages that
contain essential information. Hub pages serve as reference lists, directing users to authoritative
sources. As a result, a good hub page on a subject links to many good authority pages on that topic,
and a good authority page is linked to by many good hub pages on the same topic. Figure 4 depicts
hubs and authorities, as well as their calculations. According to Kleinberg, a page can serve as
both a hub and an authority. This mutually reinforcing relationship leads to the iterative HITS
(Hyperlink-Induced Topic Search) algorithm.
The HITS algorithm treats the WWW as a directed graph G(N, T), with N denoting
pages and T denoting links. It has two basic phases: the sampling
step and the iterative step. In the sampling step, a set of pages relevant
to the given query is gathered, i.e., a sub-graph S of G that is rich in authority
pages is obtained. In the iterative step, hub and authority weights are computed for the pages of this sub-graph as follows:
\[ H_w = \sum_{q \in I(w)} A_q \tag{11} \]
\[ A_w = \sum_{q \in B(w)} H_q \tag{12} \]
where H_w is the hub weight and A_w is the authority weight of page w, I(w) denotes the set of
reference pages of page w (the pages that w links to), and B(w) denotes the set of referrer
pages of page w (the pages that link to w). The hub weight of a page is thus the sum of the authority weights of the
pages it links to, and its authority weight is the sum of the hub weights of the pages that link to it.
Figure 9 shows an example of how authority and hub scores are calculated.

Figure 9 Hubs and Authorities Calculation


\[ A_w = H_{S1} + H_{S2} + H_{S3} \]
\[ H_w = A_{R1} + A_{R2} + A_{R3} \]
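
The iterative step can be sketched as follows. This is an illustrative sketch under stated assumptions, not the paper's implementation: the three-page graph (X links to Y and Z, Y links to Z, Z links to X and Y) and the function name hits are hypothetical. The sketch also rescales both weight vectors to unit length after every round, a step that equations (11) and (12) do not show but that is needed in practice to keep the weights from growing without bound.

```python
import math

# Assumed example graph: X -> Y, Z; Y -> Z; Z -> X, Y.
OUT_LINKS = {"X": ["Y", "Z"], "Y": ["Z"], "Z": ["X", "Y"]}


def hits(out_links, iterations=50):
    """Return (hub, authority) weights after repeated application of (11) and (12)."""
    # B(w): pages that link to w. I(w) is out_links[w] itself.
    in_links = {page: [] for page in out_links}
    for source, targets in out_links.items():
        for target in targets:
            in_links[target].append(source)

    hub = {page: 1.0 for page in out_links}
    auth = {page: 1.0 for page in out_links}
    for _ in range(iterations):
        # Equation (12): authority weight = sum of hub weights of referrer pages.
        auth = {w: sum(hub[q] for q in in_links[w]) for w in out_links}
        # Equation (11): hub weight = sum of authority weights of reference pages.
        hub = {w: sum(auth[q] for q in out_links[w]) for w in out_links}
        # Normalization (an addition to the equations): scale to unit length.
        auth_norm = math.sqrt(sum(v * v for v in auth.values())) or 1.0
        hub_norm = math.sqrt(sum(v * v for v in hub.values())) or 1.0
        auth = {w: v / auth_norm for w, v in auth.items()}
        hub = {w: v / hub_norm for w, v in hub.items()}
    return hub, auth


hub, auth = hits(OUT_LINKS)
print("hub      ", {page: round(value, 3) for page, value in hub.items()})
print("authority", {page: round(value, 3) for page, value in auth.items()})
```

Because of the normalization, only the relative sizes of the weights are meaningful: a page with a larger authority weight is pointed to by stronger hubs, and a page with a larger hub weight points to stronger authorities.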
The HITS algorithm has the following limitations:
• Hubs and authorities: It is difficult to distinguish between hubs and authorities, since
many websites serve as both.
• Topic drift: Because all links are weighted equally, HITS does not always return the
documents most relevant to the user's query.
• Automatically generated links: HITS gives automatically generated links the same weight as other links, even if
they do not lead to pages relevant to the user's query.
• Efficiency: The HITS algorithm is not efficient in real time.

7. CONCLUSION
Web structure mining is a useful tool for extracting data from previous user behaviour and
is a key component of this strategy. It employs several
techniques to rank the relevant sites, all of which treat all links equally when allocating the
ranking score. This paper discussed the significance of web structure mining for information retrieval.
Web mining is a field of study that addresses web knowledge problems using web
structure mining, and it has given rise to link mining and block-level link mining. We also
examined two common algorithms for this purpose, HITS and PageRank. Both assess the relevance
of a web page based on the web's hyperlinks. The primary goal of this work is to investigate
the hyperlink organization and to understand the Web graph directly. Web mining is the process of
obtaining data from the internet as efficiently as possible. Web structure mining
generally involves many methods that lead to the retrieval of data from websites; in
general, it is the process of effectively retrieving data from a website for an
online user. Because this is such a broad topic with so much work still to be done, we believe
this paper will serve as a useful preliminary step for identifying research directions.
