
Intelligent Web Mining Techniques using Semantic Web

Nadeem Akram N
Epsilon, Bangalore-560037, India
nadeemakramhn@gmail.com

V. Ilango
CMR Institute of Technology, Bengaluru-560047, India
thirukkural69@gmail.com

2022 First International Conference on Electrical, Electronics, Information and Communication Technologies (ICEEICT) | 978-1-6654-3647-2/22/$31.00 ©2022 IEEE | DOI: 10.1109/ICEEICT53079.2022.9768546

Abstract- The evolution of the Web over the years has brought significant progress in how information is organized and stored; as the complexity of the data stored and retrieved increases, it becomes mandatory to upgrade the methodologies involved in handling it. During web access it becomes essential to determine the nature of the user and understand his preferences, in order to generate a customized look and feel for the website and thereby offer a personalized web surfing experience. The primary stage of customized web access is determining what the user wants from the web through a proper semantic query methodology, one that represents what the user needs rather than whatever content happens to match the request query. Once the user request is understood, the next task is to categorize the possible results for the query according to relevance. Furthermore, user data is categorized along different aspects to create web personalization. This takes the web to the next level with web usage prediction from web log files, thereby creating a new customized web for every user, with the major focus on presenting what matters to the users.

Keywords: Personalised Web Experience, Web Content Classification, Web Pattern Prediction, Web Indexing, Web Analytics, Gradient Descent, Statistical Information Grid (STING), Dimensionality Reduction Techniques.

I. INTRODUCTION

Web content segregation is an important task in determining the nature of a given request; using it, web content can be organised in a more enhanced way and indexed effectively, which reduces the complexity of searching for the content relevant to a user request. Going further, understanding the typical mindset and nature of the web user and shaping the content of the web page accordingly creates a logical shift from generating generic web content to generating customized web content, and contributes to web content classification that yields high throughput with respect to the web content expected versus the web content delivered. These steps in the conventional web hierarchy bring about a comprehensive shift in the entire web content management process. To take the web browsing experience to the next level, web content needs to be generated according to the user behaviour understood from settings and previous user experience. This allows a complete overhaul of a website, with new content for each web user, bringing the web user's preferences to the forefront. In addition, traditional web architectures do not have a framework for effectively maintaining web logs of user browsing data and producing results that determine user behaviour related to the browsing experience; this is achieved with web usage statistics, which have a positive impact on understanding the nature of the user and generating user-related content based on preferences. Web usage patterns are a hot topic in the web content mining domain. The main purpose of these concepts is to capture the business logic of existing browsing behaviour and of future access, in order to determine surfing patterns and the web content related to them, and to improve and personalize the look of the web for all users. This work reflects the latest advances in web content management, improving existing web architectures from generic content views to personalized content views, and aims to understand user queries and future web usage patterns so as to deliver a personalized web surfing experience.

II. RELATED WORK

This section presents an overview of recent work done in the domain of advanced web content management techniques. [1] described a means of understanding user web browsing that starts from portal sites maintained as home pages; these serve as a one-stop mode to reach the intended website, although this may not be an effective means for a new, emerging website. They further proposed NaviSOM, a text mining technique for effective web content indexing that uses machine learning to create an effective navigation structure. [2] proposed an effective web query technique, the "Smart Query Engine", which creates an intelligent query autofill based on recommendations drawn from past experience, thereby enabling the user to pick the right input query word for a given web search. [3] described an effective information retrieval software tool, the "wrapper", which extracts information from a web page based on its HTML tags and can reorganise it effectively during run-time page loading; they also described the page analysis phase of information retrieval, which is an effective way to understand and locate the contents of a web page. [4] proposed a Hybrid Object Model for deep web information extraction in which the entire web is considered an entity and every control on a web page is termed an object; each object is interrelated with other objects through relational parameters, which makes it possible to recognise the different elements of a web page and reorganise them according to some objective function during page loading at run time. [2] defined data classification for a web page as categorizing the content based on its purpose and relevance using popular data mining algorithms such as KNN (k-Nearest Neighbour), Naïve Bayes and others, whose ultimate purpose is to classify the elements of a data space. [4] proposed an integral concept of web mining in their web cube architecture as web personalisation,
wherein a customized, user-based dashboard is created based on the user's needs and preferences in order to generate the web content required by a specific user; they further devised a set of methodologies to remove dirty data from the web page, thereby making it content based rather than a descriptive log file. [8] devised an effective means of web personalisation through a sequence alignment method that partitions users under a clustering scheme; this assists in analysing web user behaviour and nature and can be employed to generate an effective personalised web dashboard for each user with respect to their preferences. This methodology enables users to view only the content they are interested in rather than generic content.

III. OBJECTIVES

A. Formation of content-based queries for effective indexing (Gradient Descent).

B. Web page segregation based on content type (STING: Statistical Information Grid).

C. Generating custom-driven webpages for users through web page classification (Dimensionality Reduction).

IV. PROPOSED WORK

1. Implementing an advanced Web Semantic Segment-based Query Indexing methodology for effective web content indexing.

2. Incorporating web content categorization based on the nature of the website.

3. Devising an advanced methodology in web content classification for an effective web hierarchy.

Fig 1: Layered Web Personalization architecture. (Layers, top to bottom: Custom Driven Web Browsing Experience; Cognitive Web Usage Pattern Analysis; Web Usage Statistics; Web Content Personalization; Web Content Classification; Web Content Categorization; Semantic Segment Based Query Indexing; resting on Advanced Web Content Management Techniques. Side labels: Web Space, User Driven.)

The figure above shows the layered architecture of the advanced web content management techniques. The lower layers form the foundation of the architecture: the first layer, semantic segment-based query indexing, handles user requests by producing segment-based automatic query responses from queries entered in an unstructured way, so that the user's intent is understood and passed to the next layer. Above it, the web content classification layer effectively reorganizes web page content based on the nature of the user, and above that, the web personalization layer is primarily aimed at creating custom dashboards according to the user's browsing experience. Next, the web usage statistics layer effectively analyses user surfing behaviour in terms of web log files and browsing history. At the top is a layer dedicated to predicting future browsing behaviour from historical data and usage patterns.
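To make the data flow between these layers concrete, the minimal Python sketch below models each layer of Fig. 1 as a function applied in sequence to a request context. The function names follow the layer names (a few layers are condensed), but all of the bodies are illustrative placeholders rather than anything specified in this paper.

# Illustrative sketch of the layered flow in Fig. 1 (placeholder logic only, not the paper's code).

def semantic_segment_query_indexing(ctx):
    # Layer 1: turn the raw, unstructured query into a structured, segment-based form.
    ctx["structured_query"] = ctx["raw_query"].lower().split()
    return ctx

def web_content_classification(ctx):
    # Layers 2-3: categorize and reorganize candidate content for this user.
    ctx["classified_content"] = sorted(ctx.get("content", []))
    return ctx

def web_content_personalization(ctx):
    # Layer 4: build a custom dashboard from the classified content.
    ctx["dashboard"] = ctx["classified_content"][:5]
    return ctx

def web_usage_statistics(ctx):
    # Layer 5: record the interaction in a web log for later analysis.
    ctx.setdefault("web_log", []).append(ctx["raw_query"])
    return ctx

def usage_pattern_prediction(ctx):
    # Top layers: predict likely future requests from the accumulated log.
    ctx["predicted_next"] = ctx["web_log"][-1]
    return ctx

LAYERS = [semantic_segment_query_indexing, web_content_classification,
          web_content_personalization, web_usage_statistics, usage_pattern_prediction]

def handle_request(raw_query, content):
    ctx = {"raw_query": raw_query, "content": content}
    for layer in LAYERS:          # each layer passes its enriched context upwards
        ctx = layer(ctx)
    return ctx

print(handle_request("working of US constitution", ["article A", "article B"])["dashboard"])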
V. FINDINGS & PROPOSED METHODOLOGIES

1. Problem 01: Formation of content-based queries for effective indexing.

a. Findings:
1. Lack of content-based retrieval in the traditional web crawler architecture.
2. Irrelevant web indexing for unstructured queries.

b. Proposed Solution Metric:
1. Facilitation of effective content-based query indexing through segment-based queries.
2. Transformation of unstructured, irrelevant queries into structured, context-based semantic queries.

2. Problem 02: Web page segregation based on content type.

a. Findings:
1. No means of web page categorization in the prevailing architecture.
2. Inappropriate mechanism for linking domain-specific sites.

b. Proposed Solution Metric:
1. Constructing web page categorization based on the website type.
2. Proposing efficient means of linking domain-specific sites using the STING methodology.

3. Problem 03: Generating custom-driven webpages for users through web page classification.

a. Findings:
1. Similar web page content is broadcast to every user, although the purpose of each user may not be the same.
2. Content is presented on the web page that has no relevance to the end user's need.

b. Proposed Solution Metric:
1. Incorporating methods for generating purpose-based dynamic webpage content.
2. Populating the web page based on a content relevance factor.

VI. PROPOSED METHODOLOGY

A. Formation of content-based queries for effective indexing using the Gradient Descent algorithm

Introduction: Gradient descent is a first-order optimization algorithm popularly used in performance optimization applications. The objective function is set as the destination and the input is initialised with a high value; after every iteration the necessary computation is made along the negative gradient so as to approach the destination, which contains the lowest point.

Fig. 2: Gradient Descent Principle.
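For reference, a minimal Python sketch of the gradient descent update rule is given below; the quadratic objective, learning rate and convergence threshold are arbitrary illustrative choices, not values taken from the paper.

# Minimal gradient descent on f(x) = (x - 3)^2, whose minimum is at x = 3.
def gradient_descent(grad, x0, learning_rate=0.1, threshold=1e-6, max_iters=1000):
    x = x0
    for _ in range(max_iters):
        step = learning_rate * grad(x)      # move against the gradient
        x -= step
        if abs(step) < threshold:           # converged: the update fell below the threshold
            break
    return x

if __name__ == "__main__":
    # f(x) = (x - 3)^2  =>  f'(x) = 2 * (x - 3)
    minimum = gradient_descent(lambda x: 2 * (x - 3), x0=10.0)
    print(round(minimum, 4))                # approximately 3.0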
Working: The algorithm listed below implements the method of gradient descent in the following way.

1. An objective function, the expected query in a structured format, is constructed.
2. The given unstructured input query is taken as the initial point.
3. After every iteration, the relevance of each word is determined to understand the context of the query.
4. A three-level hierarchy of word buckets is maintained; depending on the relevance of a word in the query, the word is stored in the corresponding bucket.
5. A one-to-one mapping is made between the words of all the buckets for a given query.
6. The above process is repeated until the threshold of convergence between the expected query and the observed query is reached.

Algorithm: Web indexing (Unstructured Query)
//Input: An unstructured and unprocessed query.
//Output: The content-based query and an optimized search result listing.
Step01: Start.
Step02: Input the query.
Step03: Accept the query and determine its nature.
Step04: Find the category words in the given query. For example, for the query "working of US constitution", the category word would be "US constitution", "Constitution" or "US".
Step05: Store the category word in the word bucket at level 01.
Step06: Find the 2nd-level category words for the given query, for example "working" or "US".
Step07: Store these category words in the word bucket at level 02.
Step08: Similarly, store all the relevant words in their respective buckets based on their category of relevance.
Step09: Use the gradient descent approach to find the relevance between the 1st word of the level 1 bucket and the 1st word of the level 2 bucket.
Step10: Similarly, find the relevance between the 1st word of the level 2 bucket and the 1st word of the level 3 bucket.
Step11: Compute the relevance factor among all 3 buckets, generate a response query and display it to the user.
Step12: Based on the results, the user updates the query, which is an optimized version of the earlier one.
Step13: Repeat Step05 to Step11 until the threshold value set by the gradient descent is reached, after which this is treated as the final query and the results are displayed according to the relevance of the updated query.
Step14: End.

Fig 3: Control flow of web indexing.
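The following Python sketch illustrates the bucket-based refinement loop described above, under simplifying assumptions: the bucketing rule and the relevance score are naive placeholders (token length and character overlap) standing in for the gradient-descent relevance computation, and get_feedback stands for the user's correction of the suggested query.

# Illustrative word-bucket query refinement (placeholder scoring, not the paper's implementation).

def build_buckets(query, levels=3):
    # Very naive bucketing: longer tokens are treated as more category-like.
    tokens = sorted(query.lower().split(), key=len, reverse=True)
    return [tokens[i::levels] for i in range(levels)]

def relevance(bucket_a, bucket_b):
    # Placeholder relevance: fraction of shared characters between the bucket heads.
    if not bucket_a or not bucket_b:
        return 0.0
    a, b = set(bucket_a[0]), set(bucket_b[0])
    return len(a & b) / len(a | b)

def refine_query(query, get_feedback, threshold=0.8, max_rounds=5):
    # get_feedback(candidate) returns the user's corrected or confirmed query string.
    for _ in range(max_rounds):
        buckets = build_buckets(query)
        score = min(relevance(buckets[i], buckets[i + 1]) for i in range(len(buckets) - 1))
        candidate = " ".join(b[0] for b in buckets if b)
        if score >= threshold:
            return candidate                # treated as the final, structured query
        query = get_feedback(candidate)     # the user refines the suggestion
    return query

if __name__ == "__main__":
    final = refine_query("working of US constitution", get_feedback=lambda c: c)
    print(final)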

Results: The outcome of the above concept is as follows.

1. The user enters the unstructured and unprocessed query in the search box.
2. The query is accepted by the crawler, which breaks it into word bags.
3. Each word is stored in a bucket depending upon its relevance.
4. The crawler auto-generates the query and pops an updated query into the filter bar.
5. If the user's query is not the one shown in the filter, the user enters the next-level query based on the suggestion in the filter.
6. The above process continues until the query is understood by the crawler.
7. Once it is understood, the crawler filters the search list with respect to the understood query, as shown below.

Fig. 4: Web Indexing Results

Fig. 5: Web Indexing Level 02

8. The enhanced version of the query and the updated search results can be seen.
9. The search results contain the exact context specified by the user and understood by the crawler.

B. Web page segregation based on content type

Introduction: Statistical Information Grid (STING) is an effective grid-based clustering technique that can be used for spatial analysis of data. A training space is constructed as a series of levels, each level comprising a number of cells which contain information. It is a top-down hierarchical model in which the top layer contains the core output data and each cell of the top layer comprises a number of cells at the next lower layer; the frequency of each cell is computed by means of the mean, min and max distribution of the data.
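The sketch below illustrates the grid statistics that STING relies on: data points are binned into bottom-level cells, per-cell count/mean/min/max are computed, and each parent cell at the next level aggregates its 2x2 block of children. It is a generic illustration of the STING idea with made-up points, not the web segregation model constructed later in this section.

# Illustrative STING-style grid statistics (generic example, not the paper's web model).
from statistics import mean

def cell_stats(values):
    return {"count": len(values), "mean": mean(values), "min": min(values), "max": max(values)}

def build_bottom_layer(points, grid_size=4):
    # points: list of (x, y, value) with 0 <= x, y <= 1; bin them into grid_size x grid_size cells.
    cells = {}
    for x, y, v in points:
        key = (min(int(x * grid_size), grid_size - 1), min(int(y * grid_size), grid_size - 1))
        cells.setdefault(key, []).append(v)
    return {key: cell_stats(vals) for key, vals in cells.items()}

def aggregate_layer(child_layer):
    # Each parent cell summarises the 2x2 block of child cells beneath it.
    parents = {}
    for (cx, cy), stats in child_layer.items():
        parents.setdefault((cx // 2, cy // 2), []).append(stats)
    return {key: {"count": sum(s["count"] for s in group),
                  "mean": sum(s["mean"] * s["count"] for s in group) / sum(s["count"] for s in group),
                  "min": min(s["min"] for s in group),
                  "max": max(s["max"] for s in group)}
            for key, group in parents.items()}

if __name__ == "__main__":
    pts = [(0.1, 0.2, 5.0), (0.15, 0.22, 7.0), (0.8, 0.9, 1.0), (0.7, 0.85, 3.0)]
    bottom = build_bottom_layer(pts)
    print(aggregate_layer(bottom))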

Fig 6: Statistical Information Grid (STING)

The proposed concept is implemented by means of STING: a base parameter for web content segregation is set, and a model is constructed from this initial base to segregate the web page categories. The method of applying the Statistical Information Grid to the segregation of web page categories is listed below.

Working:
1. Any given web page has base-level content, which is a series of HTML tags.
2. Web page segregation is the process of categorizing the web page based on its nature.
3. The given website URL is accepted as the input.
4. The domain of the website is determined.
5. Starting from the domain, a pool of data is generated that can be used to determine the nature of the website.
6. The data generation begins from the URL name and moves upwards towards various RSS feeds, wiki links and domain providers to pool the data.
7. A set of website categories is already determined, each with its code point.
8. Based on the data obtained from the various sources, the given input URL is linked to one of the training-set website categories.
9. If no such category exists, a new category code point is generated and preserved for future categorization.

Results: The results are explained as follows.

1. The web user enters the URL to visit a specific website.
2. The algorithm accepts the URL and determines its category, say for instance Gmail.
3. The given URL is recognised as the Gmail website and the corresponding user transactions are fetched.
4. The particular transactions are decomposed into a specific category, termed Mail.
5. All the user data with respect to Mail is displayed.

Fig. 7: Segregated Web content

6. Similarly, when the user hits another link, the domain of the link and the user content associated with it are determined.
7. The entire listing is specifically categorized.
8. The above method categorizes the website based on the domain and populates the web user data with respect to it.

C. Generating custom-driven webpages for users through web page classification

Introduction: Dimensionality reduction is a machine learning technique concerned with reducing the parameters of a given entity in order to analyse performance with respect to fewer points of consideration. It is an effective means of determining the effectiveness of an attribute by disabling the other attributes that are out of scope for the given objective function. Dimensionality reduction primarily comes in two flavours:
1. Feature extraction.
2. Feature selection.

Fig. 8: Feature Extraction Using Dimensionality Reduction
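Since the classification methodology later in this section uses the forward-feature variant of feature selection, a minimal greedy forward selection loop is sketched here for reference. The feature names and the additive scoring function are toy assumptions; in the paper's setting the features would correspond to page widgets or attributes and the score to their relevance for the user's purpose.

# Minimal greedy forward feature selection (toy scoring, for illustration only).

def forward_select(features, score, max_features=None):
    # Greedily add the feature that most improves the score of the selected subset.
    selected, best_score = [], score([])
    remaining = list(features)
    limit = max_features or len(features)
    while remaining and len(selected) < limit:
        candidate, candidate_score = None, best_score
        for f in remaining:
            s = score(selected + [f])
            if s > candidate_score:
                candidate, candidate_score = f, s
        if candidate is None:          # no remaining feature improves the score
            break
        selected.append(candidate)
        remaining.remove(candidate)
        best_score = candidate_score
    return selected

if __name__ == "__main__":
    # Toy relevance weights per page widget; the score of a subset is the sum of its weights.
    weights = {"search_bar": 0.9, "algorithm_articles": 0.8, "ads_panel": -0.5, "news_feed": 0.1}
    chosen = forward_select(weights, score=lambda subset: sum(weights[f] for f in subset))
    print(chosen)   # widgets kept active, most relevant first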
Algorithm: Web_Segregation (URL)
//Input: A fully qualified URL with extension .com.
//Output: The category of the website, for example e-learning, product, service, etc.
Step01: Start.
Step02: Accept the URL.
Step03: Determine the path extensions from left to right.
Step04: Construct the training data set; create categories of websites with code points: 11_Knowledge_base, 22_E-Commerce, 33_Software_products, 44_Hardware_products.
Step05: Assign the base value 01_URL to the input.
Step06: Construct the solution graph as follows.
        For code point 01 to 44, repeat in increments of 10:
            determine the URL by name from the training set and assign it to the code point;
            if a matching word is found, return the code word with the name;
            else, add the word to the training set.
        For code point 01 to 131, repeat in increments of 10:
            find the domain provider for the given URL from the training set;
            if the domain provider is found, return the URL with its domain and code point,
            and determine the URL type from the domain type.
Step07: End.

Fig. 9: Web Content Segregation
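A compact Python rendering of the segregation idea in Fig. 9 is given below. The training set of code points, the keyword-lookup matching rule and the example URLs are simplified assumptions made for illustration; they stand in for the solution-graph construction of the original algorithm.

# Simplified sketch of URL-based website segregation (illustrative, not the exact Fig. 9 algorithm).
from urllib.parse import urlparse

# Training set: category code points mapped to known site keywords.
TRAINING_SET = {
    "11_Knowledge_base":    ["wikipedia", "wikiwand", "stackoverflow"],
    "22_E-Commerce":        ["amazon", "flipkart", "ebay"],
    "33_Software_products": ["github", "sourceforge"],
    "44_Hardware_products": ["intel", "nvidia"],
}

def segregate(url):
    # Step 1: take the domain name from the fully qualified URL.
    domain = urlparse(url).netloc.lower()
    # Step 2: match the domain against the training-set keywords of each code point.
    for code_point, keywords in TRAINING_SET.items():
        if any(k in domain for k in keywords):
            return code_point
    # Step 3: unknown site -> register it under a new code point for future categorization.
    new_code = f"{(len(TRAINING_SET) + 1) * 11}_Uncategorized"
    TRAINING_SET.setdefault(new_code, []).append(domain)
    return new_code

if __name__ == "__main__":
    print(segregate("https://www.wikiwand.com/en/Gradient_descent"))   # 11_Knowledge_base
    print(segregate("https://www.example-shop.com/"))                  # new code point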
The proposed concept is implemented by means of the dimensionality reduction technique, wherein the content of the web page is reorganised based on the purpose of the user or application, resulting in a unique web browsing experience for each user with the existing web page content.

Working: The above algorithm implements the methodology of dimensionality reduction with the forward-feature variant in the following manner.
1. The basic elements of a web page are signified by the HTML pages and their attributes.
2. The purpose of the above methodology is to reorganise the position of the various HTML widgets.
3. This ensures that a generic home page is turned into a custom user page based on the purpose.
4. The idea is to disable some contents of the page so that only contents with the relevant attributes are active at any given instance of time.
5. To understand the custom need, the past browsing experience and browsing preferences are considered.
6. The outcome is a tailored web surfing experience for each user.

Results: The proposed concept is devised to overcome the shortcomings of the existing methodology and to lay out an effective web content management framework in all respects, which will bring about a comprehensive change in how the entire web content management scenario functions; moreover, the integration of the Semantic Web will enable intelligent web concepts in a more promising manner. The implementation of the proposed concepts is as follows: depending upon the algorithms proposed for the above objectives, the results were obtained using the open-source tool Web Scraper. The results for the above concepts are explained as follows.

1. The picture below depicts the home page of the wikiwand website.

Fig.10: Web Content Classification

2. The user had logged in to the website and, as a result, the user preferences were understood.
3. Based on the preferences, it was understood that the user had more interest in learning about various algorithms.
4. The home page of the user is populated with only those contents which are of interest to him.
5. The point to be focussed on here is that no generic home page was displayed; rather, the contents of the web page were reorganised as per the preferences.

Algorithm: Web Class (URL)
//Input: A fully qualified URL with extension .com.
//Output: A web page with reorganised page content based on the custom preference.
Step01: Start.
Step02: Load the home page for the user.
Step03: Prompt the user with a random question set from the training set, such as "What do you want today?" or "What is the mood today?".
Step04: Fetch the values from the user and repeat Step03 until the user perspective is understood (as a value set upon a qualifying parameter).
Step05: For the same website, with no change, reorganise the contents with respect to the understood behaviour.
Step06: Reorder the flow of the web page and the elements of the web page (logic based on dimensionality reduction).
Step07: Allow the user to perform the operations as expected; if any variation in the user behaviour is noticed, pop up questions to the user as in Step03.
Step08: Store the user nature in the access data file, and load the next home page with some of the past details.
Step09: End.

Fig. 11: Control flow of Web Content Classification
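The sketch below captures the gist of the flow in Fig. 11 in Python: page elements are scored against a stored user-preference profile, low-relevance elements are dropped and the rest are reordered, which corresponds to the dimensionality-reduction step in Step06. The element/tag data shapes, the overlap score and the threshold are assumptions made for the example, not the paper's implementation.

# Illustrative preference-driven page reorganisation (assumed data shapes, not the paper's code).

def score_element(element, preferences):
    # Relevance of a page element = overlap between its tags and the user's preference tags.
    return len(set(element["tags"]) & set(preferences))

def reorganise_page(elements, preferences, threshold=1):
    # Keep only elements that meet the relevance threshold, ordered most relevant first
    # (the "disable out-of-scope attributes" step of the dimensionality-reduction view).
    ranked = sorted(elements, key=lambda e: score_element(e, preferences), reverse=True)
    return [e["name"] for e in ranked if score_element(e, preferences) >= threshold]

if __name__ == "__main__":
    elements = [
        {"name": "algorithms_section", "tags": ["algorithms", "learning"]},
        {"name": "sports_feed",        "tags": ["sports"]},
        {"name": "search_box",         "tags": ["learning", "navigation"]},
    ]
    # Preferences as gathered by the prompts in Step03/Step04 and the stored access-data file.
    preferences = ["algorithms", "learning"]
    print(reorganise_page(elements, preferences))   # ['algorithms_section', 'search_box']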
VII. CONCLUSION

Web content plays an integral part in the overall categorization of web users into different segments, thereby allowing them to access the web in a much simpler and more precise way. Web content personalisation enables us to segregate web content in a number of ways, which can be used to analyse each user's web log data and infer important patterns from it in order to predict future web access; this can be accomplished using a number of data mining and machine learning techniques. Web content optimization can be used to filter web content in a much more enhanced manner, which further increases the web content classification rate up to a desired level of accuracy. All these approaches lead to the creation of a personalised web for each and every user, presenting the web content that matters to them the most and eliminating irrelevant and generic content.

REFERENCES

[1] C. D'Amato, N. Fanizzi, and F. Esposito, "Query answering and ontology population: An inductive approach", The Semantic Web: Research and Applications, 5th European Semantic Web Conference, ESWC 2008, Tenerife, Canary Islands, Spain, June 1-5, 2008, Proceedings, LNCS 5021, pp. 288-302.

[2] Rifat Ozcan, Ismail Sengor Altingovde and Ozgur Ulusoy, "In Praise of Laziness: A Lazy Strategy for Web Information Extraction", 2012, LNCS 7224, pp. 565-568.

[3] An, Y., Borgida, A., and Mylopoulos, J., "Refining Semantic Mappings from Relational Tables to Ontologies", Second International Workshop on Semantic Web and Databases, 2004, pp. 84-90.

[4] B. Fazzinga and T. Lukasiewicz, "Semantic Search on the Web", Semantic Web – Interoperability, Usability, Applicability, 2010.

[5] Bizer, C., Cyganiak, R., Garbers, J., and Maresch, O., The D2RQ Platform. http://www4.wiwiss.fu-berlin.de/bizer/d2rq/

[6] An, Y., Borgida, A., and Mylopoulos, J., "Refining Semantic Mappings from Relational Tables to Ontologies", Second International Workshop on Semantic Web and Databases, 2004, pp. 84-90.

[7] C. Bizer, T. Heath, and T. Berners-Lee, "Linked Data – the story so far", International Journal on Semantic Web and Information Systems, 5(3):1-22, 2009.

[8] Brijendra Singh and Hemant Kumar Singh, "Web Data Mining Research: A Survey", 2010, ©2010 IEEE.

[9] Chebotko, A., Lu, S., Jamil, H. M., and Fotouhi, F., "Semantics Preserving SPARQL-to-SQL Query Translation for Optional Graph Patterns", 2006, Technical Report TR-DB-052006-CLJF.

[10] Chen, H., Wu, Z., Wang, H., and Mao, Y., "RDF/RDFS-based Relational Database Integration", 22nd International Conference on Data Engineering, 2006.

[11] Jerome Robinson, "Data Extraction from Web Data Sources", Proceedings of the 15th International Workshop on Database and Expert Systems Applications (DEXA'04), 2004, IEEE.

[12] C. Murray and N. Alexander, "Oracle Spatial Resource Description Framework (RDF)", 2005, 10g Release 2 (10.2).

[13] Erling, O., and Mikhailov, I., "RDF Support in the Virtuoso DBMS", 2007, 1st Conference on Social Semantic Web, pp. 1617-5468.

[14] Han, L., Finin, T., Parr, C., Sachs, J., and Joshi, A., "RDF123: From Spreadsheets to RDF", International Semantic Web Conference, 2008, LNCS 5318, pp. 451-466.

[15] Harris, S., "SPARQL Query Processing with Conventional Relational Database Systems", 2005, International Workshop on Scalable Semantic Web Knowledge Base Systems.

[16] J. Lu, L. Ma, L. Zhang, J-S. Brunner, C. Wang, Y. Pan, and Y. Yu, "SOR: A Practical System for OWL Ontology Storage, Reasoning and Search", in Proc. of VLDB 2007.

[17] Jie Zou, Daniel Le and George R. Thoma, "Structure and Content Analysis for HTML Medical Articles: A Hidden Markov Model Approach", DocEng'07, 2007, ACM 978-1-59593-776.

[18] Chong, E. I., Das, S., Eadon, G., and Srinivasan, J., "An Efficient SQL-based RDF Querying Scheme", 2005, VLDB.

[19] J. Melton, "SQL, XQuery, and SPARQL: Making the Picture Prettier", in Proc. of XML 2006.

[20] L. Ding, J. Shinavier, T. Finin, and D. L. McGuinness, "An empirical study of owl:sameAs use in Linked Data", 2010, in Proceedings of WebSci10: Extending the Frontiers of Society On-Line.

[21] J. S. Brunner, L. Ma, C. Wang, L. Zhang, Y. Pan, and K. Srinivas, "Explorations in the use of Semantic Web Technologies for Product Information Management", 2007, pp. 747-756.

[22] Meera Alphy and Ajay Sharma, "Study on online community user motif using web usage mining", Journal of Physics: Conference Series 710 (2016) 012015, doi:10.1088/1742-6596/710/1/012015.
[23] M. M. Wood, S. J. Lydon, V. Tablan, D. Maynard and H. Cunningham, "Populating a Database from Parallel Texts Using Ontology-Based Information Extraction", 2004, LNCS 3136, pp. 254-264.

[24] Perez de Laborda, C., and Conrad, S., "Bringing Relational Data into the Semantic Web using SPARQL and Relational.OWL", 2006, 22nd International Conference on Data Engineering Workshops.

[25] Raghu Anantharangachar and Ramani Srinivasan, "Semantic Web techniques for yellow page service providers", 2012, International Journal of Web & Semantic Technology (IJWesT), Vol. 3, No. 3.

[26] Richard Vlach and Wassili Kazakos, "Using Common Schemas for Information Extraction for Heterogeneous Web Catalogs", ADBIS 2003, LNCS 2798, pp. 118-132.

[27] S. Auer, "Making the web a data washing machine – towards creating knowledge out of interlinked data", Semantic Web – Interoperability, Usability, Applicability, 2010.

[28] T. Eiter, G. Ianni, T. Lukasiewicz, R. Schindlauer, and H. Tompits, "Combining Answer Set Programming with Description Logics for the Semantic Web", Artificial Intelligence, 2008, 172(12-13):1495-1539.

[29] Pengpeng Zhao, Chao Lin, Wei Fang and Zhiming Cui, "A Hybrid Object Matching Method for Deep Web Information Integration", 2007, International Conference on Convergence Information Technology, DOI 10.1109/ICCIT.2007.185, ©2007 IEEE.

[30] Hsin-Chang Yang and Chung-Hong Lee, "Mining Unstructured Web Pages to Enhance Web Information Retrieval", 2006, Proceedings of the First International Conference on Innovative Computing, Information and Control (ICICIC'06).

