4Z-Topic 8 - ITC 571 - Week 10

Student Name & CSU
ID
Project Topic Title Analysing the website and measuring the web trends using web crawler based on page tagging for
increasing the productivity of businesses
1|Page
Evaluation & analysis Table week 6
Name- Evaluation of models based on their Customer behaviour prediction:
Purpose- To evaluate the best one solution which is able to detect the consumer behaviour based on the web measuring trend using
the web crawler with aim of increasing the business productivity.
Analytics
Application
Accuracy
base
Customer classification
e-shopping satisfaction
Engagement of URL
Visualization based
Purchase intention
Model Tools
Reference Technique
No. used
Review based
Personal sites
Engagement
Social media
Statistical
Real-time
Web/Data
Social
High
Low
Simulatio
Suchacka&Wo Machine n model
1.     
tzka, 2017 learning
Liu, et. al., Virtual

2. ARIMA   
2018 reality
Service-
3. Li et al.,2016 Polar Hub   oriented  
architectur
2|Page
e
Social
Social media media
Stieglitz et analytics
4.   data and  
al.,2018
4Vs of the
big Data
Google
exploits
DNS weighted
García-Dorado, footprints keywords,
5.    
et. al., 2018 Big Data,
DNS
Cache
KEY
value
store,
6. Deka, 2018 NoSQL       Hadoop,H       
ypercat,
metacrawl
er
collaborati
ve
Hwangbo, et. filtering
7. K-RecSys     
al.,2018 algorithm
8.   WordNet,  
Ireland & Liu, Online product  
Part-of-
3|Page
2018 review Speech
Tagger,
Pling
Stemmer,
Intelligent soft Java API,

Rekik, et. al., computing Search
9. technique  engine 
2018
Balbi, et. Web 2.0 Social

10.     
al.,2018 media
User interest- web graph

Saleheen based web visualizati
11. graph (UIWG).     on 
&Lai,2018
Web 2.0 social Social

12. Alalwan, 2018 media  media ads 
 
Big data fingerprint

Kobusinska, et. fingerprint ing
13.    
al., 2018 analytics
tool
grey
14. Liu, 2018 fuzzy semantics   
situations
and text
and fuzzy
4|Page
mining (GFuzzy)
Resource
dependenc
e theory
hybrid content (RDT)
15. Wu &Lin,2018    
analytics Innovation
diffusion
theory
(IDT)
Social
media and
Duarte, et. al., e-WOM
16.   social  
2018 service
Psychoph
ysiology,
sophisticat
Ciechanowski, ed bots
17. chat bot       
et. al., 2018 and social
robots
Fatehkia, et. Facebook ads Digital

18.     trace of 
al., 2018
data,
5|Page
regression
model
Neural
network
Serrano Neurocomputin and
19. &Gelenbe, g    Intelligent
2018 Internet
Search
Assistant
e-mail tracking Machine
Haupt, et. al., technology
20.   learning  
2018
algorithm
Neuromarketin Neuromar
Eye-tracking keting,
g, website 2.0,
21. technology  website   
Area of internet
2.0, Area
of interest
Home
Gamification-
Area
based
AlSkaif, et. al., Network,
22. residential   
2018 smart
customers
meters,
mobile
and web
6|Page
applicatio
ns, data
analytics
Customer churn Churn

Amin, et. al., prediction
23. predication(CP    
2018
P)
Single source
panel data Latent-
Nakano & Class
24. based    
Kondo, 2018 Cluster
segmentation Analysis
Social
media
typology of
intelligenc
social media
25. Lee, 2018   e, social  
analytics
media
analytics
Result: The proposed table concludes the best two solutions that is one is based on the NoSQL based application in crawling and the
another one based on chat bot for the human interaction
7|Page
Name- Evaluation table based on performance based criteria
Purpose- to evaluate the performance of various models having precise customer behaviour predication, considering
systemavailability, usability, reliability etc. as parameter for analysis the best sought out techniques solution.
Availability Reliability Usability Tracking Reporting
Personalization
Model
documentation
Authorization
Completeness
Reference
Page viewing
E-commerce
Accessibility
Consistency
no.
auditability
Definition/
credibility
Meta data
timeliness
Accuracy
Integrity
visiting
E-mail
17 Ciechanowski, et.
     
al., 2018
11 Saleheen
   
&Lai,2018
12 Alalwan, 2018   
8 Ireland & Liu,

     
2018
6 Deka, 2018        
10 Balbi, et. al.,2018   
18 Fatehkia, et. al.,

   
2018
8|Page
16 Duarte, et. al.,
  
2018
13 Kobusinska, et.
   
al., 2018
5 García-Dorado, et.
  
al., 2018
4 Stieglitz et
      
al.,2018
1 Suchacka&Wotzk
  
a, 2017
7 Hwangbo, et.
  
al.,2018
20 Haupt, et. al.,

      
2018
14 Liu, 2018  
Result-From the above best screen out solution further analysis is made and thus it conclude the method of social media analytics for
the scrape the behavioural data and second and again NoSQL based solution is also found to be better.
9|Page
Name- Business enhancement based evaluation table
Purpose- From the above concludes analysis further analysis is being evaluated by considering the concept of enhancement of
business productivity in respect with the desired area.
Rate of Maintain
Volume of data
Web quality interest backend
retrieval
(ROI) data
Model
Reference Dynami High Low More Less
no Custom
Real Static c Retailer
er
time changes predicat based
based
ion
18 Fatehkia, et. al.,

   
2018
20 Haupt, et. al.,

  
2018
4 Stieglitz et
  
al.,2018
6 Deka, 2018     
8 Ireland & Liu,

  
2018
1 Suchacka&Wotzk
 
a, 2017
7 Hwangbo, et. al.,  
10 | P a g e
2018
20 Haupt, et. al.,

   
2018
Result- From the above analysis it is concluded that the NoSQL based application for crawling is best model solution for the proposed
research due to its characteristics of maintain backend data management and real time enhancement of productivity.
11 | P a g e
Justification
To critically analyse the proposed justification, an evaluation table is being prepared with respect to the web analytics measurement in
the web trend by using the web crawler. The proposed scheme is justifies by undertaking the three evolution criteria. In the first
evolution table based on the predication of the customer behaviour which is the main goal of the research, proposed table concludes
the best two solutions that is one is based on the NoSQL based application in crawling and the another one based on chat bot for the
human interaction. In the second evolution is being made based on the performance criteria which gives the method of social media
analytics for the scrape the behavioural data and second and again NoSQL based solution is also found to be better. At the end of the
evolution analysis is made based on the enhancement of business productivity, which emphases that NoSQL based application is quite
suitable.
Thus, from the above justification, NoSQL based application in crawling will be considers appropriate for the proposed scheme.
12 | P a g e
State of Art- Current best solution
From the above-presented analysis tables, this work concludes the best solution for the presented paper solution best on the NoSQL
based application for the crawling web trend (Deka, 2018). The presented work is applicable in storing the crawler database and
avoids the duplication of the URL. In the above mention hash tables, Map reduced based framework for the NoSQL based application
is quite appropriate for measuring the web trend. The above-presented table evaluates that NoSQL based application for the web
crawling is superior in terms of enhancing the performance, most appropriate for scrapping the consumer behavior and reliable in
terms of improving the quality of the web trend measuring analytics in the manner of incrementing the productivity of the e-commerce
business (Deka, 2018)
13 | P a g e
1. Draw the state of art diagram. Use Blue dotted border for good feature in this work, and the red dotted border for the
limitations of it; the text in the diagram in 'font 8 or 9', and the line spacing inside each box in the diagram should be 'Single',
and the spacing before and after text should be '0, 0' .
Then, write a paragraph to describe the state of art diagram. You need to refer to the figure in your writing.
Map reduced based

NoSQL web crawler
application Illustrated the purpose of the
development of the model
Clarifies the purpose
Making the URL list which will not be reliable
of development
from the URLs
Registering of state of Mapping It evaluates text data

slaves to establish text
index
Collection of the Collection It will avoids the
data from the web duplication of the URL
page which will
Reduce
It rates the process till the
achievement of result
Request the URL

Download URL
Figure1- the above-presented work represent the best current solution for the crawling. Blue dots highlight the feature of the
work and red dots represent the limitation of the work.
14 | P a g e
Analysis and description
The above figure1 represents the processing mechanism of Map Reduced framework of the
NoSQL based database for the crawling application. The process mechanism begins with the
process of determination of the goal and purpose then it will collect the various URL for
crawling and map it in a sense that it will able to avoid the duplication of any URL. The
mapping process also illustrates in registering the state of slaves to crawl the web (Deka,
2018). Process begins with determine what data of the consumer should be scraping and
determine the search engine spider, for their business campaign. After the determination of the
goal of the, user will specify the URL. The mapping process also illustrates in registering the
state of slaves to crawl the web (Deka, 2018). The mapping process also uses the Beevolve Web
Crawler API for monitoring the social media. After the complete registration of the state of the
slaves, now it will follow up the RESTful JSON API in order to obtain the crawling data from
the NoSQL database. Crawling application will start generating the crawling data by merging the
URL list retrieved from the crawler frontier. The main feature of this work is that it will avoid
the duplication of any URL. Reducing stage will enhance the quality of the web by scrapping
the best web pages first. It will also integrate the reducing phase by holding the virtual hosting in
the IP address (Deka, 2018).
15 | P a g e
2. Describe the Stages/Parts of the system.
Determination of the goal of E-commerce business

At this phase, e-commerce business should plan a vision of their business motive of the target
customer, determine what data of the consumer should be scraping and determine the search
engine spider, for their business campaign (Deka, 2018). In the above framework, a commercial
search engine spider is generated which is liable to scrape the new resources in the net in order to
enhance the business productivity.
Specifying the URL for crawling

After the determination of the goal of the, user will specify the URL which has to be crawled via
the Graphical User Interface of the crawler. In that phase, the entire URL is being collected and
mold in the crawler frontier and specifies the URLs list.
Mapping process
After specifying the URL list generated from the GUI crawler, spider search engine will start
generating the URL and making the list of the URL from the retrieved crawler frontier and will
allocate all the links presented in the URL list campaign. The mapping process in the crawling is
based on the human-machine interaction in the new grey area of the It. The mapping process also
illustrates in registering the state of slaves to crawl the web (Deka, 2018). The mapping
process also uses the Beevolve Web Crawler API for monitoring the social media. After the
complete registration of the state of the slaves, now it will follow up the RESTful JSON API in
order to obtain the crawling data from the NoSQL database. These processes will also use the
80legs crawler for cost-effective distribution of the web crawler in the web measuring.
Collection of the crawling data

After the successful implementation of the mapping process, the crawling application will start
generating the crawling data by merging the URL list retrieved from the crawler frontier. The
main feature of this work is that it will avoid the duplication of any URL. Again, it will start
adjusting the Repositories of the URL and the documents. The query, which is to be inserted in
the crawling, is likely to be HTML and RDF and finally stored the extracted links retrieved from
the crawler frontier.
16 | P a g e
Reducing stage
After storing the cumulative crawling for spider search engine, it will start reducing mechanism,
which will request the URL from the various web pages and will reduce it by downloading the
URL in search engine. Further, it will go on in order for addressing and rates the process until the
achievement of the result. It will make use of the semantic based web data retrieval, which
will be liable for enhancing the data storage of the spider search engine also increase the
processing power of the web crawler. Reducing stage will enhance the quality of the web by
scrapping the best web pages first. It will also integrate the reducing phase by holding the virtual
hosting in the IP address (Deka, 2018).
Thus from the above classifies stages, it is concluded that web crawler application should be
actually fit with the map reduced based programming model. Below there is mechanism,
illuminated in figure 2 & 3, for the map reduced programming model and illustrates the process
criteria of the crawler frontier.
17 | P a g e
Downloading
Extracting links Crawler Frontier
Storing extracted links
Figure-2 crawl the web pages
Determination of the goal of E-commerce business
Specifying the URL for crawling
Mapping process
Collection of the crawling data
Reducing stage
Figure 3- map reduced based programming model
18 | P a g e
3. Based on the purpose of the stage you are work on it, you need to provide the steps that
used to reach the goal. You also need to mention the important of the state of art
features in your project. You also need to mention the limitations if they are available in
this stage. You also need to mention why this part has got this limitations (JUSTIFY).
Mapping process
The mapping process in the crawling is based on the human-machine interaction in the new
grey area. The mapping process also illustrates in registering the state of slaves to crawl the web
(Deka, 2018). The mapping process also uses the Beevolve Web Crawler API for monitoring the
social media. After the complete registration of the state of the slaves, now it will follow up the
RESTful JSON API in order to obtain the crawling data from the NoSQL database. These
processes will also use the 80legs crawler for cost-effective distribution of the web crawler in the
web measuring.
Reducing stage
It will go on in order for addressing and rates the process until the achievement of the result. It
will make use of the semantic based web data retrieval, which will be liable for enhancing the
data storage of the spider search engine also increase the processing power of the web crawler.
Reducing stage will enhance the quality of the web by scrapping the best web pages first. It will
also integrate the reducing phase by holding the virtual hosting in the IP address (Deka, 2018).
However, in the core process between the mapping and reducing stage there is a fear of
malicious attack whiles the collection of URL from the URL list.
19 | P a g e
4. Describe the system/model and the output of it, and clarify if the output is accepted in
your project domain or not.
Then you should provide the limitation of it, and where and why this limitation occurs
(Analyse and JUSTIFY).
Map reduced based framework for the NoSQL based application is quite appropriate for
measuring the web trend. It is able to scrape the consumer behaviour accuracy increased up to
30-40%. NoSQL based database will result in better utilisation of crawler frontier in order to
maintain the spider search engine. .However, it seemed that crawler facing the core issue in
providing the security concern in each platform. NoSQL database is not a specified programming
interferes but, the adoption of Hadoop Talent Bin and Key value store for the database backend
solution and big data volumeis illustrated which is able to connect them with the DNS server.
Honeypot computing is will be adopted for detecting the unauthorized malicious attacks.
20 | P a g e
5. You also need to draw the logical flow diagram of it.
Algorithm- Predication based UDT and LTD, training process set and distance factor
Input- matrix with m × n elements for defining the training set
Output- predicting result based on UDT and LTD predication in the social media analytics.
Step1: BEGIN
Step2: Get the resource prediction on the basis of the input parameter
Step3: data pre-processing stage
Step 4: enabling the distance factor based matrix
Step5: selection of sample based on the
W= I, where i= (100,200,300……n)
Step6: defining the issue based on,

𝑛𝑥 𝑛(𝑛−1)𝑥 2
Accuracy- TP+TN(1 + 𝑥)𝑛 = 1 + + +⋯
1! 2!
Step7: defining the training process,
T=w+1..(N-w-1)
Step8: validation phase of L and U
L=1..W and U=N-w….N
21 | P a g e
Start
Input thematrix with m × n elements

for defining the training set
Data pre-processing stage
Distance factor enabling Assuming A

as the matrix
TP+TN(1 +
𝑥)𝑛 = 1 + Defining accuracy
𝑛𝑥
+
1!
𝑛(𝑛−1)𝑥 2 𝑣1
+⋯
2! ⬚
A൦ ൪= vij×Rm*n
⬚
Defining the training process,
… 𝑣1, 𝑛
T=w+1... (N-w-1)
Figure4-a flowchart of churn prediction in the social media analytics
22 | P a g e
Appendix
For the equation
T= training set
W= web crawl
N= no of the training set
𝑛𝑥 𝑛(𝑛−1)𝑥 2
TP+TN(1 + 𝑥)𝑛 = 1 + + + ⋯ = accuracy determination based on training set and
1! 2!
benchmark framework
For abbreviation
HTTP= Hypertext transfer protocol
URL= Uniform Resource Locator
IP = Internet protocol
GUI = Graphical User Interface
LTD and UTD= Lower transfer distance and Upper transfer distance
23 | P a g e
Week-8
1. Give introductory about the idea of the proposed solution that comes from the features of
the first and second/third best solutions.
After evaluating various technologies for crawling based application for scrapping the consumer
behaviour, this research work had highlighted the pros and cons. of the various methods based on
the issue of efficiency, accuracy, reliability, traceability and statistical approaches.
In this research, NoSQL presented by Deka. (2018) found to be the most effective application for
crawling web trends. It is having the feature of avoiding copies or duplication of URLs, which
helps to increase web trends. A NoSQL technique enabled businesses is used in measuring
analytics and helps to increase the productivity of e-commerce businesses (Deka. 2018).
Though state of art solution has many advantages, the security concern is the most focused
limitation that occurred on each platform. Here it has also seen that machine-human interaction
is required in a way to get the ratings on the pages. Hence, to overcome the limitations which
has occurred in the state of art solution, Hadoop talent bin and honeypot computing is
proposed, this method is far different and has a significant work contribution for the above
limitation. Proposed technology is having features of defending malware attacks seemed at the
time of acquisition of the URL from the URL list from the crawler frontier.
24 | P a g e
2. Draw the proposed solution diagram. Take the copy of the state of art diagram, and only
change the text of the placed that you enhanced to propose the new solution. Use Green
dotted border for new feature in this work, and remove the red dotted border; the text in
the diagram in 'font 8 or 9', and the line spacing inside each box in the diagram should be
'Single', and the spacing before and after text should be '0, 0’. Then, write a paragraph to
describe the state of the art diagram. You need to refer to the figure in your writing.
25 | P a g e
26 | P a g e
Figure (3) - Introduction of proposed feature for limitation in state of art
3. Describe the proposed system diagram; also refer to the figure number in your describing.
What are the components of the proposed system, and refer to the diagram? Show the
NEW features and workflow of the proposed system compared with the state of art.
In the figure (3), features of proposed and state of art solutions are given. At the initial stage,
mapping is done through the three stages that are based on criteria of URL list preparation,
state of slave registration and text data evaluation. Further, data collection procedure takes
place. In which all the data is collected. The proposed system consists of three major stages (Fig.
3) mapping, de-duplication from the NoSQL and reducing.
The research is focused on the elimination of chances of external malicious attacks. To

overcome the limitation, honeypot method has proposed which mentioned in the figure.
Honeypot method enables system to prevent malicious attacks occurred at the time of
transmission of URL to the registered slave for the Site promotion.
Proposed method routed the attacks to the replicas of system, which has prepared virtually to
prevent the malicious attacks. In the process, if the attacks found malicious it redirected to the
virtual system. On the other hand, if attack found normal than it gone for the de-duplication
process. Here, state of art solution eliminates the chances of duplication of URLs. Reducing
stage will later on starting the initialising for the promotion of URL at the top of the search
engine for the e-commerce business website.
27 | P a g e
4. What are the new proposed components that you Modified and Enhanced. What is the
purpose of each modified and enhanced, and the impact of each on your results (how
each solved the limitations in state of art).
In the 4thstage, the proposed method Honeypot has proposed. It is found that the honeypot
method is effective in eliminating the chances of malicious attacks, which is generated from the
attacker end and finish at the system. Honeypot method is proposed in the way to reduce the
chances of external malware attacks that reduce the overall efficiency of the system. Honeypot
method eliminates the malicious attacks by routing them to the virtual system instead of actual
system.
In the Honeypot method stage, external attacks are focused and the proposed method route the
attacks to the virtual system rather than the actual system. Current best solution found
ineffective in a way to prevent external malicious attacks. However, Honeypot might be an
appropriate choice for overcoming the limitation. Hadoop method provides feature of detecting
and control the crawling of the various web pages. It has found that there are high chances of
crawling on web pages (Amin et al. 2018).
28 | P a g e
5. AREA OF IMPROVEMENT: Show how many components you modified, and how
you modified, what is the importance of each modified equation and what has solved.
For justification of this research, proposed solution is consisting of an equation.

Honeypot method eliminates the malicious attacks by routing them to the virtual system
instead of actual system. The proposed equation is based on attributes of de-duplication
of URL and the attributes of an eliminating of malicious attack. Both are considered
below-
URL de-duplication equation-

𝑝𝑐 = 𝑝𝑟 {𝑑𝑒𝑙𝑎𝑦 ≥ 𝑡𝑛𝑜𝑤 − 𝑇𝑐 }
∞
1 (𝑥 − 𝜇𝑐 )2
= ∫ 𝑒𝑥𝑝 (−
𝜎𝑐 √2𝜋 𝑡𝑛𝑜𝑤 −𝑡𝑐 2𝜎𝑐2
Eliminating of malicious attack-
𝒅(𝒑, 𝒒) = √(𝒑𝟏 − 𝒒𝟏 )𝟐 + (𝒑𝟐 − 𝒒𝟐 )𝟐 + (𝒑𝟑 − 𝒒𝟑 )𝟐
(𝒑𝟏 − 𝒒𝟏 )𝟐 + (𝒑𝟐 − 𝒒𝟐 )𝟐 + (𝒑𝟑 − 𝒒𝟑 )𝟐

𝒅(𝒑, 𝒒) =
𝟐
The above-mentioned equation will focus on security and de-duplication of URLs
However, In the 4thstage, the proposed method Honeypot has proposed. It is found that the
honeypot method is effective in eliminating the chances of malicious attacks, which is generated
from the attacker end and finish at the system. In the stage, external attacks are focused and the
proposed method route the attacks to the virtual system rather than the actual system. Current
best solution found ineffective in a way to prevent external malicious attacks. Current best
solution found ineffective in a way to prevent external malicious attacks. However, Honeypot
might be an appropriate choice for overcoming the limitation. Hadoop method provides the
feature of detecting and control the crawling of the various web pages. It has found that there are
high chances of crawling on web pages (Amin et al. 2018).
29 | P a g e
6. Show the CONTRIBUTION of your proposed system, and the IMPORTANT (WHY) of it.
Also, compare your proposed solution with the state of art to show the contribution.
Hadoop method –
Hadoop method provides feature of detecting and control the crawling of the various web pages.
It has found that there are high chances of crawling on web pages (Amin et al. 2018). It also
enabled users to explore complex data and assisting in semi-structure data analysis. This method
used in two phases those are data distribution and process of isolation. In the state of art solution,
limitation such as duplication of URLs was covered and eliminated. This increase the efficiency
of websites as URLs duplication slightly reduced.
In the state of art solution NoSQL based algorithm proposed to manage web crawling for its
optimization. Further, NoSQL based crawling application also helps to reduce the number of
URL duplication, which causes the issue of web site inefficiency.
Limitations of URLs duplication and security issues have recognised the issues associated with
the proposals. Methods have been evaluated to overcome the limitation of malicious attacks and
honeypot method found as the most effective method to increase the security of systems.
Honeypot method has the advantage of redirecting malicious attacks to the virtual systems
(Avery & Wallrabenstein. 2018). Although, systems are found insecure from the external attacks
in the previous methods Honeypot method enables users to manage their systems by developing
replicas of actual system.
Due to high accuracy of proposed Honeypot system, future systems could be safe from the
external attacks with the low cost of resources.
30 | P a g e
7. Provide a comparison table between your proposed and state of art solutions. This table
should be built based on what you described in point 6.
Basis of differentiation Proposed solution State of Art solution
Name of technique Honeypot NoSQL based application
Level of accuracy Level of accuracy is more It has the high accuracy in

superior because of initialising of scrapping consumer behaviour
virtual system, the accuracy level with a range of up to 7-9
it seemed to be alignment in
range of 8 out of 10
Feature of work Create a virtual system for Map- reducing based

eliminating any malicious attack framework for avoiding any
duplication of URLs.
Limitation It can unfilter and harm the e- Security and malicious attack
commerce website once engaged
in attacking.
Contribution-1 Overcoming the limitations of NoSQL based crawling
URLs duplication and security application also helps to reduce
issues have recognised the issues the number of URL duplication,
associated with the proposals. which causes the issue of website
inefficiency.
Contribution 2 Creating a de-duplication and Adopting Hadoop talent bin and
eliminating malicious attack key-value store attributes.
equation.
31 | P a g e
8. You need to draw the logical flow diagram of it.
Algorithm- Invertible Bloom lookup table
Input- p and q as number of instruction and branched respectively
Output- (𝑡𝑎𝑘𝑖𝑛𝑔 𝑜𝑢𝑡𝑝𝑢𝑡 𝑜𝑓 𝑝 𝑎𝑛𝑑 𝑞 𝑎𝑠 𝑖𝑛𝑡𝑟𝑢𝑐𝑡𝑖𝑜𝑛) or 𝑛𝑢𝑙𝑙
Step1: BEGIN
Step 2: 𝑤ℎ𝑖𝑙𝑒(0 < 𝑝 𝑎𝑛𝑑 𝑞 <= 1)𝑑𝑜
Step 3: 𝑖𝑓
𝑑(𝑝, 𝑞) = √(𝑝1 − 𝑞1 )2 + (𝑝2 − 𝑞2 )2 + (𝑝3 − 𝑞3 )2
Step 4: checking process for malicious attack
Step 5: 𝑖𝑓 𝑎𝑡𝑡𝑎𝑐𝑘 𝑖𝑠 𝑚𝑎𝑙𝑖𝑐𝑖𝑜𝑢𝑠 = 𝑐ℎ𝑒𝑐𝑘
Step 7: system redirection phase.
Step 8: END
32 | P a g e
Figure (4) – Generic flow diagram of proposed system
33 | P a g e
3. Expected Results and Discussion
I. Positive outcomes and implication
The proposed model has positive outcomes after the implication of a web crawler in an e-
commerce business in order to scrape consumer for increasing business productivity. The
system has includes various tools and techniques which emerged as enhance tool to improve the
crawling efficiency for scraping in Big web data with more accurately with respect to the
tentative customers (Deka. 2018). Web crawling assists business for instead data gathering
instead of manual efforts, aids in order to compare business with the competitors business by
scrapping of competitors website. A web crawling is a good measure for a business side for
Neuromarketing approaches by adopting an updated database of tentative audience (Wu et al.
2018). Proposed scheme will indulge NoSQL based application that is including Hadoop talent
bin, key-value store, map-reduce framework and Apache Cassandra for scraping the wen data in
the proposed spider search engine (Haupt et al. 2018). Another tool that the proposed model will
be including is Honeypot model which confronts the security of URL database and fascinate
rapid attack over the jamming and malicious attack. Honeypot method has the advantage of
redirecting malicious attacks to the virtual systems. Although, systems are found insecure from
the external attacks in the previous methods Honeypot method enables users to manage their
systems by developing replicas of actual system (Deka. 2018).
34 | P a g e
II. Issue/challenges with implementation
The proposed solution is dealing with many issues after implementation of various tools
and technique in one solution. The proposed scheme is including honeypot model which
is able to redirect the attack to a Virtual device, it performs its task based on developing
replicas of actual system (Amin et al. 2018). However besides eliminating malicious
attack and ensuring security, Honeypot model have limitation based on network
forensics. Some expert concludes that the implication of honeypot is found to be an
unethical issue. Honeypot model also creating the disadvantage of enhancing Hijacking
activity or they are building hacker better. Since hackers are trying many activities to
train them and know more about the concept of honeypot model, thus for a firm honeypot
model is neither be an appropriate choice for security (Ireland & Liu. 2018). Virtual
system may compress the actual data. URL in message is broken when honeypot
establishes virtual device. (Wu et al. 2018).
35 | P a g e
Conclusion
Reiterate the purpose of the research. Summarise results/findings. Acknowledge limitations of
the research focusing on methodology, the model and implementation
The research has improved the scrapping accuracy of crawler by up to 8-10&% through
implanting proposed model of NoSQL application with Honeypot model. The study has focused
on the implementation of NoSQL based crawling application in order to provide database storage
to the crawler frontier in order to avoid duplication of any URL.
A web crawler acts as the spider in the era of web. A web crawler search web content based on
its crawling. It acts as the automated scripted program that is browsing the web in a systematic
order. A web crawling is a good measure for a business side for Neuromarketing approaches by
adopting an updated database of tentative audience.
The main of this research paper is to develop an adoption of web crawler application in the
modern web analytics tools in for website measuring in order to increase business productivity.
The implemented scheme is including NoSQL based crawling application. Map-reduce based
framework evaluated as the appropriate method that avoids any duplication of URL in the
crawler frontier.
Further, this research work found security and malicious attack problem in the current best
solution. To overcome these issues a proposed solution is being introduced by implanting
Honeypot model that of redirecting malicious attacks to the virtual systems. However further it
is also being concluded that honeypot model too engaged with many technical and ethical issues
but they can be overcome future by introducing various tools and techniques such as VMware,
improving detection configuration.
The proposed model provides the accuracy in scrapping consumer behaviour is almost 8-10%
and crawler efficiency is also being increased by 10%. Proposed model is more reliable than the
current best solution in order to increase the level of security by 15%.
36 | P a g e
Future Work
Suggest areas of research and the future direction. What needs to be done as a result of your findings
focusing on the weaknesses identified
A web crawler acts as the spider in the era of web. A web crawler search web content based on
its crawling. It acts as the automated scripted program that is browsing the web in a systematic
order. Future of web crawler is bright as it assists in the scrapping of web pages for fetching and
extraction.
In this research paper, the capability of web crawler in web analytics is being considered in the
eye for increasing business productivity. The implementation of crawler in e-commerce website
seems some weakness such as complex big data, customer engagement for online shopping, web
rating, scraping consumer behavior and classifying tentative customer, complexity in
Neuromarketing and security and privacy of the consumer.
Web crawling is a going on a project for e-commerce business sight of view. Social analytics,
web analytics, NoSQL application, cyber-infrastructure, eye tracking technology, and
Neuromarketing evaluated as the best methods that can be adopted in the future for increasing
crawling effectiveness in future
37 | P a g e
References
Alalwan, A. A. (2018). Investigating the impact of social media advertising features on customer
purchase intention. International Journal of Information Management, 42, 65-77.
AlSkaif, T., Lampropoulos, I., van den Broek, M., & van Sark, W. (2018). Gamification-based
framework for engagement of residential customers in energy applications. Energy
Research & Social Science, 44, 187-195.
Amin, A., Al-Obeidat, F., Shah, B., Adnan, A., Loo, J., & Anwar, S. (2018). Customer churns
prediction in telecommunication industry using data certainty. Journal of Business
Research.
Avery, J., &Wallrabenstein, J. R. (2018).Formally modeling deceptive patches using a game-

based approach. Computers & Security, 75, 182-190.
Balbi, S., Misuraca, M., &Scepi, G. (2018).Combining different evaluation systems on social
media for measuring user satisfaction. Information Processing & Management, 54(4),
674-685.
Ciechanowski, L., Przegalinska, A., Magnuski, M., &Gloor, P. (2018). In the shades of the
uncanny valley: An experimental study of human–chatbot interaction. Future Generation
Computer Systems.
Deka, G. C. (2018). NoSQL Web Crawler Application.In Advances in Computers (Vol. 109, pp.
77-100).Elsevier.
Duarte, P., e Silva, S. C., & Ferreira, M. B. (2018). How convenient is it? Delivering online
shopping convenience to enhance customer satisfaction and encourage e-WOM. Journal
of Retailing and Consumer Services, 44, 161-169.
Fatehkia, M., Kashyap, R., & Weber, I. (2018).Using Facebook ad data to track the global digital
gender gap. World Development, 107, 189-209.
García-Dorado, J. L., Ramos, J., Rodríguez, M., & Aracil, J. (2018). DNS weighted footprints
for web browsing analytics. Journal of Network and Computer Applications, 111, 35-48.
38 | P a g e
Haupt, J., Bender, B., Fabian, B., &Lessmann, S. (2018). Robust identification of email tracking:
A machine learning approach. European Journal of Operational Research.
Hwangbo, H., Kim, Y. S., & Cha, K. J. (2018).Recommendation system development for fashion
retail e-commerce. Electronic Commerce Research and Applications, 28, 94-101.
Ireland, R., & Liu, A. (2018). Application of data analytics for product design: Sentiment
analysis of online product reviews. CIRP Journal of Manufacturing Science and
Technology.
Kobusińska, A., Pawluczuk, K., &Brzeziński, J. (2018).Big Data fingerprinting information

analytics for sustainability. Future Generation Computer Systems.
Lee, I. (2018). Social media analytics for enterprises: Typology, methods, and
processes. Business Horizons, 61(2), 199-210.
Li, W., Wang, S., & Bhatia, V. (2016).PolarHub: A large-scale web crawling engine for OGC
service discovery in cyberinfrastructure. Computers, Environment and Urban
Systems, 59, 195-207.
Liu, J. W. (2018). Using big data database to construct new GFuzzy text mining and decision
algorithm for targeting and classifying customers. Computers & Industrial Engineering.
Liu, Y. Y., Tseng, F. M., & Tseng, Y. H. (2018).Big Data analytics for forecasting tourism
destination arrivals with the applied Vector Autoregression model. Technological
Forecasting and Social Change, 130, 123-134.
Muñoz-Leiva, F., Hernández-Méndez, J., & Gómez-Carmona, D. (2018).Measuring advertising

effectiveness in Travel 2.0 websites through eye-tracking technology. Physiology
&behavior.
Nakano, S., & Kondo, F. N. (2018).Customer segmentation with purchase channels and media
touchpoints using single source panel data. Journal of Retailing and consumer
services, 41, 142-152.
39 | P a g e
Rekik, R., Kallel, I., Casillas, J., &Alimi, A. M. (2018).Assessing websites quality: A systematic
literature review by text and association rules mining.International Journal of
Information Management, 38(1), 201-216.
Saleheen, S., & Lai, W. (2018).UIWGViz: An architecture of user interest-based web graph
visualization. Journal of Visual Languages & Computing, 44, 39-57
Serrano, W., &Gelenbe, E. (2018). The Random Neural Network in a neurocomputing

application for Web search. Neurocomputing, 280, 123-134.
Stieglitz, S., Mirbabaie, M., Ross, B., & Neuberger, C. (2018).Social media analytics–
Challenges in topic discovery, data collection, and data preparation. International
Journal of Information Management, 39, 156-168.
Suchacka, G., &Wotzka, D. (2017).Modeling a non-stationary bots’ arrival process at an e-

commerce Web site. Journal of computational science, 22, 198-208.
Wu, P. J., & Lin, K. C. (2018).Unstructured big data analytics for retrieving e-commerce
logistics knowledge. Telematics and Informatics, 35(1), 237-244.
40 | P a g e

4Z-Topic 8 - ITC 571 - Week 10

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

4Z-Topic 8 - ITC 571 - Week 10

Uploaded by

Copyright:

Available Formats

Student Name & CSU

Name- Evaluation of models based on their Customer behaviour prediction:

Liu, et. al., Virtual

Intelligent soft Java API,

Balbi, et. Web 2.0 Social

User interest- web graph

Web 2.0 social Social

Big data fingerprint

Fatehkia, et. Facebook ads Digital

Customer churn Churn

Availability Reliability Usability Tracking Reporting

8 Ireland & Liu,

18 Fatehkia, et. al.,

20 Haupt, et. al.,

18 Fatehkia, et. al.,

20 Haupt, et. al.,

8 Ireland & Liu,

7 Hwangbo, et. al.,  

20 Haupt, et. al.,

Map reduced based

Registering of state of Mapping It evaluates text data

Request the URL

Determination of the goal of E-commerce business

Specifying the URL for crawling

Collection of the crawling data

Extracting links Crawler Frontier

Storing extracted links

Figure-2 crawl the web pages

Determination of the goal of E-commerce business

Specifying the URL for crawling

Collection of the crawling data

Figure 3- map reduced based programming model

Step3: data pre-processing stage

Step 4: enabling the distance factor based matrix

Step5: selection of sample based on the

Step6: defining the issue based on,

Step7: defining the training process,

Step8: validation phase of L and U

L=1..W and U=N-w….N

Input thematrix with m × n elements

Data pre-processing stage

Distance factor enabling Assuming A

Figure4-a flowchart of churn prediction in the social media analytics

For the equation

N= no of the training set

HTTP= Hypertext transfer protocol

URL= Uniform Resource Locator

GUI = Graphical User Interface

The research is focused on the elimination of chances of external malicious attacks. To

For justification of this research, proposed solution is consisting of an equation.

URL de-duplication equation-

Eliminating of malicious attack-

𝒅(𝒑, 𝒒) = √(𝒑𝟏 − 𝒒𝟏 )𝟐 + (𝒑𝟐 − 𝒒𝟐 )𝟐 + (𝒑𝟑 − 𝒒𝟑 )𝟐

(𝒑𝟏 − 𝒒𝟏 )𝟐 + (𝒑𝟐 − 𝒒𝟐 )𝟐 + (𝒑𝟑 − 𝒒𝟑 )𝟐

The above-mentioned equation will focus on security and de-duplication of URLs

Basis of differentiation Proposed solution State of Art solution

Name of technique Honeypot NoSQL based application

Level of accuracy Level of accuracy is more It has the high accuracy in

Feature of work Create a virtual system for Map- reducing based

Algorithm- Invertible Bloom lookup table

Input- p and q as number of instruction and branched respectively

Output- (𝑡𝑎𝑘𝑖𝑛𝑔 𝑜𝑢𝑡𝑝𝑢𝑡 𝑜𝑓 𝑝 𝑎𝑛𝑑 𝑞 𝑎𝑠 𝑖𝑛𝑡𝑟𝑢𝑐𝑡𝑖𝑜𝑛) or 𝑛𝑢𝑙𝑙

Step 2: 𝑤ℎ𝑖𝑙𝑒(0 < 𝑝 𝑎𝑛𝑑 𝑞 <= 1)𝑑𝑜

Input thematrix with m × n elements