
Short Paper, ACEEE Int. J. on Information Technology, Vol. 3, No. 1, March 2013

An Efficient Cloud based Approach for Service Crawling


Chandan Banerjee 1,2, Anirban Kundu 2,3, Sumon Sadhukhan 1, Rana Dattagupta 4

1 Netaji Subhash Engineering College, Kolkata 700152, India; {chandanbanerjee1, sumon.sadhukhan8}@gmail.com
2 Innovation Research Lab (IRL), Howrah, West Bengal 711103, India; anik76in@gmail.com
3 Kuang-Chi Institute of Advanced Technology, Shenzhen 518057, P.R. China; anirban.kundu@kuang-chi.org
4 Jadavpur University, Kolkata 700032, India; rdattagupta@cse.jdvu.ac.in
Abstract: In this paper, we design a crawler that searches for services provided by different clouds connected in a network. The proposed method provides details of the freshness and age of cloud clusters. The crawler checks each router available in a network providing services. On the basis of the search criteria, our design generates output guiding users to the requested cloud services in an efficient manner. We store the result in an m-way tree and use a traversal technique to extract specific data from the crawling result. We compare the result with other typical search techniques.

Index Terms: cloud crawler, service crawling, cloud search, freshness, age

I. INTRODUCTION

In modern life, the usage of cloud computing is growing rapidly, and cloud users typically rely on specific services. Web search engines [1] crawl the Web and update information world-wide. Nowadays, Internet users are switching from single services to cloud services, which demands greater availability of cloud services. Web crawlers [2] store data after fetching Web pages and cache them in their databases; every crawler stores the crawled result in its database, and the result is searched when it is needed. Search engines [3] are often compared with one another in terms of time complexity and space complexity; the freshness and age of the crawled result are also considerably important. A cloud crawler [4] works with the Internet Protocol (IP) addresses of a cache stored in a tree structure, and hosts are visited using specific threads for specific networks.

Frequently, one needs to maintain local copies of remote data sources for better performance or availability. For example, a Web search engine copies a significant subset of the Web and maintains copies or indexes of the pages to help users access relevant information. In this situation, a part of the local copy may get out of date because changes at the sources are not immediately propagated to the local copy. Therefore, it becomes important to design a good refresh policy that maximizes the freshness of the local copy. As cloud services grow larger, it becomes more important to refresh the data effectively. One critical challenge in the surfacing approach is how a crawler can automatically generate promising queries so that it can carry out efficient surfacing. This challenge has been studied by several researchers [5], [6], [7], [8], [9]; in these methods, candidate query keywords are generated from the obtained records.

Section II presents our proposed framework and the corresponding approach. Experimental analyses are presented in Section III. The final section concludes the paper.

II. FRAMEWORK

We consider several nodes connected to each other in a network fashion. Clusters are formed from several nodes providing distinct services, and the head node is also connected to the network. A cluster may contain private networks recursively. The crawler reaches each end point, takes information from it, and sends the information to the head node. Node A stores the whole result. Boxes indicate networks, and a network may have a sub-network. In what follows, we use an m-way tree traversal technique so that we can reach the destination with minimum path length, and we then show how the technique compares in efficiency with other searching algorithms. To assess the efficiency of the algorithm, we need to understand the freshness and age of a crawler: every crawler has to update its database quickly to produce efficient results, and the terms freshness and age refer to this database.

A. Freshness and Age

A cloud service database is called fresher when it has more up-to-date information than those of other crawlers; for instance, if a crawler crawls more nodes than other crawlers, it is fresher. If a crawler shows a result from 5 minutes ago, then 5 minutes is its age.

1. Freshness
Let S = {n_1, n_2, ..., n_N} be the set of nodes in the network, where N is the number of nodes, and let D_1, D_2, ..., D_N be the services stored on the corresponding nodes. The total freshness of the crawler at time t is

Freshness(S, t) = (1/N) Σ_{i=1}^{N} F(n_i, t),

where F(n_i, t) = 1 if node n_i is updated at time t, and F(n_i, t) = 0 if it is not.
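As a minimal illustration of this definition (not taken from the paper's experiments), the following Python sketch computes the freshness of a crawl from per-node update flags; the flag values are hypothetical.

    def freshness(update_flags):
        """update_flags: list of booleans, one per crawled node.
        Returns (1/N) * sum of F(n_i, t), per the definition above."""
        n = len(update_flags)
        if n == 0:
            return 0.0
        return sum(1 for up in update_flags if up) / n

    # Hypothetical example: 4 of 5 nodes hold up-to-date service records.
    print(freshness([True, True, False, True, True]))  # 0.8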

2. Age
Let {T_1, T_2, ..., T_N} be the set of times at which the information about the corresponding nodes was last taken into account, and let T be the current time. The age of a node is then T - T_n. At time t, if the age of an element is A_i, then

A_i = 0 (if it is updated at t)
A_i = T - T_i (if it is not updated at t)

The total age is A(S, t) = (1/N) Σ_{i=1}^{N} A_i.

A cloud crawler is used to fetch the services for creating a framework of a cloud service crawler engine using proper indexing methodologies. A crawler for a specific service is a program that extracts outward Web links (URLs) and adds them to a list after processing. Thus, a cloud service crawler is a program that fetches as many relevant services as possible for specific users. It uses the Web link structure, in which the order of the list is important, because only high-quality Web pages are considered relevant. Fig. 1 shows the proposed service-based cloud crawler. Here, an element insertion means that the element is inserted at the pointer location within the m-way tree. A special traversal technique is utilized for visiting all the nodes within each network or sub-network: each node is selected twice, and the second time it is popped from the stack. An advantage of our algorithm is that data need not be stored in the client node; the result is sent directly to the crawler server after scanning a single node.
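The two-visit traversal can be sketched with an explicit stack, where a node is selected once when it becomes the top of the stack (its scan result is sent straight to the crawler server, so nothing is stored at the client) and a second time when it is popped. This is a minimal sketch, assuming a dictionary of sub-networks and a hypothetical send_to_server callback.

    def crawl(root, children, send_to_server):
        """children: dict mapping a node to its sub-nodes (sub-networks).
        Each node is selected twice: once when first seen at the top of the
        stack, and once more when it is popped."""
        stack = [root]
        visited = set()
        while stack:
            node = stack[-1]
            if node not in visited:
                visited.add(node)
                send_to_server(node)  # result goes straight to the crawler server
                stack.extend(c for c in children.get(node, []) if c not in visited)
            else:
                stack.pop()           # second selection: popped from the stack

    # Hypothetical sub-network map: node 1 is the head node.
    crawl(1, {1: [2, 3], 2: [4], 3: [], 4: []}, print)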

Fig. 1. Flowchart of Service based Cloud Crawler

At crawling run time, a hash table is built mapping each node to the number (IP address) of resources in the cloud network, as shown in Table II. Our proposed search approach is described in subsection E. The sample network is crawled using the proposed method, with the result shown in Table I.
TABLE I. PROPOSED APPROACH BASED ON FIG. 2

Fig. 2. Arbitrary Cloud Cluster Scenario

B. Sample Procedure for a Sample Network
Fig. 2 shows an arbitrary cloud cluster. There are four network clusters in total within the cloud. Circular boxes indicate the clusters, and rectangular boxes indicate the resources of each cluster network. Table I shows the result based on our proposed approach, following our previous work [4].


C. Hash Table
The hash table is generated from the mapping between each node and the number (IP address) of its resources in the cloud network. Table II is created using real-time crawling.
TABLE II. HASH TABLE BASED ON TABLE I
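A minimal sketch of this construction: during the crawl, each discovered resource is assigned a unique number, so machines with identical private IP addresses in different networks remain distinguishable. The sample entries are hypothetical, since Tables I and II are not reproduced here.

    def build_hash_table(crawl_records):
        """crawl_records: iterable of (network_id, ip_address) pairs.
        Returns a dict mapping a unique node number to its (network, IP) pair."""
        table = {}
        for number, (network_id, ip) in enumerate(crawl_records, start=1):
            table[number] = (network_id, ip)  # unique number even if IPs repeat
        return table

    # Hypothetical: the same private IP can appear in two different clusters.
    records = [("cluster-A", "192.168.0.2"), ("cluster-B", "192.168.0.2")]
    print(build_hash_table(records))
    # {1: ('cluster-A', '192.168.0.2'), 2: ('cluster-B', '192.168.0.2')}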

D. Indexing Result
After the crawler finishes searching the cloud, it stores the result in an m-way tree built from Table II, as shown in Fig. 3.
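A minimal sketch of this indexing step, assuming each tree node holds up to m children and new elements are inserted at the current pointer location as described above; the node numbers follow the hash table, and the fragment built here is a hypothetical slice of the tree shown in Fig. 3.

    class MWayNode:
        def __init__(self, key, m=4):
            self.key = key        # unique node number from the hash table
            self.m = m            # maximum number of children (m-way)
            self.children = []

        def insert(self, key):
            """Insert a child at this node if capacity allows; return the new node."""
            if len(self.children) >= self.m:
                raise ValueError("node already has m children")
            child = MWayNode(key, self.m)
            self.children.append(child)
            return child

    # Hypothetical fragment: head node 1 with a sub-network root 11 holding node 13.
    root = MWayNode(1)
    n11 = root.insert(11)
    n11.insert(13)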

Fig. 3. M-Way Tree

TABLE III. PROPOSED SEARCH APPROACH

E. Search Approach
The algorithm described in Fig. 4 is used to reach any node using the crawling result. Consider that Node 13 is to be visited at a particular time instance. Table III shows the different steps to search for Node 13.
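A minimal sketch of this descent over the indexed m-way tree, recording the node-number path from the head node to the target. The children map is a hypothetical fragment of Fig. 3; only the chain 1, 11, 13 is confirmed by the text.

    # Hypothetical children map for a fragment of the Fig. 3 tree.
    TREE = {1: [2, 11], 2: [], 11: [12, 13], 12: [], 13: []}

    def find_path(tree, root, target):
        """Depth-first descent over the indexed m-way tree; returns the
        node-number path from the head node to the target, or None."""
        if root == target:
            return [root]
        for child in tree.get(root, []):
            sub = find_path(tree, child, target)
            if sub:
                return [root] + sub
        return None

    print(find_path(TREE, 1, 13))  # [1, 11, 13]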

Fig. 4. Flowchart to reach any node using Fig. 2

The shortest path to reach Node 13 is {1, 11, 13}.

III. EXPERIMENTAL ANALYSIS

We know that the time complexities [10], [11] of DFS and BFS are O(|V| + |E|), where V is the set of vertices of the graph and E is the set of edges.

A. Best Case Scenario
1) Breadth First Search (BFS): Total number of nodes visited = M × N, where M is the average number of machines present in each network and N is the level of the tree.
2) Depth First Search (DFS): Total number of nodes visited = N, where N is the level of the tree.
3) Our Proposed Algorithm: Total number of nodes visited = N, where N is the level of the tree.

The best case analysis is shown in Fig. 5, where our algorithm is compared with the typical DFS and BFS methods. The comparative study shows that the number of visited nodes increases with the level of the m-way tree. With our proposed searching method, we can find the shortest path to reach every node.

Fig. 5. Best Case Complexity Comparison
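As a quick illustration (not from the paper), the visited-node formulas above can be evaluated for sample values; this minimal sketch assumes M machines per network and a tree of level N, and the worst-case figures in the next subsection can be tabulated the same way.

    def best_case_visited(m, n):
        # Nodes visited in the best case, per the formulas above.
        return {"BFS": m * n, "DFS": n, "proposed": n}

    print(best_case_visited(m=4, n=3))  # {'BFS': 12, 'DFS': 3, 'proposed': 3}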

B. Worst Case Scenario
1) Breadth First Search (BFS): Total number of nodes visited = M^(N+1).
2) Depth First Search (DFS): Total number of nodes visited = M^(N+1).
3) Our Proposed Algorithm: Total number of nodes visited = N.

Our proposed algorithm therefore achieves the minimum time complexity for reaching any destination node in the worst case analysis. Fig. 6 shows the worst case complexity comparison.

Fig. 6. Worst Case Complexity Comparison
Four clusters have been used for experimental purposes, with tree traversal as shown in Fig. 7, using the cloud crawler based on the IP addresses available in the cache. Threads have been utilized to visit distinct hosts in a concurrent manner. There is no need to store data in the client node, as the result is sent directly to the crawler server while scanning each node. The cloud crawler works with the IP addresses of a cache following an m-way tree structure.

Fig. 7. Crawling Results
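A minimal sketch of the thread-per-network host visiting described above, using Python's standard threading module; the network map and the send_to_server callback are hypothetical.

    import threading

    def scan_network(hosts, send_to_server):
        for ip in hosts:
            send_to_server(ip)  # results go straight to the crawler server

    def crawl_concurrently(networks, send_to_server):
        """networks: dict mapping a network name to its cached host IPs.
        One thread per network, visiting distinct hosts concurrently."""
        threads = [threading.Thread(target=scan_network, args=(hosts, send_to_server))
                   for hosts in networks.values()]
        for t in threads:
            t.start()
        for t in threads:
            t.join()

    # Hypothetical cache of two cluster networks.
    crawl_concurrently({"cluster-A": ["10.0.0.1", "10.0.0.2"],
                        "cluster-B": ["10.0.1.1"]}, print)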

CONCLUSIONS

In our methodology, a hash table is generated in which each resource is assigned a particular number. The hash table is helpful for identifying each node; it is also useful for finding the shortest path to reach any node (resource) listed in the table. The freshness and age of a result can be calculated with the help of the hash table by comparing the past and present results of particular nodes. Machines in different networks may have the same IP address; they can nevertheless be identified through the hash table, because it allocates a unique number to each machine. A minimal number of nodes is visited in the proposed method compared to DFS or BFS.

REFERENCES

[1] S. Brin and L. Page, "The anatomy of a large-scale hypertextual Web search engine," Computer Networks and ISDN Systems, vol. 30, pp. 107-117, 1998.
[2] J. Lu, Y. Wang, J. Liang, J. Chen, and J. Liu, "An approach to deep Web crawling by sampling," in Proceedings of Web Intelligence 2008, pp. 718-724, 2008.
[3] K.-H. Yang, C.-C. Pan, and T.-L. Lee, "Approximate search engine optimization for directory service," in Proceedings of the International Parallel and Distributed Processing Symposium, 2003. Dept. of Comput. Sci. & Inf. Eng., Nat. Taiwan Univ., Taipei, Taiwan.
[4] C. Banerjee, A. Kundu, S. Sadhukhan, S. Bose, and R. Dattagupta, "Service crawling in cloud computing," in 2nd International Conference on Advances in Information Technology and Mobile Communication, CCIS 296, pp. 243-246, Springer-Verlag Berlin Heidelberg.
[5] J. Madhavan, D. Ko, L. Kot, V. Ganapathy, A. Rasmussen, and A. Halevy, "Google's Deep-Web crawl," in Proceedings of VLDB 2008, Auckland, New Zealand, pp. 1241-1252, 2008.
[6] A. Ntoulas, P. Zerfos, and J. Cho, "Downloading textual hidden Web content through keyword queries," in Proceedings of JCDL 2005, Denver, USA, pp. 100-109, 2005.
[7] L. Barbosa and J. Freire, "Siphoning hidden-Web data through keyword-based interfaces," in Proceedings of SBBD 2004, Brasilia, Brazil, pp. 309-321, 2004.
[8] J. Liu, Z. H. Wu, L. Jiang, Q. H. Zheng, and X. Liu, "Crawling deep Web content through query forms," in Proceedings of WEBIST 2009, Lisbon, Portugal, pp. 634-642, 2009.
[9] J. Lu, Y. Wang, J. Liang, J. Chen, and J. Liu, "An approach to deep Web crawling by sampling," in Proceedings of IEEE/WIC/ACM Web Intelligence, Sydney, Australia, pp. 718-724, 2008.
[10] M. Ajtai, "On the complexity of the pigeonhole principle," in Proc. of the 29th FOCS, pp. 346-355, 1988.
[11] T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein, Introduction to Algorithms, 3rd edition, The MIT Press, 2009.

