Enhanced Search in Unstructured Peer-to-Peer Overlay Networks

Chittaranjan Hota¹, Vikram Nunia¹, Mario Di Francesco²,³, Jukka K. Nurminen², and Antti Ylä-Jääski²
¹ Dept. of Computer Science, BITS Pilani Hyderabad Campus, India {vikram,hota}@bits-hyderabad.ac.in
² Dept. of Computer Science and Engineering, Aalto University, Finland {mario.di.francesco,jukka.k.nurminen,antti.yla-jaaski}@aalto.fi
³ Dept. of Computer Science and Engineering, University of Texas at Arlington, USA mariodf@uta.edu
Abstract. Unstructured Peer-to-Peer (P2P) overlays are the most widely used topologies in P2P systems because of their simplicity and very limited control overhead. A P2P overlay specifies the logical connections among peers in a network. Such logical links define the order in which peers are queried in search for a specific resource. The most popular query routing algorithms are based on flooding, thus they do not scale well as each query generates a large amount of traffic. In this paper, we use heuristics to improve overlay search in an unstructured P2P file sharing system. The proposed heuristics effectively decide replica locations for popular resources based on the availability of computing and storage at a given peer, its neighborhood information, and the routing strategy in use. Simulations performed over two different types of unstructured P2P network topologies (i.e., power law and random graphs) show significant improvements over plain flooding in terms of reduced network traffic and search time.

Keywords: Peer-to-peer, search, replication, unstructured, overlay networks.
1 Introduction
In a Peer-to-Peer (P2P) network, participating nodes are both providers and users of services. The usage of P2P applications has grown steadily since their initial development, and recent empirical studies indicate that P2P and web together dominate today's Internet traffic. As reported in [1], P2P traffic accounted for almost 60% of Internet traffic worldwide in 2009. Motivated by the extent of their usage, researchers have focused on studying and improving the scalability and performance of P2P networks. In order to facilitate direct data exchange and service execution between different peers, a logical overlay is usually imposed over the underlying physical network.

There are two classes of P2P overlay networks: structured and unstructured [2]. An unstructured P2P system consists of peers joining the network with some loose rules, without any prior knowledge of the topology. Unstructured P2P networks offer decentralization and simplicity, but may require O(N) hops to search for a file when the network is made of N nodes. In contrast, structured P2P overlay networks tightly control both the network topology and the placement of content. Specifically, the content is stored at specified locations based on distributed hash tables (DHTs), so as to improve the efficiency of the queries. Structured P2P overlays using DHTs are valuable for large-scale distributed applications because of their search efficiency, which is O(log N) for a network of N nodes. However, structured P2P overlays are not very suitable for file searching applications that exploit multiple attributes and involve a large number of peers with a high level of churn.

In this paper, we describe a novel replication heuristic which exploits a proportional replication policy to reduce search load in an unstructured P2P overlay. The replication heuristic considers resource popularity and also provides a query routing strategy. In our solution, replication is achieved by explicitly pushing resources to other peers. The replication heuristic is also enhanced with an intelligent neighbor selection heuristic. By using a power-law function, a peer selects n neighboring peers. Out of those neighbors, the peer further picks the m most preferred peers by using a chi-square similarity measure. The search algorithm used in this paper takes a hybrid approach that is a tradeoff between flooding, which alone is inefficient and does not scale, and a random walk, which could take a long time to find a resource. In our approach, we used k-random walks for resources with a low number of replicas, and a single random walk for resources with a higher number of replicas. Simulation results show that the search load is fairly distributed among the peers and that the cost to locate a resource in the network is very low.

The rest of this paper is organized as follows. Section 2 reviews the search and replication techniques commonly used in unstructured P2P networks. Section 3 details the overlay topology construction, while Sect. 4 presents our replication and neighbor selection heuristics, along with the k-walker search algorithm. Section 5 presents a performance evaluation of our proposed heuristics. Finally, Sect. 6 concludes the work.

J.J. Park et al. (Eds.): GPC 2013, LNCS 7861, pp. 270–279, 2013. © Springer-Verlag Berlin Heidelberg 2013
2 Related Work
In order to reduce unnecessary flooding (which is, however, a widely used search technique in unstructured P2P networks), three major approaches have been proposed in the literature. In the first category, each peer uses heuristics to intelligently decide which peer is likely to provide the resource [3, 4]. In this case, the performance of the heuristics determines the search load. In our approach, we extend the search with a replication heuristic that effectively makes more copies of the most popular resources to reduce the search load. In the second category, a peer caches the resource IDs of other peers as a third-party query is routed through them, and uses these IDs to reduce the search load in subsequent requests [5–7]. The major drawback of indexing is the additional storage requirement at the peers. In our approach, we try to reduce the storage demands at peers by judiciously deciding the number of replicas in a dynamic environment. The third category is based on overlay topology optimization, which has been attempted by many researchers through techniques that include end-system multicast [8] and clustering [9]. Even though clustering may scale, it does not guarantee the search scope. Our approach uses a similarity measure to identify peers that are similar in terms of their resource preferences. As a consequence, it is more likely that search queries will get a better response from well-connected (similar) peers.

Most P2P systems only construct peer connections according to the network constraints, and do not take user preferences into consideration. As an exception, in [10, 11] social overlays were used to find similar peers and build clusters accordingly in order to improve search. In these approaches, a peer selects another peer based on their similar interest in searching files. This social linking reduces the search load on other peers, as the file requests have a high probability of being fulfilled by the neighboring peer. In our approach, we used the chi-square statistics used in [10] to compute a similarity measure between two peers. In this work, we extend our earlier work in [12] by using a power-law distribution to compute the node degree and then select the peers which are similar.
3 Overlay Topology Construction
First, we create a P2P overlay topology, wherein each peer has a certain number of neighbors. We use two types of networks in our simulation:

1. Power-Law Graph (PLG): The node degrees follow a power-law distribution: when ranked from the most connected to the least connected, the i-th most connected node has $C/i^{\alpha}$ neighbors, where C is a constant and $\alpha$ is a scaling factor such that $0 < \alpha < 1$. In the following, we will set $\alpha = 0.7$. Once the node degree n is chosen, nodes are connected with the m best neighbors as described later.
2. Random Graph (RG): The node degrees are calculated randomly, and nodes are connected with the m best nodes out of n nodes.

Many real-life P2P overlay networks are random graphs, i.e., neighbors are selected randomly. A plausible alternative to random overlay networks is to build a network based on a measure of similarity between the users' resources [10]. Solutions available in the literature have already exploited social relations to find the similarity of peers, such as in [10, 13]. We use an approach such as the one in [10] for computing the similarity measure between two peers. However, in contrast with that work, we use file types instead of the style of files. The key observation is that, although the styles of files downloaded by two peers may be the same, the content within the files may be different, hence the file type provides a better similarity measure.

Each user is identified by a vector denoting the probability of sharing a file of each type. To this end, we first determine the background probability of a file being of a specific type, based on the type distributions in the entire network. The background probability is obtained as

$$P_b(F_j) = \frac{F_j}{n},$$

where $F_j$ is the number of resources of type j and n is the total number of files shared in the network. We calculate a similar probability for a user sharing the same type of file, namely,

$$P_u(F_j) = \frac{F_j}{\sigma_u},$$

where $\sigma_u$ is the total number of files shared by user u. Finally, we also calculate the sharing probability for a user as

$$P_u(S) = \frac{d_u}{d_n},$$

where $d_n$ is the total data shared in the network and $d_u$ is the total data shared by user u. Given the number of downloads $d_u$ and the downloaded amount of data $D_u$ of a user, we can then calculate the expected number of file types downloaded by a user for the background probability, the similarity probability and the sharing probability, as shown below:

$$E_b(F_j) = P_b(F_j)\, d_u, \qquad E_u(F_j) = P_u(F_j)\, d_u, \qquad E_u(S) = P_u(S)\, D_u$$

By using the expected values computed above, we can then calculate two chi-square statistics to determine how similar the downloads of a user are to the background type distribution and to the user's own shared distribution. In detail,

$$X_z^2 = \sum_{F_j} \frac{\left(d_{F_j} - E_z(F_j)\right)^2}{E_z(F_j)}, \quad z \in \{u, b\} \qquad (1)$$

where $d_{F_j}$ is the total number of file downloads of type j. By using the difference between the two statistics, we can determine if a user is more like the network or more like the library of shared files [10]. If the user is more like the network, then the user will be connected with the m neighbors who have shared a large number of files and a large amount of data, by using $E_b(F_j)$ and $E_u(S)$. Otherwise, the user will be connected with the m most similar nodes. We define the expected number of files that a sharer u provides to a downloader d as:

$$E(u, d) = \sum_{S_{uj}} P_d(F_j)\, |F_u(f_j)| \qquad (2)$$

where $S_{uj}$ is the number of files of type j shared by user u, $P_d(F_j)$ is the probability of file type j being downloaded by a peer d, and $F_u(f_j)$ is the set of files of type j shared by user u and not already owned by d. For each downloader we can rank every other user based on the expected number of new files they might provide. Using this ranked list, we can select the m best neighbors for a user.
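The chi-square statistic of Eq. (1) and the ranking based on Eq. (2) can be illustrated with a minimal Python sketch. All function names, the dictionary-based data layout, and the toy numbers are hypothetical and chosen only for illustration. The text does not state how the difference between the two statistics is thresholded, so the sketch assumes that the smaller statistic marks the distribution the user resembles more.

```python
from typing import Dict, List, Set


def expected_counts(type_probs: Dict[str, float], num_downloads: int) -> Dict[str, float]:
    """Expected downloads per file type: E_z(F_j) = P_z(F_j) * d_u."""
    return {ftype: prob * num_downloads for ftype, prob in type_probs.items()}


def chi_square(observed: Dict[str, int], expected: Dict[str, float]) -> float:
    """Chi-square statistic of Eq. (1) over all file types with a positive expectation."""
    return sum(
        (observed.get(ftype, 0) - exp) ** 2 / exp
        for ftype, exp in expected.items()
        if exp > 0  # types with zero expectation are skipped to avoid dividing by zero
    )


def expected_new_files(shared: Dict[str, Set[str]], owned: Set[str],
                       download_probs: Dict[str, float]) -> float:
    """Eq. (2): for each type j, P_d(F_j) times the files of type j the downloader lacks."""
    return sum(
        download_probs.get(ftype, 0.0) * len(files - owned)
        for ftype, files in shared.items()
    )


def best_neighbors(candidates: Dict[str, Dict[str, Set[str]]], owned: Set[str],
                   download_probs: Dict[str, float], m: int) -> List[str]:
    """Rank sharers by expected new files (descending) and keep the first m."""
    ranked = sorted(
        candidates,
        key=lambda user: expected_new_files(candidates[user], owned, download_probs),
        reverse=True,
    )
    return ranked[:m]


# Toy data (hypothetical): a user with 10 downloads and three file types.
downloads = {"music": 6, "movie": 3, "misc": 1}
background = {"music": 0.4, "movie": 0.3, "misc": 0.3}   # P_b(F_j)
own_library = {"music": 0.6, "movie": 0.3, "misc": 0.1}  # P_u(F_j)
chi_b = chi_square(downloads, expected_counts(background, 10))
chi_u = chi_square(downloads, expected_counts(own_library, 10))
# Assumption: the smaller statistic marks the distribution the user resembles more.
more_like_network = chi_b < chi_u

sharers = {
    "A": {"music": {"a1", "a2", "a3"}, "movie": {"m1"}},
    "B": {"music": {"a1"}, "misc": {"x1", "x2"}},
    "C": {"movie": {"m1", "m2"}},
}
top_two = best_neighbors(sharers, owned={"a1", "m1"}, download_probs=own_library, m=2)
```

In this toy case the downloads match the user's own library exactly, so the user would be connected to the most similar nodes rather than to the largest sharers; sharer A ranks first because it offers two music files the downloader does not yet own.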
4 Algorithms for Replication and Search
In this section we will describe the heuristics behind the search and replication algorithms.
Algorithm 1. k-walker file search heuristics

Search file f in shared folder;
if f found then
    // Call replication algorithm, save nodes and exit
    f_nodes ← File_Replicate(f);
    FR ← FR ∪ {self} ∪ {f_nodes};
    return FR to source;
if source is self then
    // Add file to request list and start walkers
    RL ← RL ∪ {f};
    calculate k and create k-walkers;
    foreach walker w in k do
        // Forward query to neighbor N
        randomly select neighbor N;
        N.F_Search(f, source, 1);
else
    // Check whether the file was found or not
    if check == CHECK then
        if source.check_file(f) == True then
            exit;
        check ← 0;
    check ← check + 1;
    // Forward query to neighbor N
    randomly select neighbor N;
    N.F_Search(f, source, check);
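Algorithm 1 can be approximated by a small synchronous simulation in Python. This is only a sketch under several assumptions: the names `num_walkers` and `k_walker_search` are hypothetical, the overlay is a dictionary of adjacency lists in which every node has at least one neighbor, the replication step is omitted, and the number of walkers from Eq. (4) in Sect. 4.1 is rounded up and clamped to the range 1 to K (the rounding rule is not given in the text).

```python
import math
import random
from typing import Dict, List, Optional, Set, Tuple


def num_walkers(K: int, replication_prob: float) -> int:
    """k = K * (1 - P); rounding up and clamping to [1, K] is an assumption."""
    return max(1, min(K, math.ceil(K * (1.0 - replication_prob))))


def k_walker_search(graph: Dict[str, List[str]], files: Dict[str, Set[str]],
                    source: str, wanted: str, K: int, replication_prob: float,
                    check_every: int = 2, ttl: int = 1000,
                    rng: Optional[random.Random] = None) -> Tuple[Optional[str], int]:
    """Return (node holding the file or None, number of messages exchanged)."""
    rng = rng or random.Random()
    if wanted in files[source]:
        return source, 0  # found in the local shared folder, no traffic needed
    k = num_walkers(K, replication_prob)
    walkers = [source] * k  # current position of every active walker
    holder = None
    messages = 0
    for hop in range(1, ttl + 1):
        still_walking = []
        for node in walkers:
            if hop % check_every == 0:
                messages += 1  # walker asks the requester whether f was already found
                if holder is not None:
                    continue   # requester is satisfied: this walker terminates
            nxt = rng.choice(graph[node])
            messages += 1      # query forwarded to a randomly selected neighbor
            if wanted in files[nxt]:
                if holder is None:
                    holder = nxt
                messages += 1  # result returned to the requester
            else:
                still_walking.append(nxt)
        walkers = still_walking
        if not walkers:
            break
    return holder, messages


# Two-node toy overlay (hypothetical): "b" shares file "f", "a" asks for it.
toy_graph = {"a": ["b"], "b": ["a"]}
toy_files = {"a": set(), "b": {"f"}}
hit = k_walker_search(toy_graph, toy_files, "a", "f", K=3, replication_prob=0.0)
miss = k_walker_search(toy_graph, toy_files, "a", "g", K=1, replication_prob=1.0, ttl=4)
```

The large TTL only bounds the walk length; in the sketch, as in the algorithm, walkers normally stop because the periodic check with the requester reports that the file was already found.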
4.1 Search Heuristic
To avoid the message overhead of flooding, unstructured P2P networks use different types of random walks. In a random walk, a single query message is sent to a randomly selected neighbor. We call this message a walker. A walker has a TTL value that is decremented at each hop. If the query finds the desired resource at some node, the search terminates successfully. If the query fails, as determined by a timeout or a failure message from the node that last received the query, the initiating peer chooses another random path. The standard random walk, which uses only one walker, can cut down the message overhead by one order of magnitude compared to flooding [14]. However, there is also an order of magnitude increase in the delay perceived by the user. To reduce the delay, we increase the number of walkers as in [14, 15]. That is, instead of just sending out one query message, a requesting node sends k query messages in parallel. More walkers find resources faster, but also generate more traffic when the number of replicas in the network is low. Furthermore, when the number of walkers is high enough, increasing it further only slightly reduces the number of hops, but significantly increases the traffic.

For every search request, the value of k depends on the replication probability calculated at the requesting node as follows. Let $F_u$ be the number of requests for a particular file type by user u, and $\tau_u$ the total number of requests made by user u. Then, the replication probability is:

$$P(R_{F_u}) = \frac{F_u}{\tau_u} \qquad (3)$$
The key idea behind the choice of k is that its value should be lower when the replication probability is higher. In other words, k = 1 when a resource has the highest number of replicas, thus implying that only one walker is good enough to locate that resource. On the other hand, k will be maximum if no replica exists. Specifically, the value of k is expressed as:

$$k = K \left(1 - P(R_{F_u})\right) \qquad (4)$$

where K is a constant that defines the maximum number of walkers to be used in the search. As multiple random walks require some mechanism to terminate, each walker periodically checks with the original requester before walking to the next node. This method still uses a TTL, but the TTL is very large and is mainly used to prevent loops. Since there is a fixed number of walkers (between 1 and K), the walkers checking back with the requester will not lead to message implosion at the requester node. Of course, checking does have overhead; each check requires a message exchange between a node and the requester node. Indeed, simulation experiments in the next section show that checking once every Max_hop_check steps along the way achieves a good balance between the message overhead and the benefits of checking. The k-walker search heuristics is illustrated by Algorithm 1.

4.2 Replication Heuristics
File replication involves storing replicas of files in nodes other than the one sharing them. Replication improves the query success rate and reduces latency by making the shared files more likely to be available in the path of a search walk. In the following, we propose a proportional replication strategy coupled to the k-random walk search described in the previous section. Since there is no well-known correlation between file popularity and the capacity of the nodes storing those files, a 1-hop replication scheme is biased against files shared by peers with low capacity [16]. In a 1-hop scheme, replicas are stored on immediate neighbors. Our scheme overcomes this problem by replicating popular files at the nodes with high capacity, and by regulating the number of random walks dynamically. More random walkers are used when there are fewer replicas, while fewer walkers are used when there are more replicas in the overlay network.

The replication algorithm works as follows. Let R be the maximum number of replicas. In our implementation we use a proportional replication strategy, i.e., files are replicated in proportion to the querying rate. If a resource is queried many times, more replicas should exist to reduce the associated search load. When a file f is found, the corresponding peer calculates the number of replicas $r_f$ of f to be created as:

$$r_f = \begin{cases} R\, P(R_{f_n}) & \text{if } P(R_{f_n}) < \bar{P}, \\ & \text{otherwise.} \end{cases} \qquad (5)$$

In Eq. 5, $P(R_{f_n})$ is the replication probability, R the maximum number of replicas and $\bar{P}$ the average replication probability. The replication probability determines the actual number of replicas to create for a given file f. Specifically, $P(R_{f_n})$ is obtained as:

$$P(R_{f_n}) = \frac{f_n}{\tau_n} \qquad (6)$$
Algorithm 2. Replication heuristics

if owner_f != self then
    // send request to the owner
    return f_nodes ← owner_f.File_replicate(f);
k ← 0; i ← 0;
foreach node i which accessed f do
    calculate ρ for node i;
    A[k] ← ρ;
    k ← k + 1;
Sort A in decreasing order;
Calculate r_f;
foreach k in 0 to r_f − 1 do
    f_nodes[k] ← A[i];
    // Check if res available on f_nodes[k]
    if Check_resources(f_nodes[k]) == False then
        k ← k − 1; // Ignore it
    i ← i + 1;
// Replicate on nodes not having replica of file f
foreach node n in {f_nodes} \ {R_nodes} do
    replicate f;
// Delete replica of f from nodes not in f_nodes
foreach node n in {R_nodes} \ {f_nodes} do
    delete f;
In Eq. 6, $f_n$ is the number of requests for file f on node n, and $\tau_n$ is the total number of requests on node n. Now, node n calculates the access probability $\rho$ for each node, and stores the values in decreasing order in a sorted array. The value $\rho$ is the probability that the file to be replicated will be accessed by the peer on which it will be replicated. A high probability means that the file has been accessed more times by that node, which will probably access the file more in the future too. The probability value is then $\rho = A_{f_j} / A_f$, where $A_{f_j}$ is the number of accesses to the file f by node j, and $A_f$ is the total number of accesses to file f. The probability $\rho$ is calculated for each node which accessed file f and stored in decreasing order in an array. The node n will select the first $r_f$ nodes from the sorted array which have enough resources to accommodate file f. Here the considered resources include secondary storage space, main memory and CPU load. After checking resources, we replicate file f only on the peers on which it has not already been replicated. Moreover, file f is deleted from the replica nodes that are no longer selected, i.e., those with a smaller value of $\rho$. This makes our algorithm dynamic in nature. The replication heuristics is described in Algorithm 2.
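The node selection part of Algorithm 2 can be sketched in Python as follows. The function names, the `has_resources` callback, and the node identifiers are hypothetical. The number of replicas is taken as an input rather than computed, and the resource check (storage, memory, CPU load) is reduced to a boolean predicate, so this is an illustration of the ranking and set-difference steps rather than a full implementation.

```python
from typing import Callable, Dict, List, Set, Tuple


def access_probabilities(accesses: Dict[str, int]) -> Dict[str, float]:
    """Share of all accesses to file f that came from each node (A_fj / A_f)."""
    total = sum(accesses.values())
    if total == 0:
        return {}
    return {node: count / total for node, count in accesses.items()}


def plan_replicas(accesses: Dict[str, int], num_replicas: int,
                  has_resources: Callable[[str], bool],
                  current_holders: Set[str]) -> Tuple[List[str], Set[str], Set[str]]:
    """Pick replica nodes and derive where to copy and where to delete file f.

    Returns (selected nodes, nodes to replicate on, nodes to delete from).
    """
    probs = access_probabilities(accesses)
    # Nodes in decreasing order of access probability (ties keep insertion order).
    ranked = sorted(probs, key=lambda node: probs[node], reverse=True)
    selected: List[str] = []
    for node in ranked:
        if len(selected) == num_replicas:
            break
        if has_resources(node):  # nodes lacking resources are ignored
            selected.append(node)
    selected_set = set(selected)
    to_replicate = selected_set - current_holders  # selected, but no replica yet
    to_delete = current_holders - selected_set     # replica held, but no longer selected
    return selected, to_replicate, to_delete


# Toy example (hypothetical): n1 accessed f most often but has no spare capacity.
selected, to_replicate, to_delete = plan_replicas(
    accesses={"n1": 5, "n2": 3, "n3": 2, "n4": 0},
    num_replicas=2,
    has_resources=lambda node: node != "n1",
    current_holders={"n3", "n4"},
)
```

Computing the two set differences explicitly keeps the replica set in step with the current access pattern: a node that stops requesting the file eventually drops out of the selection and its replica is removed.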
5 Simulation Results
We performed experiments on a network of 120 peers with a power-law degree distribution, where node degrees remained constant during the network lifetime. There were 100 distinct items or files on each peer, and initially no replica of these files was available at any other peer. To simulate our algorithms, we started with 120 peers and connected them randomly. During an initial transient phase, each peer performed 100 queries and no replication was performed. At the steady state, nodes were connected with the topology construction algorithm explained in Sect. 3. Unless otherwise stated, the number of walkers K was set to 3 and the maximum number of replicas R for each file was set to 3. The terminating condition was checked every two hops in the k-walker searching algorithm. Every 100 queries each peer was disconnected and reconnected to the best neighbors.

In our simulation, out of 100 files on every peer, 40 files are music files, 30 are movie files and the remaining 30 are miscellaneous files such as data files, pictures, and so on. We generated random requests according to this distribution. For comparison with Gnutella we implemented the flooding algorithm with a TTL value large enough to search every resource in the network.

To evaluate the search efficiency of the system, we considered the following metrics: the search scope, as the number of peers/hops a successful walker traverses during a search; the replication ratio, as the ratio of the total files replicated at a given peer to the total number of files on that peer; and the traffic cost, as the total number of messages generated by the walkers for searching a file plus the overhead traffic (e.g., asking for resource information or replicas).

Simulations confirmed that k walkers after T steps reach roughly the same number of nodes as one walker after kT steps. Hence, by using k walkers, we can expect to improve the response time by a factor of k. We performed experiments with a different number of walkers and plotted the search scope as a function of the number of queries for different values of maximum replicas in Fig. 1. We can see that initially the search scope is maximum, i.e., a successful walker traverses the most hops on average. As the number of queries increases, more replicas are created, thus decreasing the average search scope. Furthermore, from the figure it emerges that the search scope does not significantly change when increasing K from 5 to 6, while the traffic clearly increases in the latter case. Finally, we can see that the average search scope does not actually depend on the considered values of R. Therefore, in the following, we will consider K = 5 and R = 6.

Fig. 1. Average search scope for different values of maximum replicas: (a) R = 5, (b) R = 6. [Plots of average search scope versus number of queries (200 to 1,000) for K = 3, 4, 5, 6.]

To measure the effectiveness of our heuristics, we also compared our solution with the standard flooding algorithm used in Gnutella, as shown in Fig. 2. The average number of nodes visited and the traffic cost per query are significantly lower with our proposed approach.

Fig. 2. Comparison of the k-walker algorithm and Gnutella in terms of (a) visited nodes and (b) traffic cost. [Plots of number of visited nodes and traffic cost per query versus number of queries (200 to 1,000) for Gnutella, K = 5, and K = 6.]

Figure 3a shows the replication ratio on two representative nodes, namely N65 and N119, after running 1,000 queries. From the plot we can observe that for N65 only 9% of files are replicated with the maximum factor. Similarly, at node N119 only 5% of files are replicated on 6 nodes, 15% of files on 5 nodes and so on. Figure 3b shows the number of satisfied queries as a function of the hop distance for a sample node, namely, node N85. We can notice that the number of satisfied queries increases with the number of resources requested, and the increase is more significant when the number of hops is lower. As a consequence, most queries can be satisfied within two or three hops in all cases, and thus with low delay.

Fig. 3. (a) Replication ratio for nodes N65 and N119. (b) Satisfied queries for a sample node (N85) as a function of the traversed hops. [Panel (a): replicated files (%) versus number of replicas r_f (0 to 6); panel (b): satisfied queries (%) versus number of traversed hops (0 to 6) after 100, 500, and 1,000 queries.]
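For reference, the flooding baseline and its traffic cost can be sketched in a few lines of Python. This is not the simulator used for the experiments: the function name `flood`, the adjacency-list overlay, and the toy ring topology are hypothetical, and the sketch assumes that a node forwards a query to all neighbors except the sender and drops queries it has already seen, while every transmission still counts as one message.

```python
from collections import deque
from typing import Dict, List, Set, Tuple


def flood(graph: Dict[str, List[str]], files: Dict[str, Set[str]],
          source: str, wanted: str, ttl: int) -> Tuple[int, int, Set[str]]:
    """Return (number of visited nodes, messages sent, nodes holding the file)."""
    visited = {source}
    holders = {source} if wanted in files[source] else set()
    messages = 0
    queue = deque([(source, None, ttl)])  # (node, sender of the query, remaining TTL)
    while queue:
        node, sender, remaining = queue.popleft()
        if remaining == 0:
            continue  # TTL expired: the query is not forwarded any further
        for neighbor in graph[node]:
            if neighbor == sender:
                continue  # never send the query back to where it came from
            messages += 1
            if neighbor in visited:
                continue  # duplicate query: it cost a message but is dropped
            visited.add(neighbor)
            if wanted in files[neighbor]:
                holders.add(neighbor)
            queue.append((neighbor, node, remaining - 1))
    return len(visited), messages, holders


# Four peers on a ring (hypothetical); only "c" shares file "f".
ring = {"a": ["b", "d"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c", "a"]}
shared = {"a": set(), "b": set(), "c": {"f"}, "d": set()}
full_flood = flood(ring, shared, "a", "f", ttl=10)
one_hop = flood(ring, shared, "a", "f", ttl=1)
```

Even on this tiny ring, reaching all four peers costs five messages, two of which are duplicates; the gap to a random walk grows quickly with the node degree, which is consistent with the traffic cost difference reported in Fig. 2.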
6 Conclusion
In this paper, we proposed replication and search heuristics to reduce the load for searching resources in unstructured peer-to-peer (P2P) systems. For the connection phase, we selected the best nodes depending on the previous history, and dynamically adapted to the changing requests. After the connection phase, we proposed the k-walker algorithm, which dynamically determines the number of walkers and searches the files with a low network overhead. We used a proportional replication scheme, built on top of the popularity of a file, that is adaptive by nature. Experimental evaluation has shown that our techniques are effective at improving search efficiency.
References
[1] Schulze, H., Mochalski, K.: ipoque GmbH Internet Study (2008/2009), http://www.ipoque.com/sites/default/files/mediafiles/documents/internet-study-2008-2009.pdf (retrieved January 28, 2013)
[2] Tigelaar, A.S., Hiemstra, D., Trieschnigg, D.: Peer-to-peer information retrieval: An overview. ACM Trans. Inf. Syst. 30(2), 9:1–9:34 (2012)
[3] Zhuang, Z., Liu, Y., Xiao, L., Ni, L.: Hybrid periodical flooding in unstructured peer-to-peer networks. In: Proc. of ICPP 2003, pp. 171–178 (October 2003)
[4] Haribabu, K., Reddy, D., Hota, C., Ylä-Jääski, A., Tarkoma, S.: Adaptive lookup for unstructured peer-to-peer overlays. In: Proc. of COMSWARE 2008, pp. 776–782 (January 2008)
[5] Xiao, L., Liu, Y., Ni, L.: Improving unstructured peer-to-peer systems by adaptive connection establishment. IEEE Trans. on Computers 54(9), 1091–1103 (2005)
[6] Haribabu, K., Hota, C., Ylä-Jääski, A.: Indexing through querying in unstructured peer-to-peer overlay networks. In: Ma, Y., Choi, D., Ata, S. (eds.) APNOMS 2008. LNCS, vol. 5297, pp. 102–111. Springer, Heidelberg (2008)
[7] Patro, S., Hu, Y.: Transparent query caching in peer-to-peer overlay networks. In: Proc. of Parallel and Distributed Processing Symposium (April 2003)
[8] Chu, Y., Rao, S., Seshan, S., Zhang, H.: A case for end system multicast. IEEE Journal on Selected Areas in Communications 20(8), 1456–1471 (2002)
[9] Nakao, A., Peterson, L., Bavier, A.: A routing underlay for overlay networks. In: Proc. of SIGCOMM 2003, pp. 11–18 (2003)
[10] Fast, A., Jensen, D., Levine, B.N.: Creating social networks to improve peer-to-peer networking. In: Proc. of ACM SIGKDD 2005, pp. 568–573 (2005)
[11] Lin, C.J., Chang, Y.T., Tsai, S.C., Chou, C.F.: Distributed social-based overlay adaptation for unstructured P2P networks. In: IEEE Global Internet Symposium, pp. 1–6 (May 2007)
[12] Hota, C., Nunia, V., Ylä-Jääski, A.: Distributed algorithms for improving search efficiency in peer-to-peer overlays. International Journal of Computer Networks and Information Security 4(3), 1–7 (2012)
[13] Cholvi, V., Felber, P., Biersack, E.: Efficient search in unstructured peer-to-peer networks. In: Proc. of SPAA 2004, pp. 271–272 (2004)
[14] Lv, Q., Cao, P., Cohen, E., Li, K., Shenker, S.: Search and replication in unstructured peer-to-peer networks. In: Proc. of ICS 2002, pp. 84–95 (2002)
[15] Kitamura, H., Fujita, S.: A biased k-random walk to find useful files in unstructured peer-to-peer networks. In: 2009 International Conference on Parallel and Distributed Computing, Applications and Technologies, pp. 210–216 (December 2009)
[16] Chawathe, Y., Ratnasamy, S., Breslau, L., Lanham, N., Shenker, S.: Making gnutella-like P2P systems scalable. In: Proc. of SIGCOMM 2003, pp. 407–418 (2003)
