
Search and Replication in
Unstructured Peer-to-Peer Networks

Pei Cao
Cisco Systems, Inc.
(Joint work with Christine Lv, Edith Cohen, Kai Li, and Scott Shenker)
Disclaimer
• Results, statements, and opinions in this talk
do not represent Cisco in any way
• This talk is about technical problems in
networking, and does not discuss moral,
legal and other issues related to P2P
networks and their applications
Outline
• Brief survey of P2P architectures
• Evaluation methodologies
• Search methods
• Replication strategies and analysis
• Simulation results
Characteristics of Peer-to-Peer
Networks
• Unregulated overlay network
• Current application: file swapping
• Dynamic: nodes join or leave frequently
• Example systems:
– Napster, Gnutella;
– Freenet, FreeHaven, MojoNation, Alpine, ...
– JXTA, Ohaha, …
– Chord, CAN, “Past”, “Tapestry”, Oceanstore
Architecture Comparisons
• Napster: centralized
– A central directory server holds the file index of all participants
– Very efficient; scales well
– Problem: single point of failure
• Gnutella: decentralized
– No central directory; use “flooding w/ TTL”
– Very resilient against failure
– Problem: Doesn’t scale
Architecture Comparisons
• Various research projects such as CAN:
decentralized, but “structured”
– CAN: distributed hash table
– “Structure”: all nodes participate in a precise
scheme to maintain certain invariants
– Extra work when nodes join and leave
– Scales very well, but can be fragile
Architecture Comparisons
• FreeNet: decentralized, but semi-structured
– Intended for file storage
– Files are stored along a route biased by hints
– Queries for files follow a route biased by the
same hints
– Scales very well
– Problem: would it really work?
• Simulation says yes in most cases, but no proof so
far
Our Focus: Gnutella-Style
Systems
• Advantages of Gnutella:
– Support more flexible queries
• Typically, precise “name” search is a small portion
of all queries
– Simplicity, high resilience against node failures
• Problems of Gnutella: Scalability
– Bottleneck: interrupt rates on individual nodes
– Self-limiting network: nodes have to exit to get
real work done!
Evaluation Methodologies
Simulation based:
• Network topology
• Distribution of object popularity
• Distribution of replication density of objects
Evaluation Methods
• Network topologies:
– Uniform Random Graph (Random)
• Average and median node degree is 4
– Power-Law Random Graph (PLRG)
• max node degree: 1746, median: 1, average: 4.46
– Gnutella network snapshot (Gnutella)
• Oct 2000 snapshot
• max degree: 136, median: 2, average: 5.5
– Two-dimensional grid (Grid)
Modeling Methods
• Object popularity distribution pi
– Uniform
– Zipf-like
• Object replication density distribution ri
– Uniform
– Proportional: ri ∝ pi
– Square-Root: ri ∝ √pi
Evaluation Metrics
• Overhead: average # of messages per node
per query
• Probability of search success: Pr(success)
• Delay: # of hops till success
Load on Individual Nodes
• Why is a node interrupted:
– To process a query
– To route the query to other nodes
– To process duplicated queries sent to it
Duplication in Flooding-Based
Searches
[Figure: flooding tree; node 1 forwards the query to nodes 2, 3, and 4, which forward onward, so overlapping forwards produce duplicates]

• Duplication increases as TTL increases in flooding
• Worst case: a node A is interrupted by N * q * degree(A) messages
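A toy sketch of why duplicates pile up: when every visited node forwards the query to all of its neighbors, each forwarded copy costs an interrupt even if the receiver has already seen the query. All names below are hypothetical, and real Gnutella skips the neighbor a query arrived from; this is only a minimal model.

```python
from collections import deque

def flood(adj, source, ttl):
    """Flood a query with a TTL over an adjacency dict; count total
    messages sent vs. distinct nodes reached."""
    messages = 0
    visited = {source}
    frontier = deque([(source, ttl)])
    while frontier:
        node, t = frontier.popleft()
        if t == 0:
            continue
        for nbr in adj[node]:
            messages += 1            # every forwarded copy is a message
            if nbr not in visited:   # duplicates still cost an interrupt
                visited.add(nbr)
                frontier.append((nbr, t - 1))
    return messages, len(visited)

# Toy 4-node clique: heavy duplication even at TTL=2.
clique = {0: [1, 2, 3], 1: [0, 2, 3], 2: [0, 1, 3], 3: [0, 1, 2]}
msgs, reached = flood(clique, 0, 2)  # 12 messages to reach only 4 nodes
```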
Duplications in Various Network
Topologies
[Chart: % duplicate messages vs. TTL (2–9) under flooding for the Random, PLRG, Gnutella, and Grid topologies; y-axis 0–100%]
Relationship between TTL and
Search Successes
[Chart: Pr(success) vs. TTL (2–9) under flooding for the Random, PLRG, Gnutella, and Grid topologies; y-axis 0–120%]
Problems with Simple TTL-
Based Flooding
• Hard to choose TTL:
– For objects that are widely present in the
network, small TTLs suffice
– For objects that are rare in the network, large
TTLs are necessary
• Number of query messages grows
exponentially as TTL grows
Idea #1: Adaptively Adjust TTL
• “Expanding Ring”
– Multiple floods: start with TTL=1; increment
TTL by 2 each time until search succeeds
• Savings vary by network topology
– For “Random”, 30- to 70-fold reduction in
message traffic
– For Power-law and Gnutella graphs, only
3- to 9-fold reduction
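A minimal sketch of the expanding-ring idea, assuming the graph is an adjacency dict and the inner flood is modeled as an idealized BFS to the TTL radius (all names hypothetical):

```python
from collections import deque

def bfs_within_ttl(adj, source, ttl):
    """Set of nodes reachable from source within ttl hops (idealized flood)."""
    seen = {source: 0}
    q = deque([source])
    while q:
        n = q.popleft()
        if seen[n] == ttl:
            continue
        for nbr in adj.get(n, []):
            if nbr not in seen:
                seen[nbr] = seen[n] + 1
                q.append(nbr)
    return set(seen)

def expanding_ring(adj, source, has_object, max_ttl=9):
    """Start at TTL=1 and grow by 2 until a reached node has the object."""
    ttl = 1
    while ttl <= max_ttl:
        if any(has_object(n) for n in bfs_within_ttl(adj, source, ttl)):
            return ttl   # TTL at which the search succeeded
        ttl += 2
    return None

# Toy line graph 0-1-2-3-4 with the object stored only at node 4.
line = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
```

Popular (nearby) objects are found at TTL=1; rare objects trigger further, larger floods, which is exactly the repeated-coverage cost the next slide quantifies.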
Limitations of Expanding Ring
[Chart: # nodes visited vs. TTL (2–9) under flooding for the Random, PLRG, Gnutella, and Grid topologies; y-axis 0–12000]
Idea #2: Random Walk
• Simple random walk
– takes too long to find anything!
• Multiple-walker random walk
– N walkers after T steps each visit roughly as
many nodes as 1 walker after N*T steps
– When to terminate the search: walkers check back
with the query originator once every C steps
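A sketch of the multiple-walker scheme, with the check-back modeled as polling the originator every `check_every` steps (all names and defaults are illustrative assumptions, not from the talk):

```python
import random

def k_walker_search(adj, source, has_object, k=16, check_every=4,
                    max_steps=10_000):
    """k random walkers step in lockstep; every check_every steps the
    originator is asked whether any walker has found the object, and if
    so the search terminates."""
    walkers = [source] * k
    for step in range(1, max_steps + 1):
        for i in range(k):
            walkers[i] = random.choice(adj[walkers[i]])
        # Periodic check-back with the query originator.
        if step % check_every == 0 and any(has_object(n) for n in walkers):
            return step
    return None
```

Each step sends only k messages in total, so traffic grows linearly with walk length instead of exponentially with TTL; the check-back interval C trades a little extra walking for far fewer control messages.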
Search Traffic Comparison
[Bar chart: avg. # msgs per node per query on the Random and Gnutella graphs for Flood, Ring, and Walk; values shown: 2.85, 1.863, 0.961, 0.053, 0.027, 0.031]
Search Delay Comparison
[Bar chart: # hops till success on the Random and Gnutella graphs for Flood, Ring, and Walk; values shown: 9.12, 7.3, 4.03, 3.4, 2.51, 2.39]
Lessons Learnt about Search
Methods
• Adaptive termination
• Minimize message duplication
• Small expansion in each step
Flexible Replication
• In unstructured systems, search success is
essentially about coverage: visiting enough nodes
to probabilistically find the object => replication
density matters
• Limited node storage => what’s the optimal
replication density distribution?
– In Gnutella, only nodes that query an object store it =>
ri ∝ pi
– What if we use different replication strategies?
Optimal ri Distribution
• Goal: minimize Σi ( pi / ri ), subject to Σi ri = R
• Calculation:
– introduce Lagrange multiplier λ; find ri and λ
that minimize:
Σi ( pi / ri ) + λ * ( Σi ri − R )
=> λ − pi / ri² = 0 for all i
=> ri ∝ √pi
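The result can be checked numerically: under proportional allocation the cost Σ pi/ri is always exactly n, while square-root allocation does strictly better for any non-uniform popularity distribution. The popularity values below are illustrative, not from the talk.

```python
import math

def expected_cost(p, r):
    """Sum of p_i / r_i, proportional to the expected search size."""
    return sum(pi / ri for pi, ri in zip(p, r))

def normalize(w, total=1.0):
    s = sum(w)
    return [total * x / s for x in w]

# Zipf-like popularities for 5 objects (illustrative numbers).
p = normalize([1.0 / rank for rank in range(1, 6)])

proportional = normalize(p)                          # r_i ∝ p_i
square_root = normalize([math.sqrt(x) for x in p])   # r_i ∝ √p_i

cost_prop = expected_cost(p, proportional)  # p_i / (p_i * R) sums to n = 5
cost_sqrt = expected_cost(p, square_root)   # (Σ √p_i)² < n when p non-uniform
```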
Square-Root Distribution
• General principle: to minimize Σi ( pi / ri )
under the constraint Σi ri = R, make ri
proportional to the square root of pi ( ri ∝ √pi )
• Other application examples:
– Bandwidth allocation to minimize expected
download times
– Server load balancing to minimize expected
request latency
Achieving Square-Root
Distribution
• Suggestions from some heuristics
– Store an object at a number of nodes
proportional to the number of nodes visited in
order to find the object
– Each node uses random replacement
• Two implementations:
– Path replication: store the object along the path of a
successful “walk”
– Random replication: store the object randomly among
nodes visited by the agents
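The two implementations can be sketched as follows; the per-node cache layout, capacity, and helper names are assumptions made for illustration only:

```python
import random

def _cache(store, node, obj, capacity):
    """Insert obj into node's cache, evicting a random older entry if full."""
    cache = store.setdefault(node, [])
    if obj not in cache:
        cache.append(obj)
        if len(cache) > capacity:
            cache.remove(random.choice(cache[:-1]))  # random replacement

def path_replicate(path, obj, store, capacity=10):
    """Path replication: copy obj onto every node of the successful walk."""
    for node in path:
        _cache(store, node, obj, capacity)

def random_replicate(visited, obj, store, copies, capacity=10):
    """Random replication: place the same number of copies at random
    nodes among all those visited by the walkers."""
    for node in random.sample(list(visited), min(copies, len(visited))):
        _cache(store, node, obj, capacity)
```

Both variants create a number of copies tied to how hard the object was to find, which is what pushes the replication density toward the square-root distribution; they differ only in where along the visited set the copies land.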
Evaluation of Replication
Methods
• Metrics
– Overall message traffic
– Search delay
• Dynamic simulation
– Assume Zipf-like object query probability
– 5 queries/sec Poisson arrival
– Results measured during 5000–9000 sec
Distribution of ri
[Log-log chart: normalized replication ratio vs. object rank under path replication; the measured “real result” curve is plotted against the square-root distribution for comparison]
Total Search Message
Comparison
[Bar chart: avg. # msgs per node during 5000–9000 sec for Owner Rep, Path Rep, and Random Rep; y-axis 0–60000]

• Observation: path replication is slightly
inferior to random replication
Search Delay Comparison
[CDF chart from the dynamic simulation (5000–9000 sec): % of queries finished vs. # hops (1–256) for Owner, Path, and Random Replication]
Summary
• Multi-walker random walk scales much
better than flooding
– It won’t scale as perfectly as structured
networks, but current unstructured networks can
be improved significantly
• Square-root replication distribution is
desirable and can be achieved via path
replication
