
Search and Replication in
Unstructured Peer-to-Peer Networks

Pei Cao
Cisco Systems, Inc.
(Joint work with Christine Lv, Edith Cohen, Kai Li, and Scott Shenker)
Disclaimer
• Results, statements, and opinions in this talk
do not represent Cisco in any way
• This talk is about technical problems in
networking, and does not discuss moral,
legal and other issues related to P2P
networks and their applications
Outline
• Brief survey of P2P architectures
• Evaluation methodologies
• Search methods
• Replication strategies and analysis
• Simulation results
Characteristics of Peer-to-Peer
Networks
• Unregulated overlay network
• Current application: file swapping
• Dynamic: nodes join or leave frequently
• Example systems:
– Napster, Gnutella;
– Freenet, FreeHaven, MojoNation, Alpine, ...
– JXTA, Ohaha, …
– Chord, CAN, “Past”, “Tapestry”, Oceanstore
Architecture Comparisons
• Napster: centralized
– A central directory server holds the file index of all participants
– Very efficient; scales well
– Problem: single point of failure
• Gnutella: decentralized
– No central directory; use “flooding w/ TTL”
– Very resilient against failure
– Problem: Doesn’t scale
Architecture Comparisons
• Various research projects such as CAN:
decentralized, but “structured”
– CAN: distributed hash table
– “Structure”: all nodes participate in a precise
scheme to maintain certain invariants
– Extra work when nodes join and leave
– Scales very well, but can be fragile
Architecture Comparisons
• FreeNet: decentralized, but semi-structured
– Intended for file storage
– Files are stored along a route biased by hints
– Queries for files follow a route biased by the
same hints
– Scales very well
– Problem: would it really work?
• Simulation says yes in most cases, but no proof so
far
Our Focus: Gnutella-Style
Systems
• Advantages of Gnutella:
– Support more flexible queries
• Typically, precise “name” search is a small portion
of all queries
– Simplicity, high resilience against node failures
• Problems of Gnutella: Scalability
– Bottleneck: interrupt rates on individual nodes
– Self-limiting network: nodes have to exit to get
real work done!
Evaluation Methodologies
Simulation based:
• Network topology
• Distribution of object popularity
• Distribution of replication density of objects
Evaluation Methods
• Network topologies:
– Uniform Random Graph (Random)
• Average and median node degree is 4
– Power-Law Random Graph (PLRG)
• max node degree: 1746, median: 1, average: 4.46
– Gnutella network snapshot (Gnutella)
• Oct 2000 snapshot
• max degree: 136, median: 2, average: 5.5
– Two-dimensional grid (Grid)
Modeling Methods
• Object popularity distribution pi
– Uniform
– Zipf-like
• Object replication density distribution ri
– Uniform
– Proportional: ri ∝ pi
– Square-Root: ri ∝ √pi
Evaluation Metrics
• Overhead: average # of messages per node
per query
• Probability of search success: Pr(success)
• Delay: # of hops till success
Load on Individual Nodes
• Why is a node interrupted:
– To process a query
– To route the query to other nodes
– To process duplicated queries sent to it
Duplication in Flooding-Based
Searches
[Figure: flooding tree; node 1 forwards the query to nodes 2, 3, and 4, which forward onward, so overlapping forwards produce duplicates]

• Duplication increases as TTL increases in flooding
• Worst case: a node A is interrupted by N * q * degree(A) messages
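A toy sketch of why duplicates pile up: when every visited node forwards the query to all of its neighbors, each forwarded copy costs an interrupt even if the receiver has already seen the query. All names below are hypothetical, and real Gnutella skips the neighbor a query arrived from; this is only a minimal model.

```python
from collections import deque

def flood(adj, source, ttl):
    """Flood a query with a TTL over an adjacency dict; count total
    messages sent vs. distinct nodes reached."""
    messages = 0
    visited = {source}
    frontier = deque([(source, ttl)])
    while frontier:
        node, t = frontier.popleft()
        if t == 0:
            continue
        for nbr in adj[node]:
            messages += 1            # every forwarded copy is a message
            if nbr not in visited:   # duplicates still cost an interrupt
                visited.add(nbr)
                frontier.append((nbr, t - 1))
    return messages, len(visited)

# Toy 4-node clique: heavy duplication even at TTL=2.
clique = {0: [1, 2, 3], 1: [0, 2, 3], 2: [0, 1, 3], 3: [0, 1, 2]}
msgs, reached = flood(clique, 0, 2)  # 12 messages to reach only 4 nodes
```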
Duplications in Various Network
Topologies
[Chart: % duplicate messages vs. TTL (2–9) under flooding for the Random, PLRG, Gnutella, and Grid topologies; y-axis 0–100%]
Relationship between TTL and
Search Successes
[Chart: Pr(success) vs. TTL (2–9) under flooding for the Random, PLRG, Gnutella, and Grid topologies; y-axis 0–120%]
Problems with Simple TTL-
Based Flooding
• Hard to choose TTL:
– For objects that are widely present in the
network, small TTLs suffice
– For objects that are rare in the network, large
TTLs are necessary
• Number of query messages grows
exponentially as TTL grows
Idea #1: Adaptively Adjust TTL
• “Expanding Ring”
– Multiple floods: start with TTL=1; increment
TTL by 2 each time until search succeeds
• Savings vary by network topology
– For “Random”, 30- to 70-fold reduction in
message traffic
– For Power-law and Gnutella graphs, only
3- to 9-fold reduction
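A minimal sketch of the expanding-ring idea, assuming the graph is an adjacency dict and the inner flood is modeled as an idealized BFS to the TTL radius (all names hypothetical):

```python
from collections import deque

def bfs_within_ttl(adj, source, ttl):
    """Set of nodes reachable from source within ttl hops (idealized flood)."""
    seen = {source: 0}
    q = deque([source])
    while q:
        n = q.popleft()
        if seen[n] == ttl:
            continue
        for nbr in adj.get(n, []):
            if nbr not in seen:
                seen[nbr] = seen[n] + 1
                q.append(nbr)
    return set(seen)

def expanding_ring(adj, source, has_object, max_ttl=9):
    """Start at TTL=1 and grow by 2 until a reached node has the object."""
    ttl = 1
    while ttl <= max_ttl:
        if any(has_object(n) for n in bfs_within_ttl(adj, source, ttl)):
            return ttl   # TTL at which the search succeeded
        ttl += 2
    return None

# Toy line graph 0-1-2-3-4 with the object stored only at node 4.
line = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
```

Popular (nearby) objects are found at TTL=1; rare objects trigger further, larger floods, which is exactly the repeated-coverage cost the next slide quantifies.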
Limitations of Expanding Ring
[Chart: # nodes visited vs. TTL (2–9) under flooding for the Random, PLRG, Gnutella, and Grid topologies; y-axis 0–12000]
Idea #2: Random Walk
• Simple random walk
– takes too long to find anything!
• Multiple-walker random walk
– N walkers after T steps each visit roughly as
many nodes as 1 walker after N*T steps
– When to terminate the search: walkers check back
with the query originator once every C steps
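A sketch of the multiple-walker scheme, with the check-back modeled as polling the originator every `check_every` steps (all names and defaults are illustrative assumptions, not from the talk):

```python
import random

def k_walker_search(adj, source, has_object, k=16, check_every=4,
                    max_steps=10_000):
    """k random walkers step in lockstep; every check_every steps the
    originator is asked whether any walker has found the object, and if
    so the search terminates."""
    walkers = [source] * k
    for step in range(1, max_steps + 1):
        for i in range(k):
            walkers[i] = random.choice(adj[walkers[i]])
        # Periodic check-back with the query originator.
        if step % check_every == 0 and any(has_object(n) for n in walkers):
            return step
    return None
```

Each step sends only k messages in total, so traffic grows linearly with walk length instead of exponentially with TTL; the check-back interval C trades a little extra walking for far fewer control messages.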
Search Traffic Comparison
[Bar chart: avg. # msgs per node per query on the Random and Gnutella graphs for Flood, Ring, and Walk; values shown: 2.85, 1.863, 0.961, 0.053, 0.027, 0.031]
Search Delay Comparison
[Bar chart: # hops till success on the Random and Gnutella graphs for Flood, Ring, and Walk; values shown: 9.12, 7.3, 4.03, 3.4, 2.51, 2.39]
Lessons Learnt about Search
Methods
• Adaptive termination
• Minimize message duplication
• Small expansion in each step
Flexible Replication
• In unstructured systems, search success is
essentially about coverage: visiting enough nodes
to probabilistically find the object => replication
density matters
• Limited node storage => what’s the optimal
replication density distribution?
– In Gnutella, only nodes that query an object store it =>
ri ∝ pi
– What if we use different replication strategies?
Optimal ri Distribution
• Goal: minimize Σi ( pi / ri ), subject to Σi ri = R
• Calculation:
– introduce Lagrange multiplier λ; find ri and λ
that minimize:
Σi ( pi / ri ) + λ * ( Σi ri − R )
=> λ − pi / ri² = 0 for all i
=> ri ∝ √pi
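The result can be checked numerically: under proportional allocation the cost Σ pi/ri is always exactly n, while square-root allocation does strictly better for any non-uniform popularity distribution. The popularity values below are illustrative, not from the talk.

```python
import math

def expected_cost(p, r):
    """Sum of p_i / r_i, proportional to the expected search size."""
    return sum(pi / ri for pi, ri in zip(p, r))

def normalize(w, total=1.0):
    s = sum(w)
    return [total * x / s for x in w]

# Zipf-like popularities for 5 objects (illustrative numbers).
p = normalize([1.0 / rank for rank in range(1, 6)])

proportional = normalize(p)                          # r_i ∝ p_i
square_root = normalize([math.sqrt(x) for x in p])   # r_i ∝ √p_i

cost_prop = expected_cost(p, proportional)  # p_i / (p_i * R) sums to n = 5
cost_sqrt = expected_cost(p, square_root)   # (Σ √p_i)² < n when p non-uniform
```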
Square-Root Distribution
• General principle: to minimize Σi ( pi / ri )
under the constraint Σi ri = R, make ri
proportional to the square root of pi ( ri ∝ √pi )
• Other application examples:
– Bandwidth allocation to minimize expected
download times
– Server load balancing to minimize expected
request latency
Achieving Square-Root
Distribution
• Suggestions from some heuristics
– Store an object at a number of nodes
proportional to the number of nodes visited in
order to find the object
– Each node uses random replacement
• Two implementations:
– Path replication: store the object along the path of a
successful “walk”
– Random replication: store the object randomly among
nodes visited by the agents
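The two implementations can be sketched as follows; the per-node cache layout, capacity, and helper names are assumptions made for illustration only:

```python
import random

def _cache(store, node, obj, capacity):
    """Insert obj into node's cache, evicting a random older entry if full."""
    cache = store.setdefault(node, [])
    if obj not in cache:
        cache.append(obj)
        if len(cache) > capacity:
            cache.remove(random.choice(cache[:-1]))  # random replacement

def path_replicate(path, obj, store, capacity=10):
    """Path replication: copy obj onto every node of the successful walk."""
    for node in path:
        _cache(store, node, obj, capacity)

def random_replicate(visited, obj, store, copies, capacity=10):
    """Random replication: place the same number of copies at random
    nodes among all those visited by the walkers."""
    for node in random.sample(list(visited), min(copies, len(visited))):
        _cache(store, node, obj, capacity)
```

Both variants create a number of copies tied to how hard the object was to find, which is what pushes the replication density toward the square-root distribution; they differ only in where along the visited set the copies land.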
Evaluation of Replication
Methods
• Metrics
– Overall message traffic
– Search delay
• Dynamic simulation
– Assume Zipf-like object query probability
– 5 queries/sec Poisson arrival
– Results measured during 5000–9000 sec
Distribution of ri
[Log-log chart: normalized replication ratio vs. object rank under path replication; the measured “real result” curve is plotted against the square-root distribution for comparison]
Total Search Message
Comparison
[Bar chart: avg. # msgs per node during 5000–9000 sec for Owner Rep, Path Rep, and Random Rep; y-axis 0–60000]

• Observation: path replication is slightly
inferior to random replication
Search Delay Comparison
[CDF chart from the dynamic simulation (5000–9000 sec): % of queries finished vs. # hops (1–256) for Owner, Path, and Random Replication]
Summary
• Multi-walker random walk scales much
better than flooding
– It won’t scale as perfectly as structured
networks, but current unstructured networks can
be improved significantly
• Square-root replication distribution is
desirable and can be achieved via path
replication
