Graph Analytics
CS194-16 Introduction to Data Science
Joseph E. Gonzalez
Post-doc, AMPLab
jegonzal@cs.berkeley.edu
Vertices
• Users
• Posts / Images
Edges
• Social Relationships
• Directed: Twitter
• Undirected: Facebook
• Likes
Actual Social Graph: Karate Club Network
[Figure: the social network of friendships in a karate club (vertices numbered 1 to 34), which holds clues to the latent schism that eventually split the group into two separate clubs.]
Web Graphs
Wikipedia restricted to
1000 climate change
pages
Call records
Vertices: Users
Directed Edges: Email From → To
User - Item Graphs
(Recommender Systems)
Bipartite Graphs
Vertices: Users and Items
Edges: Ratings
Graphical Models
Vertices: Random Variables, Factors
Edges: Statistical Dependencies
[Figure: LDA topic model example with word vertices Cat, Apple, Growth, Hat, Plant]
Co-Authorship Network
Vertices: Authors
Edges: Co-authorship
Example: Erdős Number
http://academic.research.microsoft.com/VisualExplorer#2952384&1112639
Others?
Common properties of
graphs derived from
natural phenomena
Power-Law Degree Distribution
• More than 10^8 vertices have one neighbor.
• High-degree vertices: the top 1% of vertices are adjacent to …
[Figure: log-log plot of the number of vertices (count) against degree for the AltaVista WebGraph: 1.4B vertices, 6.6B edges]
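A degree distribution like the one described above can be tabulated directly from an edge list. The following is a minimal sketch, assuming an undirected graph stored as a list of vertex pairs; the function name `degree_histogram` and the toy edge list are hypothetical and not taken from the lecture.

```python
from collections import Counter

def degree_histogram(edges):
    """Return {degree: number of vertices with that degree} for an undirected edge list."""
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    # Count how many vertices share each degree value.
    return Counter(degree.values())

# Hypothetical toy graph: a hub (vertex 0) attached to five leaves, plus one extra edge.
edges = [(0, 1), (0, 2), (0, 3), (0, 4), (0, 5), (4, 5)]
hist = degree_histogram(edges)
for d in sorted(hist):
    print(f"degree {d}: {hist[d]} vertices")
```

On a real power-law graph, plotting `hist` on log-log axes would give a roughly straight line; the toy graph here is far too small to show that and only illustrates the bookkeeping.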
Giant Connected Component
Densification
[Figure: two panels, Facebook and US Patent Citations. Axes: Ratio of Edges to Vertices against Year.]
Community Structure
Linked-In Messenger
Graph Algorithms
“Think Globally, Act Locally”
Identifying Leaders
PageRank (Centrality Measures)
Recursive Relationship:
Where:
» α is the random reset probability (typically 0.15)
» L[j] is the number of links on page j
[Figure: example graph with six pages numbered 1 to 6]
http://ilpubs.stanford.edu:8090/422/1/1999-66.pdf
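One common form of the recurrence that matches the definitions of α and L[j] above is R[i] = α + (1 − α) · Σ R[j] / L[j], summed over the pages j that link to i. The sketch below assumes that form; the function name `pagerank`, the fixed iteration count, and the four-page link structure are hypothetical choices for illustration.

```python
def pagerank(out_links, alpha=0.15, iterations=50):
    """Iterate R[i] = alpha + (1 - alpha) * sum over pages j linking to i of R[j] / L[j]."""
    rank = {v: 1.0 for v in out_links}
    for _ in range(iterations):
        incoming = {v: 0.0 for v in out_links}
        for j, targets in out_links.items():
            for i in targets:
                # Page j splits its rank evenly across its L[j] out-links.
                incoming[i] += rank[j] / len(targets)
        rank = {v: alpha + (1 - alpha) * incoming[v] for v in out_links}
    return rank

# Hypothetical 4-page web graph: page -> pages it links to.
links = {1: [2, 3], 2: [3], 3: [1], 4: [3]}
ranks = pagerank(links)
for page in sorted(ranks):
    print(page, round(ranks[page], 3))
```

Each update reads only the ranks of a page's in-neighbors, which is what makes the computation a natural fit for a vertex-centric system. The sketch assumes every page has at least one out-link; dangling pages would need separate handling.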
Predicting Behavior
[Figure: social graph in which a few users are labeled Liberal or Conservative, the remaining users are unlabeled (?), and users are linked to their posts]
Label Propagation
(Structured Prediction)
Social Arithmetic:
  50% What I list on my profile (50% Cameras, 50% Biking)
  40% Sue Ann Likes (80% Cameras, 20% Biking)
+ 10% Carlos Likes (30% Cameras, 70% Biking)
= I Like: 60% Cameras, 40% Biking
Recurrence Algorithm:
$\text{Likes}[i] = \sum_{j \in \text{Friends}[i]} W_{ij} \times \text{Likes}[j]$
http://www.cs.cmu.edu/~zhuxj/pub/CMU-CALD-02-107.pdf
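The social arithmetic example can be checked with one application of the recurrence Likes[i] = Σ W_ij × Likes[j]. The sketch below uses the percentages from the slide; the function name `combine_likes` and the choice to treat the profile as one more weighted neighbor are assumptions made for illustration.

```python
def combine_likes(weights, likes):
    """Weighted average of neighbor interest distributions: sum_j W_ij * Likes[j]."""
    topics = {t for dist in likes.values() for t in dist}
    return {
        t: sum(w * likes[j].get(t, 0.0) for j, w in weights.items())
        for t in topics
    }

# Interest distributions from the slide.
likes = {
    "profile": {"Cameras": 0.5, "Biking": 0.5},
    "Sue Ann": {"Cameras": 0.8, "Biking": 0.2},
    "Carlos": {"Cameras": 0.3, "Biking": 0.7},
}
# Edge weights W_ij from the slide: 50% profile, 40% Sue Ann, 10% Carlos.
weights = {"profile": 0.5, "Sue Ann": 0.4, "Carlos": 0.1}

me = combine_likes(weights, likes)
print({t: round(p, 2) for t, p in me.items()})
```

Cameras works out to 0.5 × 0.5 + 0.4 × 0.8 + 0.1 × 0.3 = 0.60 and Biking to 0.40, matching the slide. A full label propagation run would repeat this update for every unlabeled vertex until the values stop changing.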
Recommending Products
[Figure: bipartite graph with Users on one side, Items on the other, and Ratings as edges]
Recommending Products
Low-Rank Matrix Factorization:
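[Figure: the Netflix ratings matrix (Users by Movies) approximated by the product of user factors f(i) and movie factors f(j). In the equivalent bipartite graph, user vertices f(1) and f(2) connect to movie vertices f(3), f(4), f(5) through rating edges r13, r14, r24, r25.]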
Iterate:
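One standard way to carry out the iteration is alternating least squares (ALS): hold the movie factors fixed and solve for each user factor, then swap roles. The sketch below assumes that method, uses a single latent factor per vertex to keep the algebra scalar, and invents the rating values; only the edge pattern r13, r14, r24, r25 comes from the slide.

```python
def als_rank1(ratings, iterations=20, reg=0.1):
    """Alternating least squares with one scalar latent factor per vertex.

    Assumes user ids and item ids do not overlap, so one dict can hold both.
    """
    users = {u for u, _ in ratings}
    items = {m for _, m in ratings}
    f = {v: 1.0 for v in users | items}

    def solve(neighbors):
        # Closed-form ridge regression for one scalar factor given its neighbors.
        num = sum(r * f[n] for n, r in neighbors)
        den = sum(f[n] ** 2 for n, _ in neighbors) + reg
        return num / den

    for _ in range(iterations):
        for u in users:
            f[u] = solve([(m, r) for (uu, m), r in ratings.items() if uu == u])
        for m in items:
            f[m] = solve([(u, r) for (u, mm), r in ratings.items() if mm == m])
    return f

def squared_error(ratings, f):
    """Sum of squared differences between ratings and predicted products."""
    return sum((r - f[u] * f[m]) ** 2 for (u, m), r in ratings.items())

# Edges r13, r14, r24, r25 as on the slide; the rating values are hypothetical.
ratings = {(1, 3): 5.0, (1, 4): 4.0, (2, 4): 2.0, (2, 5): 1.0}
baseline = squared_error(ratings, {v: 1.0 for v in (1, 2, 3, 4, 5)})
factors = als_rank1(ratings)
print("error before:", baseline, "after:", round(squared_error(ratings, factors), 4))
```

Each factor update depends only on the ratings and factors of adjacent vertices, so the same computation can be phrased as a vertex program on the bipartite graph.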
Finding Communities
Count triangles passing through each
vertex:
[Figure: example graph with each vertex annotated by the number of triangles passing through it, followed by a worked example on vertices A to G]
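Counting the triangles through each vertex can be sketched as follows: for every vertex, count the pairs of its neighbors that are themselves connected. This is a minimal single-machine sketch, assuming an undirected graph without self-loops or duplicate edges; the function name and the four-vertex example are hypothetical.

```python
from itertools import combinations

def triangles_per_vertex(edges):
    """Return {vertex: number of triangles that pass through it}."""
    nbrs = {}
    for u, v in edges:
        nbrs.setdefault(u, set()).add(v)
        nbrs.setdefault(v, set()).add(u)
    counts = {}
    for v, ns in nbrs.items():
        # A triangle through v is a pair of v's neighbors joined by an edge.
        counts[v] = sum(1 for a, b in combinations(ns, 2) if b in nbrs[a])
    return counts

# Hypothetical graph: triangle 1-2-3 with a pendant vertex 4 attached to 3.
edges = [(1, 2), (1, 3), (2, 3), (3, 4)]
counts = triangles_per_vertex(edges)
print(counts)
```

Vertices with many triangles relative to their degree sit in tightly knit neighborhoods, which is why the count is useful as a community signal.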
Connected Components
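Every vertex starts out with a unique component id (typically its vertex id):
[Figure: three snapshots of a six-vertex graph. The component ids start at 1, 2, 3 and 5, 4, 6, become 1, 1, 2 and 4, 4, 4, and end at 1, 1, 1 and 4, 4, 4.]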
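The propagation can be sketched as repeated rounds in which every vertex adopts the smallest component id among itself and its neighbors, stopping when nothing changes. The update rule is the usual one for this algorithm and is an assumption here, since the slide states only the initialization; the function name and the two-path example graph are hypothetical.

```python
def connected_components(vertices, edges):
    """Synchronous min-id propagation; returns final labels and the per-round history."""
    nbrs = {v: set() for v in vertices}
    for u, v in edges:
        nbrs[u].add(v)
        nbrs[v].add(u)
    component = {v: v for v in vertices}  # every vertex starts with its own id
    history = [dict(component)]
    while True:
        # Each vertex looks only at the previous round's ids of its neighbors.
        updated = {
            v: min([component[v]] + [component[n] for n in nbrs[v]])
            for v in vertices
        }
        if updated == component:
            return component, history
        component = updated
        history.append(dict(component))

# Hypothetical graph: two paths, 1-2-3 and 5-4-6.
vertices = [1, 2, 3, 4, 5, 6]
edges = [(1, 2), (2, 3), (5, 4), (4, 6)]
labels, history = connected_components(vertices, edges)
for step, snapshot in enumerate(history):
    print(step, snapshot)
```

The number of rounds grows with the diameter of the largest component, since an id travels one hop per round.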
Putting it All Together
[Figure: analytics pipeline. Raw Wikipedia XML → Text Table (Title, Body). Text Table → Hyperlinks → PageRank → Top 20 Pages (Title, PR). Text Table → Term-Doc Graph → Topic Model (LDA) → Word Topics (Word, Topic).]
Model / Alg.
State
Fundamental Pattern
Graph-Parallel Systems
Barrier
Graph-Parallel Systems
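Program This, Run on This
[Figure: a vertex program written for a single vertex Y runs on a vertex that is split across Machines 1 to 4. Gather: each machine computes a partial sum (Σ1, Σ2, Σ3, Σ4) and the partial sums are added together. Apply: Y is updated to Y′ and the new value is copied to the mirrors. Scatter: runs on each machine.]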
2D Partitioning
[Figure: the adjacency matrix (vertices by vertices) split into a 4 × 4 grid of blocks, one block per machine, numbered 1 to 16]
16 Machines
vi only has neighbors on 7 machines
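The count of 7 machines follows from the grid layout: with 16 machines arranged 4 × 4, all edges leaving a vertex fall in one row of blocks and all edges entering it fall in one column, and the row and column share one block, so 4 + 4 − 1 = 7. The sketch below assumes edges are assigned by taking vertex ids modulo the grid size, which is a hypothetical stand-in for whatever hash function a real system uses.

```python
GRID = 4  # 4 x 4 grid of blocks = 16 machines, as on the slide

def machine_for_edge(src, dst, grid=GRID):
    """Map an edge to a machine number in 1..grid*grid by (source row, destination column)."""
    row = src % grid  # assumed hash: vertex id modulo grid size
    col = dst % grid
    return row * grid + col + 1

def machines_touching(v, edges, grid=GRID):
    """Set of machines that store at least one edge incident to v."""
    return {machine_for_edge(s, d, grid) for s, d in edges if s == v or d == v}

# Worst case for one vertex: v is connected to every other vertex in both directions.
n = 64
v = 5
edges = [(v, d) for d in range(n) if d != v] + [(s, v) for s in range(n) if s != v]
used = machines_touching(v, edges)
print(sorted(used), len(used))
```

Even for a vertex adjacent to the entire graph, the number of machines involved is bounded by 2 × 4 − 1, which is the property that makes this heuristic attractive for high-degree vertices.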
Triangle Counting on Twitter
40M Users, 1.4 Billion Links
jegonzal@eecs.berkeley.edu
Graph Analytics Pipeline
[Figure: the same pipeline, with each stage marked as a table or a graph. Tables: Raw Wikipedia XML, Text Table (Title, Body), Top 20 Pages (Title, PR), Word Topics (Word, Topic). Graphs: Hyperlinks with PageRank, Term-Doc Graph with Topic Model (LDA).]
Separate Systems
Dataflow Systems
[Figure: a Table of Rows is processed into a Result]
Separate Systems
Dataflow Systems: Table
Graph Systems: Dependency Graph
[Figure: a Table of Rows processed into a Result, shown beside a dependency graph]
Separate systems for each view can be difficult to use and inefficient
Difficult to Program and Use
Users must learn, deploy, and manage multiple systems
Distributed Graphs → Horizontally Partitioned Tables
Vertex Programs → Dataflow Operators
Optimizations: Advances in Graph Processing Systems → Distributed Join Optimization, Materialized View Maintenance
View a Graph as a Table
Property Graph
Vertex Property Table

Id        Property (V)
Rxin      (Stu., Berk.)
Jegonzal  (PstDoc, Berk.)
Franklin  (Prof., Berk.)
Istoica   (Prof., Berk.)
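Viewing a graph as tables can be sketched with two plain collections: a vertex table keyed by id and an edge table of (source, destination, property) rows. Joining the edge table against the vertex table on both endpoints yields one row per edge with both vertex properties attached. The vertex rows below come from the slide; the edge rows and the function name `triplets` are hypothetical.

```python
# Vertex property table from the slide: id -> property.
vertices = {
    "rxin": ("Stu.", "Berk."),
    "jegonzal": ("PstDoc", "Berk."),
    "franklin": ("Prof.", "Berk."),
    "istoica": ("Prof.", "Berk."),
}

# Hypothetical edge property table (source id, destination id, property).
edges = [
    ("rxin", "jegonzal", "Friend"),
    ("franklin", "rxin", "Advisor"),
    ("istoica", "franklin", "Coworker"),
    ("franklin", "jegonzal", "PI"),
]

def triplets(vertices, edges):
    """Join the edge table with the vertex table on both endpoints."""
    return [
        (src, vertices[src], attr, dst, vertices[dst])
        for src, dst, attr in edges
    ]

for row in triplets(vertices, edges):
    print(row)
```

Because the result is an ordinary table, the usual relational operators (filter, map, group by) apply to it without any graph-specific machinery.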
The GraphX Stack
(Lines of Code)
• PageRank (20)
• Connected Comp. (20)
• K-core (60)
• Triangle Count (50)
• LDA (220)
• SVD++ (110)
Pregel API (34)
GraphX (2,500)
Spark (30,000)
Enhanced Pregel in GraphX
Require Message Combiners: the vertex program receives a combined messageSum in place of messageList
pregelPR(i, messageList):
    // Receive all the messages
    total = 0
    foreach (msg in messageList):
        total = total + msg
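With a message combiner, the summation loop moves out of the vertex program: messages bound for the same vertex are merged as they are produced, and the vertex program receives a single combined value. The sketch below is a toy single-machine model of that idea, not the GraphX API; the names `send_msg`, `combine_msg`, and `vertex_program`, the rank formula with α = 0.15, and the three-page graph are all assumptions for illustration.

```python
def pregel_pagerank(out_links, alpha=0.15, iterations=30):
    """Toy Pregel-style PageRank in which messages are combined before delivery."""
    rank = {v: 1.0 for v in out_links}

    def send_msg(src, dst):
        # Message computation lives on the edge, outside the vertex program.
        return rank[src] / len(out_links[src])

    def combine_msg(a, b):
        # Commutative, associative combiner: replaces the explicit message list.
        return a + b

    def vertex_program(i, message_sum):
        # Receives one pre-combined value, so no loop over messages is needed.
        return alpha + (1 - alpha) * message_sum

    for _ in range(iterations):
        inbox = {}
        for src, targets in out_links.items():
            for dst in targets:
                msg = send_msg(src, dst)
                inbox[dst] = combine_msg(inbox[dst], msg) if dst in inbox else msg
        # Simplification: every vertex runs each superstep, with 0.0 if it got no messages.
        rank = {i: vertex_program(i, inbox.get(i, 0.0)) for i in out_links}
    return rank

# Hypothetical 3-page graph: page -> pages it links to.
links = {1: [2, 3], 2: [3], 3: [1]}
ranks = pregel_pagerank(links)
print({page: round(r, 3) for page, r in ranks.items()})
```

Because the combiner is commutative and associative, a distributed runtime is free to apply it partially on each machine before sending anything over the network.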
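2D Vertex Cut Heuristic
[Figure: a property graph on vertices A to F stored as tables. The edge table is split into Part. 1 (A→B, A→C, B→C, C→D) and Part. 2 (A→E, A→F, E→D, E→F), and a table records which partitions each vertex appears in.]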
Caching for Iterative mrTriplets
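[Figure: the Vertex Table (RDD) holding A to F is joined to the Edge Table (RDD) through a Mirror Cache on each edge partition. The first cache holds A, B, C, D for edges A→B, A→C, B→C, C→D; the second holds A, D, E, F for edges A→E, A→F, E→D, E→F.]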
Incremental Updates for Iterative mrTriplets
[Figure: same layout of Vertex Table (RDD), Mirror Caches, and Edge Table (RDD). Only the changed vertices (A and E) are sent to the mirror caches, and the edge partitions are scanned for edges touching them.]
Aggregation for Iterative mrTriplets
[Figure: same layout of Vertex Table (RDD), Mirror Caches, and Edge Table (RDD). Each edge partition is scanned and performs a Local Aggregate, and the aggregated changes for vertices A to F are sent back to the vertex table.]
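The caching and aggregation steps can be sketched on one machine: each edge partition receives a mirror copy of just the vertex values it references, applies a map function to every edge, pre-aggregates the results locally, and only then are the per-partition results merged. This is a toy model and not the GraphX implementation; the function `mr_triplets`, its signature, and the vertex values are hypothetical, while the two edge partitions follow the figure.

```python
def mr_triplets(vertex_table, edge_partitions, map_fn, reduce_fn):
    """Map over edges with both endpoint values, aggregating locally per partition first."""
    partials = []
    for edges in edge_partitions:
        # Mirror cache: copy only the vertex values this partition references.
        needed = {v for src, dst in edges for v in (src, dst)}
        mirror = {v: vertex_table[v] for v in needed}
        local = {}
        for src, dst in edges:
            for target, msg in map_fn(src, mirror[src], dst, mirror[dst]):
                # Local aggregate: combine messages before they leave the partition.
                local[target] = reduce_fn(local[target], msg) if target in local else msg
        partials.append(local)
    # Final merge of the per-partition aggregates.
    result = {}
    for local in partials:
        for target, msg in local.items():
            result[target] = reduce_fn(result[target], msg) if target in result else msg
    return result

# Hypothetical vertex values; edge partitions as in the figure.
vertex_table = {v: 1.0 for v in "ABCDEF"}
edge_partitions = [
    [("A", "B"), ("A", "C"), ("B", "C"), ("C", "D")],
    [("A", "E"), ("A", "F"), ("E", "D"), ("E", "F")],
]

# Example: in-degree, by sending the value 1 to each edge's destination.
in_degree = mr_triplets(
    vertex_table,
    edge_partitions,
    map_fn=lambda src, src_val, dst, dst_val: [(dst, 1)],
    reduce_fn=lambda a, b: a + b,
)
print(in_degree)
```

Vertex D receives one message from each partition, so local aggregation does not help it here, but C and F each have two incoming edges in a single partition and send only one pre-summed value to the merge step.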
Performance Comparisons
Live-Journal: 69 Million Edges
• Mahout/Hadoop: 1340
• Naïve Spark: 354
• Giraph: 207
• GraphX: 68
• GraphLab: 22
[Figure: horizontal bar chart, axis from 0 to 1600]
• GraphX: 451
• GraphLab: 203
About Scala
High-level language for the Java VM
» Object-oriented + functional programming
Statically typed
» Comparable in speed to Java
» But often no need to write types due to type inference
Scala: println(lst(5))
Java: System.out.println(lst.get(5));
Quick Tour
Processing collections with functional programming:
[Callout: Function expression (closure)]
val list = List(1, 2, 3)