F14Lec12graphs PDF

You might also like

Download as pdf or txt
Download as pdf or txt
You are on page 1of 85

Introduction to

Graph Analytics
CS194-16 Introduction to Data Science

Joseph E. Gonzalez
Post-doc, AMPLab
jegonzal@cs.berkeley.edu

*These slides are best viewed in PowerPoint with anima


Outline
1. Graph structured data
2. Common properties of graph data
3. Graph algorithms
4. Systems for large-scale graph
computation
5. GraphX: Graph Computation in Spark
6. Summary of other graph frameworks
Graph structured data is
everywhere …
Social Network

Vertices
• Users
• Posts / Images

Edges
• Social Relationships
• Directed: Twitter
• Undirected: Facebook
• Likes
Actual Social Graph CHAPTER 1. OVERVIE

27 23
15

10 20

16 4
31 13
11
30 34 14
6

1 12 17
9
21 33
7
29 3
18 5

22
19 2
28
25
8
24
32

26

Karate
e 1.7: From the social network Club Network
of friendships in t he karat e club from Figure 1.1,
nd clues to t he latent schism t hat event ually split the group int o two separat e clu
Web Graphs
Wikipedia restricted to
1000 climate change
pages

• Vertices: Web-pages Generated Content:


• Edges: Links (Directed) • Click-streams
Web Graphs
2004 Political Blogs

• Vertices: Web-pages Generated Content:


• Edges: Links (Directed) • Click-streams
Semantic Networks
Organize Knowledge
Vertices: Subject,
Object
Edges: Predicates
Example:
Google Knowledge Graph
• 570M Vertices
• 18B Edges
http://wiki.dbpedia.org
Transaction Networks
Supply Chain:
Vertices: Suppliers/Consumers
Edges: Exchange of Goods

Transaction Networks (e.g., Bitcoin):


Vertices: Users
Edges: Exchange of Currency
http://anonymity-in-bitcoin.blogspot.com/2011/07/bitcoin-is-not-
Transaction Networks
Supply Chain:
Vertices: Suppliers/Consumers
Edges: Exchange of Goods

Transaction Networks (e.g., Bitcoin):


Vertices: Users
Edges: Exchange of Currency
Biological Networks
Protein-Protein Interaction Networks (Interactomes)
Vertices: Proteins
Edges: Interactions
Biological Networks
Regulatory Networks
(Bipartite)
Vertices: Regulators, targets
Edges: Regulates target
Communication Networks

Email

Call records

Vertices: Devices, Routers


Directed Edges: Network Flows
Who Talks to Whom
Graph
Enron Email Graphs

Vertices: Users
Directed Edges: Email FromTo
User - Item Graphs
(Recommender Systems)

Bipartite Graphs
Vertices: Users and Items
Edges: Ratings
Graphical Models
Vertices: Random Variables, Factors
Edges: Statistical Dependencies

Cat
Apple
Growth
Hat
LDA Plant
Co-Authorship Network
Vertices: Authors
Edges: Co-authorship

Example: Erdos
Number

http://academic.research.microsoft.com/VisualExplorer#2952384&1112639
Others?
Common properties of
graphs derived from
natural phenomena
Power-Law Degree
10
10 Distribution
More than 108 vertices
have one neighbor.
of Vertices

8
10

TopHigh-Degree
1% of vertices are
6
10 adjacent to
Vertices
Numbercount

50% of the edges!


4
10

2
10
AltaVista WebGraph
0 1.4B Vertices, 6.6B Edges
10 0 2 4 6 8
10 10 10 10 10
Degree
degree 20
Giant Connected
Component
Densification
Facebook US Patent Citations
200
Ratio of Edges to Vertices

180
160
140
120
100
80
60
2008 2010 2012
Year

Average distance between nodes reduces over time.

22
Community Structure

Linked-In Messenger
Graph Algorithms
“Think Globally, Act Locally”
Identifying Leaders

25
PageRank (Centrality
Measures)
Recursive Relationship:

Where:
» α is the random reset probability (typically 0.15)
» L[j] is the number of links on page j
1 2 3

4 5 6
http://ilpubs.stanford.edu:8090/422/1/1999-66.pdf
Predicting Behavior
? ?

Liberal ? ? Conservative
?
?
? ?

?
? Post

Post
?
?
Post Post

? Post Post
? Post

Post
Post Post ?
?
Post
? ?
? ? ?
Post Post ?
Post
Post Post
? ? ? ?
? ?
? ?
27
Label Propagation
(Structured Prediction)
Sue Ann
Social Arithmetic: 40%
80% Cameras
20% Biking
50% What I list on my profile
40% Sue Ann Likes
+ 10% Carlos Like
50%
Profile
50% Cameras

Me
I Like: 60% Cameras, 40% Biking
50% Biking

Recurrence Algorithm:
Carlos
Likes[i] = å Wij ´ Likes[ j] 10% 30% Cameras
jÎFriends[i] 70% Biking

» iterate until convergence

http://www.cs.cmu.edu/~zhuxj/pub/CMU-CALD-02-107.pdf
Recommending Products
Users Ratings Item
s
Recommending Products
Low-Rank Matrix Factorization:
f(3)
r13
Movie f(1)

User Factors (U)

Movie Factors (M)


x
User

≈ User
Netflix s f(j) r14
s

f(4)
s f(i) r24
f(2)
Movie r25 f(5)
s

Iterate:

31
Finding Communities
Count triangles passing through each
vertex:
2 3
1
4

Measure “cohesiveness” of local community


2 * #Triangles[i]
ClusterCoeff[i] =
Deg[i] * (Deg[i] – 1)
Counting Triangles
Count triangles passing through each vertex
by counting triangles on each edge:

E B
C
A
1 C
C C
A B D
2 D D
E
D G
F F

G
Connected Components
Every vertex starts out with a unique
component id (typically it’s vertex id):

1 5 1 4 1 4

2 4 1 4 1 4

3 6 2 4 1 4
Putting it All Together
Hyperlinks PageRank Top 20 Page
Title PR
Raw Text
Wikipedia Table
Title Body
<</ />> Term-Doc Topic Model
</>
XML
Graph (LDA) Word Topics
Word Topic

Discussion Community User Community


Table Editor Graph Detection Community Topic
User Disc. User Com. Topic Com.
Many Other Graph Algorithms
• Collaborative Filtering – CoEM
– Alternating Least Squares • Community Detection
– Stochastic Gradient – Triangle-Counting
Descent – K-core Decomposition
– Tensor Factorization – K-Truss
• Structured Prediction • Graph Analytics
– Loopy Belief Propagation – PageRank
– Max-Product Linear – Personalized PageRank
Programs
– Shortest Path
– Gibbs Sampling
– Graph Coloring
• Semi-supervised ML • Classification
– Graph SSL – Neural Networks
36
The Graph-Parallel Pattern

Model / Alg.
State

Fundamental Pattern

37
Graph-Parallel Systems

Expose specialized APIs to simplify


graph programming.
38
The Vertex Program Abstraction
Vertex-Programs interact by sending messages.
Pregel_PageRank(i, messages) :
i
// Receive all the messages
total = 0
foreach( msg in messages) :
total = total + msg

// Update the rank of this vertex


R[i] = 0.15 + total

// Send new messages to neighbors


foreach(j in out_neighbors[i]) :
Send msg(R[i]) to vertex j

Malewicz et al. [PODC’09, SIGMOD’10] 39


Iterative Bulk Synchronous
Execution
Compute Communicate

Barrier
Graph-Parallel Systems

Exploit graph structure to achieve


orders-of-magnitude performance gains
over more general data-parallel
systems. 41
Graph System
Optimizations
Specialized Vertex-Cuts Remote
Data-Structures Partitioning Caching / Mirroring

Message Active Set Tracking


Combiners

42
Program Run on This
This Machine 1 Machine 2

Split High-Degree vertices


New Abstraction  Equivalence on Split 43
GAS Decomposition
Machine 1 Machine 2
Master

Gather Y’
Y’Y’
Y’

Σ1 + Σ
+ + Σ2 Mirror
Apply Y

Σ3 Σ4
Scatter
Mirror
Mirror

Machine 3 Machine 4 44
vi 2D Partitioning
Vertices

1 2 3 4
16 Machines
vi
5 6 7 8 vi only has
Vertices

Adj. neighbors on

9
Matrix
10 11 12
7 machines

13 14 15 16
45
Triangle Counting on Twitter
40M Users, 1.4 Billion Links

Counted: 34.8 Billion


Triangles
Hadoop 1536 Machines
[WWW’11] 423 Minutes
64 Machines 1000 x
15 Seconds Faster
50
S. Suri and S. Vassilvitskii, “Counting triangles and the curse of the last reducer,” WWW’11
Break!
http://tinyurl.com/ampgraphx

jegonzal@eecs.berkeley.edu
Graph Analytics Pipeline
Hyperlinks PageRank Top 20 Page
Title PR
Raw Text
Wikipedia Table
Title Body
<</ />> Term-Doc Topic Model
</>
XML
Graph (LDA) Word Topics
Word Topic

Discussion Community User Community


Table Editor Graph Detection Community Topic
User Disc. User Com. Topic Com.
Tables
Hyperlinks PageRank Top 20 Page
Title PR
Raw Text
Wikipedia Table
Title Body
<</ />> Term-Doc Topic Model
</>
XML
Graph (LDA) Word Topics
Word Topic

Discussion Community User Community


Table Editor Graph Detection Community Topic
User Disc. User Com. Topic Com.
Graphs
Hyperlinks PageRank Top 20 Page
Title PR
Raw Text
Wikipedia Table
Title Body
<</ />> Term-Doc Topic Model
</>
XML
Graph (LDA) Word Topics
Word Topic

Discussion Community User Community


Table Editor Graph Detection Community Topic
User Disc. User Com. Topic Com.
Separate Systems

Tables Graphs
Separate Systems
Dataflow Systems
Graphs
Table
Row

Row
Resul
t
Row

Row
Separate Systems
Dataflow Systems Graph Systems

Table Dependency
Row Graph

Row
Resul
t
Row

Row
Separate systems
for each view can be
difficult to use and
inefficient

58
Difficult to Program and Use
Users must Learn, Deploy, and
Manage multiple systems

Leads to brittle and often


complex interfaces
59
Inefficient
Extensive data movement and duplication across
the network and file system

<</ />>
</>
XML

HDFS HDFS HDFS HDFS

Limited reuse internal data-structures


across stages
60
The GraphX Unified Approach
New API New System
Blurs the distinction Combines Data-Parallel
between Tables and Graph-Parallel Systems
Graphs

Enabling users to easily and efficiently


express the entire graph analytics
pipeline
Representation
Join

Dataflow
Distributed Horizontally Vertex Operators
Graphs Partitioned Tables Programs

Optimizations
Advances in Graph Processing Systems
Distributed Join
Optimization
Materialized View
Maintenance
View a Graph as a Table
Vertex Property Table
Property Graph Id Property (V)
Rxin (Stu., Berk.)
R F Jegonzal (PstDoc, Berk.)
Franklin (Prof., Berk)
Istoica (Prof., Berk)

Edge Property Table


SrcId DstId Property (E)
J I rxin jegonzal Friend
franklin rxin Advisor
istoica franklin Coworker
franklin jegonzal PI
Spark Table Operators
Table (RDD) operators are inherited from
Spark:
map reduce sample
filter count take
groupBy fold first
sort reduceByKey partitionBy
union groupByKey mapWith
join cogroup pipe
leftOuterJoin cross save
rightOuterJoin zip ...
64
Graph Operators (Scala)
class Graph [ V, E ] {
def Graph(vertices: Table[ (Id, V) ],
edges: Table[ (Id, Id, E) ])
// Table Views -----------------
def vertices: Table[ (Id, V) ]
def edges: Table[ (Id, Id, E) ]
def triplets: Table [ ((Id, V), (Id, V), E) ]
// Transformations ------------------------------
def reverse: Graph[V, E]
def subgraph(pV: (Id, V) => Boolean,
pE: Edge[V,E] => Boolean): Graph[V,E]
def mapV(m: (Id, V) => T ): Graph[T,E]
def mapE(m: Edge[V,E] => T ): Graph[V,T]
// Joins ----------------------------------------
def joinV(tbl: Table [(Id, T)]): Graph[(V, T), E ]
def joinE(tbl: Table [(Id, Id, T)]): Graph[V, (E, T)]
// Computation ----------------------------------
def mrTriplets(mapF: (Edge[V,E]) => List[(Id, T)],
reduceF: (T, T) => T): Graph[T, E]
65
}
Graph Operators (Scala)
class Graph [ V, E ] {
def Graph(vertices: Table[ (Id, V) ],
edges: Table[ (Id, Id, E) ])
// Table Views -----------------
def vertices: Table[ (Id, V) ]
def edges: Table[ (Id, Id, E) ]
def triplets: Table [ ((Id, V), (Id, V), E) ]
// Transformations ------------------------------
def reverse: Graph[V, E]
def subgraph(pV: (Id, V) => Boolean,
pE: Edge[V,E] => Boolean): Graph[V,E]
def mapV(m: (Id, V) => T ): Graph[T,E]
capture the Gather-Scatter pattern from
def mapE(m: Edge[V,E] => T ): Graph[V,T]
// Joins ----------------------------------------
specialized graph-processing systems
def joinV(tbl: Table [(Id, T)]): Graph[(V, T), E ]
def joinE(tbl: Table [(Id, Id, T)]): Graph[V, (E, T)]
// Computation ----------------------------------
def mrTriplets(mapF: (Edge[V,E]) => List[(Id, T)],
reduceF: (T, T) => T): Graph[T, E]
66
}
Triplets Join Vertices and
Edges
The triplets operator joins vertices and
edges:

Vertices Triplets Edges


A A B
B A C A C
C B C B C
D C D C D
Map-Reduce Triplets
Map-Reduce triplets collects information
about the neighborhood of each vertex:
Src. or Dst.

MapFunction(A B )  (B, ) Reduce


(B, )
MapFunction(A C )  (C, )
(C, + )
MapFunction(B C )  (C, )
(D, )
MapFunction(C D )  (D, ) Message
Combiners
Using these basic GraphX operators
we implemented Pregel and GraphLab
in under 50 lines of code!

69
The GraphX Stack
(Lines of Code)
PageRank Connected K-core Triangl
(20) Comp. (20) (60) e LDA SVD++
Count (220) (110)
Pregel API (34) (50)

GraphX (2,500)
Spark (30,000)

Some algorithms are more naturally expressed


using the GraphX primitive operators
We express enhanced Pregel and
GraphLab
abstractions using the GraphX operators
in less than 50 lines of code!

71
Enhanced Pregel in GraphX
Require Message
pregelPR(i, messageList ):
messageSum Combiners
// Receive all the messages
total = 0 messageSum
foreach( msg in messageList) :
total = total + msg

// Update the rank of this vertex


R[i] = 0.15 + total
combineMsg(a, b): Remove Message
// Compute
// Send summessages
new of two messages
to neighbors
sendMsg(ij,
return R[i], R[j], E[i,j]):
a + b in out_neighbors[i]) :
Computation
foreach(j
// Compute single message from the
Send
return msg(R[i]/E[i,j]) to vertex
msg(R[i]/E[i,j]) Vertex Program

Malewicz et al. [PODC’09, SIGMOD’10] 72


GraphX System Design
Distributed Graphs as Tables
(RDDs)
Vertex Routing Edge
Property Graph Table Table Table
(RDD) (RDD) (RDD)B
Part. 1 A
A A 1 2
B C A C

B B 1 B C

C D
A
A D
D C C 1
2D Vertex
A Cut Heuristic
D
A E
D D 1 2
A F
E E 2
F E E D
Part. 2 F F 2 E F
Caching for Iterative mrTriplets
Vertex Edge Table
Table (RDD)
(RDD) Mirror A B
Cache
A
A
A A C
B
B
B B C
C
D C D
C
C
Mirror
Cache
A E
D
D
A A F
E
E D
E E D

FF F E F
Incremental Updates for Iterative
mrTriplets
Vertex Edge Table
Table (RDD)
(RDD) Mirror A B
Cache
Change A
A A C
B
B B C
C
D C D
C
Mirror
Cache
A E
D
A A F

Scan
Change E D
E E D

F F E F
Aggregation for Iterative mrTriplets
Vertex Edge Table
Table (RDD)
(RDD) Mirror A B
Cache
Change A
A A C
Local
B
Change B Aggregate B C
C
D C D
Change C
Mirror
Cache
A E
Change D
A A F

Scan
Local
Change E D
Aggregate
E E D
Change F F E F
Performance Comparisons
Live-Journal: 69 Million Edges
Mahout/Hadoop 1340
Naïve Spark 354
Giraph 207
GraphX 68
GraphLab 22
0 200 400 600 800 1000 1200 1400 1600

Runtime (in seconds, PageRank for 10 iterations)

GraphX is roughly 3x slower than GraphLab


GraphX scales to larger
graphs
Twitter Graph: 1.5 Billion Edges
Giraph 749

GraphX 451

GraphLab 203

0 200 400 600 800

Runtime (in seconds, PageRank for 10 iterations)

GraphX is roughly 2x slower than GraphLab


» Scala + Java overhead: Lambdas, GC time, …
» No shared memory parallelism: 2x increase in comm.
PageRank is just one
stage….

What about a pipeline?


A Small Pipeline in GraphX
Raw Wikipedia Hyperlinks PageRank Top 20 Pages
<</ />>
</>
XML HDFS HDFS

Spark Preprocess Compute Spark Post.


Spark 1492
Giraph + Spark 605
GraphX 342
GraphLab + Spark 375
0 200 400 600 800 1000 1200 1400 1600
Total Runtime (in Seconds)

Timed end-to-end GraphX is faster than


Open Source Project
Alpha release since Spark 0.9

Contributors? Python Bindings?


Graph Processing Systems
• Apache Giraph: java Pregel
implementation
• GraphLab.org: C++ GraphLab
implementation
• NetworkX: python API for small gaphs
• GraphLab Create: commercial GraphLab
python framework for large graphs and
ML
• Gephi: graph visualization framework
Graph Database
Technologies
Property graph data-model for storing and
retrieving graph structured data.
• Neo4j: popular commercial graph
database
• Titan: open-source distributed graph
database
Break!
http://tinyurl.com/ampgraphx

jegonzal@eecs.berkeley.edu
About Scala
High-level language for the Java VM
» Object-oriented + functional programming

Statically typed
» Comparable in speed to Java
» But often no need to write types due to type
inference

Interoperates with Java


» Can use any Java class, inherit from it, etc; can
also call Scala code from Java
Quick Tour
Declaring variables: Java equivalent:
var x: Int = 7 int x = 7;
var x = 7 // type inferred

val y = “hi” // read-only final String y = “hi”;

Functions: Java equivalent:


def square(x: Int): Int = x*x int square(int x) {
return x*x;
def min(a:Int, b:Int): Int = { }
if (a < b) a else b
}

def announce(text: String) { void announce(String text) {


println(text) System.out.println(text);
} }
Quick Tour
Generic types: Java equivalent:
var arr = new Array[Int](8) int[] arr = new int[8];

var lst = List(1, 2, 3) List<Integer> lst =


// type of lst is List[Int] new ArrayList<Integer>();
lst.add(...)

Indexing: Java equivalent:


arr(5) = 7 arr[5] = 7;

println(lst(5)) System.out.println(lst.get(5));
Quick Tour
Processing collections with functional
programming: Function expression (closure)
val list = List(1, 2, 3)

list.foreach(x => println(x)) // prints 1, 2, 3


list.foreach(println) // same

list.map(x => x + 2) // => List(3, 4, 5)


list.map(_ + 2) // same, with placeholder notation

list.filter(x => x % 2 == 1) // => List(1, 3)


list.filter(_ % 2 == 1) // => List(1, 3)

list.reduce((x, y) => x + y) // => 6


list.reduce(_ + _) // => 6
All of these leave the list unchanged (List is immutable)
Other Collection Methods
Scala collections provide many other
functional methods; for example, Google for
“Scala Seq”
Method on Seq[T] Explanation
map(f: T => U): Seq[U] Pass each element through f
flatMap(f: T => Seq[U]): Seq[U] One-to-many map
filter(f: T => Boolean): Seq[T] Keep elements passing f
exists(f: T => Boolean): Boolean True if one element passes
forall(f: T => Boolean): Boolean True if all elements pass
reduce(f: (T, T) => T): T Merge elements using f
groupBy(f: T => K): Map[K,List[T]] Group elements by f(element)
sortBy(f: T => K): Seq[T] Sort elements by f(element)
. . .

You might also like