Professional Documents
Culture Documents
SPARK
SPARK
CS240A
Winter 2016. T Yang
>>>numset=set([1, 2, 3, 2])
Duplicated entries are deleted
>>>numset=frozenset([1, 2,3])
Such a set cannot be modified
Python map/reduce
a = [1, 2, 3]
b = [4, 5, 6, 7]
c = [8, 9, 1, 2, 3]
f= lambda x: len(x)
L = map(f, [a, b, c])
[3, 4, 5]
RDD
RDD
RDD
RDD
Interactive Shells
Scala •Python & Scala
val lines = sc.textFile(...)
lines.filter(x => x.contains(“ERROR”)).count()
Performance
•Java & Scala are faster
Java due to static typing
JavaRDD<String> lines = sc.textFile(...); •…but Python is often fine
lines.filter(new Function<String, Boolean>() {
Boolean call(String s) {
return s.contains(“error”);
}
}).count();
Spark Context and Creating RDDs
“to” (to, 1)
(be, 1)(be,1) (be,2)
“to be or” “be” (be, 1)
(not, 1) (not, 1)
“or” (or, 1)
> visits.join(pageNames)
# (“index.html”, (“1.2.3.4”, “Home”))
# (“index.html”, (“1.3.3.1”, “Home”))
# (“about.html”, (“3.4.5.6”, “About”))
> visits.cogroup(pageNames)
# (“index.html”, ([“1.2.3.4”, “1.3.3.1”], [“Home”]))
# (“about.html”, ([“3.4.5.6”], [“About”]))
Under The Hood: DAG Scheduler
• General task A: B:
graphs
• Automatically F:
Stage 1 groupBy
pipelines
functions C: D: E:
• Data locality
aware join
• Partitioning
aware Stage 2 map filter Stage 3
to avoid shuffles
= RDD = cached partition
Setting the Level of Parallelism
> words.reduceByKey(lambda x, y: x + y, 5)
> words.groupByKey(5)
> visits.join(pageViews, 5)
More RDD Operators
import sys
from pyspark import SparkContext
if __name__ == "__main__":
sc = SparkContext( “local”, “WordCount”, sys.argv[0],
None)
lines = sc.textFile(sys.argv[1])
counts.saveAsTextFile(sys.argv[2])
Create a SparkContext
import org.apache.spark.SparkContext
Scala
import org.apache.spark.SparkContext._
http://<Standalone Master>:8080
(by default)
EXAMPLE APPLICATION:
PAGERANK
Google PageRank
Image: en.wikipedia.org/wiki/File:PageRank-hi-res-2.png
PageRank (one definition)
Iterate until
Source of Image: Lin 2008
convergence
Algorithm demo
1.0
1.0 1.0
1.0
Algorithm
1.0
1 0.5
1
1.0 0.5 1.0
0.5
0.5
1.0
Algorithm
1.85
0.58 1.0
0.58
Algorithm
1.85
0.58 0.5
1.85
0.58 0.29 1.0
0.5
0.29
0.58
Algorithm
1.31
0.39 1.72
...
0.58
Algorithm
0.46 1.37
0.73
HW: SimplePageRank
Random New
To/From 0 1 2 3
Factor Weight
0 0.05 0.283 0.0 0.283 0.10 0.716
1 0.425 0.05 0.0 0.283 0.10 0.858
2 0.425 0.283 0.05 0.283 0.10 1.141
3 0.00 0.283 0.85 0.05 0.10 1.283
Data structure in SimplePageRank