Professional Documents
Culture Documents
01 Intro PDF
01 Intro PDF
¡ Predictive methods
§ Use some variables to predict unknown
or future values of other variables
§ Example: Recommender systems
Locality Filtering
PageRank, Recommen
sensitive data SVM
SimRank der systems
hashing streams
Dimensional Duplicate
Spam Queries on Perceptron,
ity document
reduction Detection streams kNN detection
¡ Additional details/instructions at
http://cs246.stanford.edu
9/25/19 Jure Leskovec, Stanford CS246: Mining Massive Datasets, http://cs246.stanford.edu 31
9/25/19 Jure Leskovec, Stanford CS246: Mining Massive Datasets, http://cs246.stanford.edu 33
¡ Large-scale computing for data mining
problems on commodity hardware
¡ Challenges:
§ How do you distribute computation?
§ How can we make it easy to write distributed
programs?
§ Machines fail:
§ One server may stay up 3 years (1,000 days)
§ If you have 1,000 servers, expect to lose 1/day
§ With 1M machines 1,000 machines fail every day!
C0 C1 D0 C1 C2 C5 C0 C5
C5 C2 C5 C3 D0 D1 … D0 C2
Chunk server 1 Chunk server 2 Chunk server 3 Chunk server N
Group by key:
Collect all pairs with
same key
(Hash merge, Shuffle,
Sort, Partition)
Reduce:
Collect all values
belonging to the
key and output
Input Output
Mappers Reducers
data
The crew of the space
reads
shuttle Endeavor recently
(The, 1) (crew, 1)
read the
returned to Earth as (crew, 1) (crew, 1)
ambassadors, harbingers of (crew, 2)
a new era of space (of, 1) (space, 1)
sequential
exploration. Scientists at
(space, 1)
(the, 1) (the, 1)
NASA are saying that the (the, 3)
Sequentially
recent assembly of the (space, 1) (the, 1)
Dextre bot is the first step in (shuttle, 1)
a long-term space-based (shuttle, 1) (the, 1)
man/mache partnership.
(recently, 1)
(Endeavor, 1) (shuttle, 1)
'"The work we're doing now …
Only
-- the robotics we're doing - (recently, 1) (recently, 1)
- is what we're going to
need …………………….. …. …
Big document (key, value) (key, value) (key, value)
9/25/19 Jure Leskovec, Stanford CS246: Mining Massive Datasets, http://cs246.stanford.edu 45
map(key, value):
# key: document name; value: text of the document
for each word w in value:
emit(w, 1)
reduce(key, values):
# key: a word; value: an iterator over counts
result = 0
for each count v in values:
result += v
emit(key, result)
F:
Stage 1 groupBy
C: D: E:
join = RDD
¡ Other examples:
§ Link analysis and graph processing
§ Machine Learning algorithms
A B B C A C
a1 b1
⋈
b2 c1 a3 c1
a2
a3
b1
b2
b2 c2 = a3 c2
b3 c3 a4 c3
a4 b3
S
R