* change the `master` parameter to control how the computation is distributed (e.g. `local[*]` for all local cores, or a cluster manager URL)
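For example, the master can be picked on the `spark-submit` command line (the script name here is hypothetical):

```shell
# Run locally on 4 cores
spark-submit --master "local[4]" app.py

# Run against a standalone cluster (hypothetical host)
spark-submit --master spark://master-host:7077 app.py
```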
* Spark provides a web UI for monitoring job status, by default at http://localhost:4040 on the driver
* sc.stop(): stop the current SparkContext; needed before creating a new one, since only one SparkContext can be active at a time
* collect(): an action that forces the lazy transformations to run (the file is actually read at this point) and returns all elements of the RDD to the driver
* cache(): keeps the RDD in RAM only; persist() can use both RAM and disk (e.g. MEMORY_AND_DISK), spilling to disk when the data doesn't fit in memory
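As a rough analogy in plain Python (not the Spark API): a lazy pipeline does the work only when a result is demanded, and caching memoizes that result so the work isn't repeated:

```python
# Plain-Python sketch of lazy evaluation with an optional cache,
# an analogy for RDD laziness, collect(), and cache() -- not the Spark API.

class LazyPipeline:
    def __init__(self, source_fn):
        self.source_fn = source_fn   # e.g. "read the file"
        self.transforms = []         # recorded transformations
        self.use_cache = False
        self.cached = None

    def map(self, fn):
        self.transforms.append(fn)   # recorded, not executed (lazy)
        return self

    def cache(self):
        self.use_cache = True
        return self

    def collect(self):
        # Work happens only here (the "action").
        if self.cached is not None:
            return self.cached
        data = self.source_fn()
        for fn in self.transforms:
            data = [fn(x) for x in data]
        if self.use_cache:
            self.cached = data
        return data

reads = []
def read_file():
    reads.append(1)                  # count how often the source is read
    return [1, 2, 3]

rdd = LazyPipeline(read_file).map(lambda x: x * 10).cache()
assert reads == []                   # nothing has run yet (lazy)
print(rdd.collect())                 # first action reads the source
print(rdd.collect())                 # second action hits the cache
print(len(reads))                    # source was read only once
```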
* map --> shuffle (the framework groups values by key) --> reduceByKey
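The map → shuffle → reduce flow above can be sketched in plain Python (a simulation of the stages for a word count, not Spark itself):

```python
from collections import defaultdict

words = ["spark", "map", "spark", "reduce", "map", "spark"]

# map stage: emit (key, value) pairs
pairs = [(w, 1) for w in words]

# shuffle stage: the framework groups values by key
groups = defaultdict(list)
for key, value in pairs:
    groups[key].append(value)

# reduce stage: reduceByKey(add) applies the function per key
counts = {key: sum(values) for key, values in groups.items()}
print(counts)   # {'spark': 3, 'map': 2, 'reduce': 1}
```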
* aggregate(zeroValue, seqOp, combOp): seqOp runs within each partition; combOp merges the per-partition results across workers
* reduceByKey vs groupByKey: reduceByKey combines values within each partition before the shuffle, so much less data is sent over the network than with groupByKey
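aggregate's two functions can be simulated over explicit partitions (plain Python; the two lists stand in for data living on two workers):

```python
from functools import reduce

# Two "partitions", as if the data lived on two workers.
partitions = [[1, 2, 3], [4, 5]]

zero = (0, 0)                                      # (running sum, count)
seq_op = lambda acc, x: (acc[0] + x, acc[1] + 1)   # within one partition
comb_op = lambda a, b: (a[0] + b[0], a[1] + b[1])  # across partitions

# seqOp runs inside each partition...
partials = [reduce(seq_op, part, zero) for part in partitions]
# ...and combOp merges the per-partition results on the driver.
total = reduce(comb_op, partials, zero)
print(total)          # (15, 5): sum and count, enough to compute a mean
```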
* accumulators updated inside transformations can be wrong under faults: if a machine crashes and the task is re-run, its accumulator updates may be applied more than once; exactly-once semantics are only guaranteed inside actions
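That failure mode can be simulated in plain Python (a sketch of task retry, not the Spark API): if a task's work is re-executed after a crash, any shared-counter update inside it runs again:

```python
accumulator = 0

def run_task(partition, fail_first_time, attempts):
    """Process one partition, bumping a shared counter per element."""
    global accumulator
    attempts[0] += 1
    for x in partition:
        accumulator += 1          # update inside the "transformation"
    if fail_first_time and attempts[0] == 1:
        raise RuntimeError("worker crashed after doing some work")

partition = [10, 20, 30]
attempts = [0]
try:
    run_task(partition, fail_first_time=True, attempts=attempts)
except RuntimeError:
    # The framework retries the failed task on another worker...
    run_task(partition, fail_first_time=True, attempts=attempts)

print(accumulator)    # 6, not 3: the retry double-counted the partition
```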