High Performance Computing Using Apache Spark
● General-purpose: one engine for batch, streaming, SQL, and machine-learning workloads.
● Fast: keeps working data in memory across operations instead of writing to disk between steps.
● APIs: Scala, Java, Python, R, and SQL.
● Libraries: Spark SQL, MLlib, GraphX, and Structured Streaming.
Spark essentials
● SparkSession:
○ the main entry point to all Spark functionality.
● SparkContext:
○ connects to a cluster manager;
○ acquires executors;
○ sends app code to executors;
○ sends tasks for the executors to run.
● RDD (Resilient Distributed Dataset):
○ an immutable, fault-tolerant collection of elements that can be operated on in parallel.
● RDD operations:
○ transformations;
○ actions.
● Transformations:
○ produce new RDDs;
○ lazy, not executed until an action is performed.
● The laziness of transformations allows Spark to boost performance by optimizing how a sequence
of transformations is executed at runtime.
● Actions:
○ trigger execution of the pending transformations and return a non-RDD value to the driver (e.g. collect, count, reduce) or write results to storage (e.g. saveAsTextFile).