
High Performance Computing

using Apache Spark

Eliezer Beczi
December 7, 2020
Introduction
● More data means more computational challenges.

● A single machine can no longer handle today's data sizes.

● Computation therefore needs to be distributed across multiple nodes.


PySpark

Why Apache Spark?


● Open-source.

● General-purpose.

● Fast.

● APIs in Scala, Java, Python, and R.

● Built-in libraries (Spark SQL, MLlib, GraphX, Spark Streaming).
Spark essentials
● SparkSession:
○ the main entry point to all Spark functionality.

● SparkContext:
○ connects to a cluster manager;
○ acquires executors;
○ sends app code to executors;
○ sends tasks for the executors to run.
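
As a minimal sketch (not from the slides), a SparkSession can be created in PySpark like this; the application name and the local master URL are placeholder choices:

```python
from pyspark.sql import SparkSession

# Build (or reuse) a SparkSession -- the entry point to Spark functionality.
spark = (
    SparkSession.builder
    .appName("hpc-demo")      # hypothetical application name
    .master("local[*]")       # placeholder: run locally on all available cores
    .getOrCreate()
)

# The underlying SparkContext is what talks to the cluster manager,
# acquires executors, and ships application code and tasks to them.
sc = spark.sparkContext
```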
Spark essentials
● RDD (Resilient Distributed Datasets):
○ immutable and fault-tolerant collection of elements that can be operated on in parallel.

● RDD operations:
○ transformations;
○ actions.
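
A rough illustration of the two kinds of RDD operations, assuming the `sc` SparkContext from the previous sketch:

```python
# Create an RDD from a local Python collection.
numbers = sc.parallelize([1, 2, 3, 4, 5])

# Transformation: produces a new RDD (nothing is computed yet).
squares = numbers.map(lambda x: x * x)

# Action: triggers computation and returns a plain Python value.
total = squares.reduce(lambda a, b: a + b)
print(total)  # 55
```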
Spark essentials
● Transformations:
○ produce new RDDs;
○ lazy, not executed until an action is performed.

● The laziness of transformations allows Spark to boost performance by optimizing how a sequence of transformations is executed at runtime.
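
A small sketch of this laziness; the file path is a placeholder and `sc` is assumed from earlier:

```python
# These transformations only build up a lineage of operations.
lines = sc.textFile("data.txt")                # placeholder path
errors = lines.filter(lambda l: "ERROR" in l)  # still lazy
codes = errors.map(lambda l: l.split()[0])     # still lazy

# Only the action below makes Spark plan and execute the whole chain,
# which is where it can optimize the sequence of transformations.
print(codes.count())
```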
Spark essentials
● Actions:
○ return non-RDD objects.

● Spark builds on the MapReduce processing model: data is transformed with map-style operations and aggregated with reduce-style operations, as sketched below.
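
A classic illustration of the MapReduce pattern on RDDs is word count; this is a sketch with a placeholder input path, not code from the slides:

```python
# "map" side: split lines into words and emit (word, 1) pairs.
pairs = (
    sc.textFile("input.txt")                   # placeholder path
      .flatMap(lambda line: line.split())
      .map(lambda word: (word, 1))
)

# "reduce" side: sum the counts per word, then collect a sample (action).
counts = pairs.reduceByKey(lambda a, b: a + b)
print(counts.take(10))
```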


Spark SQL
● DataFrames:
○ like RDDs, an immutable and fault-tolerant collection of elements that can be operated on in parallel.

● DataFrames are organized into named columns.

● Conceptually equivalent to a table in a relational database.
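
A minimal sketch of a DataFrame with named columns, built from made-up local data and the `spark` session from earlier:

```python
# Two rows, two named columns -- conceptually a tiny relational table.
df = spark.createDataFrame(
    [("Alice", 34), ("Bob", 45)],
    ["name", "age"],
)
df.printSchema()
df.show()
```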


Spark SQL
● DataFrames can be easily queried using SQL operations.

● Spark allows queries to be run directly on DataFrames, similar to how transformations are performed on RDDs, as sketched below.
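
For example, continuing the hypothetical `df` from the previous sketch (the view name "people" is made up):

```python
# Register the DataFrame as a temporary SQL view and query it with plain SQL.
df.createOrReplaceTempView("people")
adults = spark.sql("SELECT name FROM people WHERE age > 40")

# The same query expressed through DataFrame transformations.
adults_df = df.filter(df.age > 40).select("name")

adults.show()
adults_df.show()
```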
Thank you for your attention!
