Spark Introduction


SPARK INTRODUCTION

Adam Aulia Rahmadi


• Apache Spark is an open-source distributed general-purpose
cluster-computing framework. Spark provides an interface for programming
entire clusters with implicit data parallelism and fault tolerance. Originally
developed at the University of California, Berkeley's AMPLab, the Spark
codebase was later donated to the Apache Software Foundation, which has
maintained it since.
• Source: Wikipedia
DATA TYPES
DATAFRAME VS RDD

• RDD
a = sc.parallelize([(1,), (2,), (3,), (4,)])

• DataFrame
df = a.toDF(['a'])
START SPARK ENGINE

• Basic configuration
LOAD DATA

• From csv
LOAD DATA

• Convert from pandas


SHOW DATA
SPARK UI

While a session is running, the Spark UI is available at http://localhost:4040
SCHEMA AND COLUMNS

• Rename columns
DATAFRAME EXPLORATION

• select
• filter
• count
• Fill null column
• Null values

DATAFRAME EXPLORATION

• filter

• aggregation

• join

• pivot
USER DEFINED FUNCTION (UDF)
TEMPORARY TABLE
EXPORT RESULT

• Save to CSV

• Save to Parquet

• Result folder: a CSV folder and a Parquet folder


TURN OFF SPARK ENGINE

• spark.stop()
