
Spark Best Practices

USE SPARK DATAFRAME APIS
INSTEAD OF SPARK SQL

# Preferred: DataFrame API
sqlContext.table("table_schema.table_name").filter(col("prod_cd").isin("", ""))

# Avoid: embedded SQL string
spark.sql("select * from table_schema.table_name where prod_cd in ('', '', '')")

Rationale -----> The DataFrame API code is more readable, testable, and reusable. Embedding SQL strings in code encourages copy/paste practices.
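One benefit of the API style is that transformation logic can be wrapped in small, unit-testable functions. A minimal sketch, where the table name, column, and product codes are only illustrative:

from pyspark.sql import functions as F

def filter_products(df, prod_codes):
    # Reusable, unit-testable filter step
    return df.filter(F.col("prod_cd").isin(prod_codes))

prod_df = filter_products(sqlContext.table("table_schema.table_name"), ["A1", "B2"])
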
AVOID UNNECESSARY ACTION
COMMANDS

# Preferred: no intermediate actions
df.filter(...)
# count and show actions have been removed from the production code
df.groupBy(...)

# Avoid: unnecessary actions in production code
df.filter(...)
df.count()
df.groupBy(...)
df.show()

Rationale -----> Unnecessary action commands such as count and show force the Spark DAG to be executed again. They are not required in production code; fully tested code does not need these intermediate checks. Generating a final summary for QC checks is a different matter, though.
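If a row count really is needed for that final QC summary, one option (not from the original post) is to cache the result before the actions, so the DAG is not recomputed. A minimal sketch with an illustrative output path:

from pyspark.sql import functions as F

result_df = df.filter(F.col("prod_cd").isNotNull()).groupBy("prod_cd").count()

result_df.cache()                               # keep the result in memory
result_df.write.parquet("/tmp/output_summary")  # illustrative path; the first job populates the cache
qc_row_count = result_df.count()                # reuses the cached data instead of re-running the DAG
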
STORE INTERMEDIATE TABLES IN
TMP DATABASE

df.createOrReplaceTempView("temp_table_name")  # registerTempTable is deprecated in newer Spark versions

Rationale -----> We do not need the intermediate tables after the code has executed successfully in production. Once the code has been developed and tested completely, the intermediate tables should be created in tmp databases, so that they can be cleaned up automatically and the storage cost stays optimal.
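A minimal sketch of persisting an intermediate result into a tmp database rather than a permanent schema; the tmp_db database and table name are illustrative:

# Write the intermediate result into a temporary database (illustrative names)
df.write.mode("overwrite").saveAsTable("tmp_db.intermediate_result")

# Downstream steps can read it back within the same pipeline
intermediate_df = spark.table("tmp_db.intermediate_result")
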
USE PARQUET FILES TO WRITE IN
HDFS

df.write.parquet("path+filename")

Rationale -----> Parquet files store data in a column-oriented fashion and provide good compression. They also carry a self-contained schema, which makes them very easy to port to other systems. Column-based storage provides excellent performance benefits for analytics/OLAP workloads.
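A minimal sketch of a typical Parquet write with snappy compression and partitioning; the path and partition column are illustrative:

# Write with snappy compression, partitioned by a date column (illustrative path and column)
df.write.mode("overwrite") \
    .option("compression", "snappy") \
    .partitionBy("load_date") \
    .parquet("/data/output/table_name")

# The schema travels with the files, so reading back needs no extra metadata
parquet_df = spark.read.parquet("/data/output/table_name")
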
FILTER DATA AS EARLY AS POSSIBLE

# Preferred: filter before the join
df.filter(...)
df.join(df2, ...)
df.groupBy(...)

# Avoid: filtering after the join and aggregation
df.join(df2, ...)
df.groupBy(...)
df.filter(...)

Rationale -----> The idea is to reduce shuffling and memory usage as much as possible. If you filter after a join, the records that are about to be filtered out have already participated in the shuffle and consumed memory in the application. You should also order your joins so that the record count is reduced as early as possible.
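A minimal sketch of pushing the filter ahead of the join so fewer rows are shuffled; the tables, columns, and join key are illustrative:

from pyspark.sql import functions as F

orders = spark.table("db.orders")          # illustrative tables
customers = spark.table("db.customers")

# Filter before joining, so only the relevant rows participate in the shuffle
recent_orders = orders.filter(F.col("order_date") >= "2024-01-01")
joined = recent_orders.join(customers, on="customer_id", how="inner")
summary = joined.groupBy("customer_id").agg(F.count("*").alias("order_cnt"))
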
ONLY SELECT WHAT YOU NEED

# Preferred: select only the required columns
sqlContext.table("db.table").select(["col1", "col2"])

# Avoid: reading every column
sqlContext.table("db.table")

Rationale -----> Selecting only the columns you need helps reduce the volume of shuffled data and the overall memory usage. This directly translates into better performance.
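The same idea applies ahead of aggregations and joins: trim the columns as soon as possible. A minimal sketch with illustrative column names:

from pyspark.sql import functions as F

# Keep only the columns the downstream logic actually needs (illustrative names)
slim_df = sqlContext.table("db.table").select("customer_id", "prod_cd", "amount")
totals = slim_df.groupBy("prod_cd").agg(F.sum("amount").alias("total_amount"))
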
TEST THE SYNTACTIC AND SEMANTIC
CORRECTNESS OF YOUR CODE ON
SAMPLE DATA

Rationale -----> Unless you are doing machine learning training, it is always possible to test your application logic on sample data. Spend time getting a representative sample of your data so you can test your code with more confidence.
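A minimal sketch of running the pipeline against a small, reproducible sample; the sample fraction and the transform_pipeline function are hypothetical placeholders:

# Take a small, reproducible sample to exercise the application logic
sample_df = spark.table("db.table").sample(fraction=0.01, seed=42)

result = transform_pipeline(sample_df)   # hypothetical function holding the application logic
result.show()
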
PERFORM QUALITY CHECKS

Rationale -----> Big data systems are often not ACID compliant and therefore lack built-in integrity checks; you should perform integrity checks manually wherever needed, e.g. duplicate checks, null checks, etc.
Thanks for reading. If you find this post helpful, follow me for more such content.

Rahul Yadav
