
Spark Best Practices

USE SPARK DATAFRAME APIS
INSTEAD OF SPARK SQL

# Preferred: DataFrame API
sqlContext.table("table_schema.table_name").filter(col("prod_cd").isin("", ""))

# Avoid: embedded SQL string
spark.sql("select * from table_schema.table_name where prod_cd in ('', '', '')")

Rationale -----> The DataFrame API code is more readable, testable, and reusable. Embedding SQL strings in code encourages copy/paste practices.
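One benefit of the API style is that transformation logic can be wrapped in small, unit-testable functions. A minimal sketch, where the table name, column, and product codes are only illustrative:

from pyspark.sql import functions as F

def filter_products(df, prod_codes):
    # Reusable, unit-testable filter step
    return df.filter(F.col("prod_cd").isin(prod_codes))

prod_df = filter_products(sqlContext.table("table_schema.table_name"), ["A1", "B2"])
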
AVOID UNNECESSARY ACTION
COMMANDS

# Preferred: no intermediate actions
df.filter(...)
# count and show actions have been removed from the production code
df.groupBy(...)

# Avoid: unnecessary actions in production code
df.filter(...)
df.count()
df.groupBy(...)
df.show()

Rationale -----> Unnecessary action commands such as count and show force the Spark DAG to be executed again. They are not required in production code; fully tested code does not need these intermediate checks. Generating a final summary for QC checks is a different matter, though.
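If a row count really is needed for that final QC summary, one option (not from the original post) is to cache the result before the actions, so the DAG is not recomputed. A minimal sketch with an illustrative output path:

from pyspark.sql import functions as F

result_df = df.filter(F.col("prod_cd").isNotNull()).groupBy("prod_cd").count()

result_df.cache()                               # keep the result in memory
result_df.write.parquet("/tmp/output_summary")  # illustrative path; the first job populates the cache
qc_row_count = result_df.count()                # reuses the cached data instead of re-running the DAG
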
STORE INTERMEDIATE TABLES IN
TMP DATABASE

df.createOrReplaceTempView("temp_table_name")  # registerTempTable is deprecated in newer Spark versions

Rationale -----> We do not need the intermediate tables after the code has executed successfully in production. Once the code has been developed and tested completely, the intermediate tables should be created in tmp databases, so that they can be cleaned up automatically and the storage cost stays optimal.
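A minimal sketch of persisting an intermediate result into a tmp database rather than a permanent schema; the tmp_db database and table name are illustrative:

# Write the intermediate result into a temporary database (illustrative names)
df.write.mode("overwrite").saveAsTable("tmp_db.intermediate_result")

# Downstream steps can read it back within the same pipeline
intermediate_df = spark.table("tmp_db.intermediate_result")
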
USE PARQUET FILES TO WRITE IN
HDFS

df.write.parquet("path+filename")

Rationale -----> Parquet files store data in a column-oriented fashion and provide good compression. They also carry a self-contained schema, which makes them very easy to port to other systems. Column-based storage provides excellent performance benefits for analytics/OLAP workloads.
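A minimal sketch of a typical Parquet write with snappy compression and partitioning; the path and partition column are illustrative:

# Write with snappy compression, partitioned by a date column (illustrative path and column)
df.write.mode("overwrite") \
    .option("compression", "snappy") \
    .partitionBy("load_date") \
    .parquet("/data/output/table_name")

# The schema travels with the files, so reading back needs no extra metadata
parquet_df = spark.read.parquet("/data/output/table_name")
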
FILTER DATA AS EARLY AS POSSIBLE

# Preferred: filter before the join
df.filter(...)
df.join(df2, ...)
df.groupBy(...)

# Avoid: filtering after the join and aggregation
df.join(df2, ...)
df.groupBy(...)
df.filter(...)

Rationale -----> The idea is to reduce shuffling and memory usage as much as possible. If you filter after a join, the records that are about to be filtered out have already participated in the shuffle and consumed memory in the application. You should also order your joins so that the record count is reduced as early as possible.
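A minimal sketch of pushing the filter ahead of the join so fewer rows are shuffled; the tables, columns, and join key are illustrative:

from pyspark.sql import functions as F

orders = spark.table("db.orders")          # illustrative tables
customers = spark.table("db.customers")

# Filter before joining, so only the relevant rows participate in the shuffle
recent_orders = orders.filter(F.col("order_date") >= "2024-01-01")
joined = recent_orders.join(customers, on="customer_id", how="inner")
summary = joined.groupBy("customer_id").agg(F.count("*").alias("order_cnt"))
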
ONLY SELECT WHAT YOU NEED

# Preferred: select only the required columns
sqlContext.table("db.table").select(["col1", "col2"])

# Avoid: reading every column
sqlContext.table("db.table")

Rationale -----> Selecting only the columns you need helps reduce the volume of shuffled data and the overall memory usage. This directly translates into better performance.
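The same idea applies ahead of aggregations and joins: trim the columns as soon as possible. A minimal sketch with illustrative column names:

from pyspark.sql import functions as F

# Keep only the columns the downstream logic actually needs (illustrative names)
slim_df = sqlContext.table("db.table").select("customer_id", "prod_cd", "amount")
totals = slim_df.groupBy("prod_cd").agg(F.sum("amount").alias("total_amount"))
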
TEST THE SYNTACTIC AND SEMANTIC
CORRECTNESS OF YOUR CODE ON
SAMPLE DATA

Rationale -----> Unless you are doing machine learning training, it is always possible to test your application logic on sample data. Spend time getting a representative sample of your data so you can test your code with more confidence.
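A minimal sketch of running the pipeline against a small, reproducible sample; the sample fraction and the transform_pipeline function are hypothetical placeholders:

# Take a small, reproducible sample to exercise the application logic
sample_df = spark.table("db.table").sample(fraction=0.01, seed=42)

result = transform_pipeline(sample_df)   # hypothetical function holding the application logic
result.show()
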
PERFORM QUALITY CHECKS

Rationale -----> Big data systems are often not ACID compliant and therefore lack built-in integrity checks; you should perform integrity checks manually wherever needed, e.g. duplicate checks, null checks, etc.
Thanks for reading. If you find this post helpful, follow me for more such content.

Rahul Yadav
