
Spark vs Polars. Real-life Test Case. - by Daniel Beach https://dataengineeringcentral.substack.com/p/spark-vs-polars-real-life-t...

Spark vs Polars. Real-life Test Case.


A Brave New World.
DANIEL BEACH
NOV 09, 2023


Data Engineering Central is a reader-supported publication. To receive
new posts and support my work, consider becoming a free or paid
subscriber.

You should check out Prefect, the sponsor of the newsletter this week!
Prefect is a workflow orchestration tool that gives you observability
across all of your data pipelines. Deploy your Python code in minutes with
Prefect Cloud.


Let me spin you a tale and a story.


But then we look back and find ourselves in a spot we might not like that
much.

Testing if Polars is up to the task.


What we need to find out is … will Polars step up to the plate?


Spark Pipeline.


23/11/04 21:20:14 INFO CodecPool: Got brand-new compressor [.snappy]


23/11/04 21:20:26 ERROR Utils: Aborting task


java.lang.OutOfMemoryError: Java heap space

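from pyspark.sql import SparkSession

# Local session on all available cores with 12g of executor memory.
spark = SparkSession.builder.master("local[*]") \
    .config("spark.executor.memory", "12g") \
    .getOrCreate()

# Local session limited to 4 cores, with driver memory set explicitly.
spark = SparkSession.builder.master("local[4]") \
    .config("spark.executor.memory", "12g") \
    .config("spark.driver.memory", "2g") \
    .getOrCreate()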


Polars code.


Polars Results.

The Strange Part.


Time to run polars pipeline : 0:00:32.158017

>>> df = spark.read.parquet("hard_drive_failure_metrics")
>>> df.show()
+----------+----+-----+---+--------------------+--------+
|      date|year|month|day|               model|failures|
+----------+----+-----+---+--------------------+--------+
|2023-03-23|2023|    3| 23|       ST14000NM0018|       0|
|2023-06-16|2023|    6| 16|HGST HUH721212ALE600|       0|
|2023-04-23|2023|    4| 23|TOSHIBA MG08ACA16TEY|       0|
|2023-05-12|2023|    5| 12|      ST1000LM024 HN|       0|
|2022-10-19|2022|   10| 19|       ST12000NM0008|       2|
|2021-08-03|2021|    8|  3|HGST HUH721212ALE600|       0|
|2023-01-19|2023|    1| 19|HGST HMS5C4040BLE640|       1|
|2023-02-28|2023|    2| 28|HGST HUS728T8TALE6L4|       0|
|2023-01-26|2023|    1| 26|HGST HUS728T8TALE6L4|       0|
|2023-01-01|2023|    1|  1|HGST HUH721212ALE600|       0|
|2022-11-09|2022|   11|  9|Seagate BarraCuda...|       0|
|2023-01-25|2023|    1| 25|HGST HUH721212ALE600|       0|
|2023-02-27|2023|    2| 27|       ST16000NM005G|       0|
|2022-11-02|2022|   11|  2|TOSHIBA MG08ACA16TEY|       1|
|2023-03-06|2023|    3|  6| WDC WUH721816ALE6L4|       0|
|2022-10-18|2022|   10| 18|       ST10000NM001G|       0|
|2023-04-07|2023|    4|  7|Seagate BarraCuda...|       0|
|2023-05-26|2023|    5| 26|TOSHIBA MG07ACA14TEY|       0|
|2023-04-16|2023|    4| 16|      WDC WD5000LPVX|       0|
|2021-08-20|2021|    8| 20|HGST HMS5C4040BLE640|       0|
+----------+----+-----+---+--------------------+--------+
only showing top 20 rows

import polars as pl


def read_csvs(path: str) -> pl.LazyFrame:
    # Lazily scan the CSV input; nothing is read until the query runs.
    lazy_df = pl.scan_csv(path)
    return lazy_df


def write_parquets(lz: pl.LazyFrame, path: str) -> None:
    # Stream the query result to Parquet without collecting it in memory.
    lz.sink_parquet(path,
                    compression='snappy',
                    )


lz = read_csvs(read_path)
lz = lz.select(["date", "model", "failure"])
write_parquets(lz, write_path)

Time to run polars pipeline : 0:00:34.843670
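
The timing lines in this post have the shape of a Python timedelta printed as a string. A minimal sketch of how such a line could be produced, assuming the pipeline is wrapped in a function; timed and run_pipeline are hypothetical names used only for illustration, and the placeholder body stands in for the real read, select, and write steps.

```python
from datetime import datetime
from typing import Callable


def timed(label: str, fn: Callable[[], None]) -> str:
    """Run fn and return a line such as 'Time to run polars pipeline : 0:00:32.158017'."""
    start = datetime.now()
    fn()
    # The difference is a timedelta; str() renders it as H:MM:SS with a
    # fractional part whenever the microseconds are non-zero.
    elapsed = datetime.now() - start
    return f"{label} : {elapsed}"


def run_pipeline() -> None:
    # Placeholder workload standing in for the real pipeline steps.
    sum(range(100_000))


line = timed("Time to run polars pipeline", run_pipeline)
print(line)
```

Wall-clock timing like this includes process warm-up and disk caching effects, so repeated runs are worth comparing before reading much into a difference of a few seconds.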

>>> df = spark.read.parquet("parquets")
>>> df.show()
+----------+--------------------+-------+
|      date|               model|failure|
+----------+--------------------+-------+
|2023-01-01|HGST HDS5C4040ALE630|      0|
|2023-01-01|HGST HMS5C4040BLE640|      0|
|2023-01-01|HGST HMS5C4040BLE640|      0|
|2023-01-01|HGST HMS5C4040BLE640|      0|
|2023-01-01|HGST HMS5C4040BLE640|      0|
|2023-01-01|HGST HMS5C4040BLE640|      0|
|2023-01-01|HGST HMS5C4040BLE640|      0|
|2023-01-01|HGST HMS5C4040BLE640|      0|
|2023-01-01|HGST HMS5C4040BLE640|      0|
|2023-01-01|HGST HMS5C4040BLE640|      0|
|2023-01-01|HGST HMS5C4040BLE640|      0|
|2023-01-01|HGST HMS5C4040BLE640|      0|
|2023-01-01|HGST HMS5C4040BLE640|      0|
|2023-01-01|HGST HMS5C4040BLE640|      0|
|2023-01-01|HGST HMS5C4040BLE640|      0|
|2023-01-01|HGST HMS5C4040BLE640|      0|
|2023-01-01|HGST HMS5C4040BLE640|      0|
|2023-01-01|HGST HMS5C4040BLE640|      0|
|2023-01-01|HGST HMS5C4040BLE640|      0|
|2023-01-01|HGST HMS5C4040BLE640|      0|
+----------+--------------------+-------+
only showing top 20 rows

>>> df.count()
82181678

Time to write raw data to Data Lake: 0:01:23.223409


What should we take away?



2 Comments


Chris Guecia Nov 9, 2023

Great write up!

I too hit that wall of sink_parquet not being able to write partitions.

Regardless, polars is very fun to use. Coming from spark and pandas I really like how familiar
it feels.

Edgar Nov 9, 2023

Hi Daniel,

Might be worth rerunning this following this guide:

https://luminousmen.com/post/how-to-speed-up-spark-jobs-on-small-test-datasets

BR

E
