
5/6/24, 12:31 PM (25) Polars Vs Pandas: Benchmarking performances and beyond | LinkedIn


Polars Vs Pandas: Benchmarking performances and beyond
Machine Learning Reply GmbH

January 18, 2024


by Arlind Avdullahi

Introduction

If you have ever done any kind of experimenting in data
science, you must have heard of Pandas. To quote the
corresponding GitHub documentation, Pandas is a “Flexible
and powerful data analysis / manipulation library for
Python, providing labeled data structures similar to R
data.frame objects, statistical functions, and much more”.
The library is widely popular in the data science community
because it offers a lot of relevant functionality, is both
easy and intuitive, and is reasonably fast.

However, a new package was recently published and
promises highly competitive performance. This package is
called Polars and it calls itself a “Blazingly Fast DataFrame
Library”. It claims to be nothing less than “one of the best
performing solutions available”.

https://www.linkedin.com/pulse/polars-vs-pandas-benchmarking-performances-beyond-l6svf/ 1/12

In this article, we intend to challenge this statement by
comparing Pandas and Polars in a scenario close to our
data science use cases. For this, we will use a large data set
from Kaggle.

In the first section of this article, we will review the
performance and efficiency of the most essential
functionalities in data science such as reading, filtering,
etc. We will also assess how difficult it is to transfer prior
knowledge of Pandas syntax to Polars. The syntax of Pandas
is familiar and easy to use for most data scientists, thus it is
important to understand if the Polars syntax entails
additional complexity.

After covering this side-by-side comparison, we will dive
deeper into the Polars world and investigate the promising
Lazy API of Polars.

Finally, we will complete our study with a more qualitative
comparison, considering the resources available, such as
documentation, community support, available extensions
and integrations. These aspects tend to be overlooked in
regular benchmarking, because they cannot be quantified.
However, they are key for developers, who all
know the immense value of communities such as Stack
Overflow.

Before diving in, let us take a brief step back and start by
introducing Polars and the reasons why the tool is so
promising.

The strength of Polars

If Polars brings such a performance boost, it is due to some
key technical features.

1. Written in Rust: Polars offers a Python API. However, its
core is written in Rust and compiled to native code. This
means that the heavy lifting does not need to be
interpreted, as code in Python would.

2. Out of Core: Polars supports out of core data
transformation with its streaming API. This allows you
to process your results without requiring all your data
to be in memory at the same time.

3. Parallelization: Polars fully utilizes the power of your
machine by dividing the workload among the available
CPU cores without any additional configuration.

4. Vectorized Query Engine: Polars uses Apache Arrow, a
columnar data format, to process your queries in a
vectorized manner.

5. Lazy Evaluation API: With the lazy API, Polars does not
run each query line-by-line but instead processes the
full query end-to-end.

Benchmarking on a large data set from Kaggle

We consider a very large dataset, consisting of millions of
rows. For such a volume of data, we roughly expect data
processing in pandas to take several hours. When
presenting results, we repeat each measurement 10 times
and report both the mean and the standard deviation.
All the experiments were conducted on an Apple M1 Pro chip
with Python version 3.9.13.
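
The article does not show its timing harness. A minimal sketch of the protocol described above (10 repetitions, mean and standard deviation), using only the standard library, could look as follows. The helper name benchmark is hypothetical, and the sorted-list workload is only a stand-in for a dataframe operation.

```python
import statistics
import timeit


def benchmark(func, repetitions=10):
    """Call func `repetitions` times; return (mean, std) of the wall-clock seconds."""
    # number=1 means each timing covers exactly one call of func.
    timings = timeit.repeat(func, number=1, repeat=repetitions)
    return statistics.mean(timings), statistics.stdev(timings)


# Stand-in workload (an assumption); replace it with e.g. a read_csv call.
mean_s, std_s = benchmark(lambda: sorted(range(100_000), reverse=True))
print(f"{mean_s * 1e3:.2f} ms ± {std_s * 1e3:.2f} ms")
```

Passing a callable instead of a code string keeps the measured operation easy to swap between a pandas and a polars version.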

In the following, we will compare different key
functionalities between pandas and polars, focusing on two
aspects. On the one hand the efficiency: is Polars faster and
by how much? On the other hand: how easy is it to write
the equivalent of Pandas commands in Polars?

Installing and importing the packages

We install the required packages using pip. For this
article, pandas version 2.0.0 and polars version
0.17.0 were used.

pip install pandas==2.0.0
pip install polars==0.17.0

After installation we import the packages in our runtime
environment:

import pandas as pd
import polars as pl

1. Comparing reading time of the dataset

The US Accidents dataset from Kaggle has been used for all
of the experiments presented below. The dataset is
contained in a CSV file with ~17 million rows and
46 columns.

The code to read such a dataset looks very similar for both
pandas and polars:

#pandas
pd_df = pd.read_csv("US_Accidents.csv", nrows=1000)

#polars
pl_df = pl.read_csv("US_Accidents.csv", n_rows=1000)

Performance comparison of reading 1000 rows from a CSV file

Using Polars, it takes 4.36 ms ± 3.55 ms to read the first
1000 rows of the CSV file. By contrast, it takes pandas 10.7
ms ± 549 µs. While this tends to indicate that Polars is
faster, the Polars result is less reproducible: its standard
deviation is almost as large as its mean. In our runs,
Polars was always faster, but we cannot entirely
exclude that Polars might in some cases be slower than
Pandas, nor can we control the lack of reproducibility on a
small sample of the dataset.

The difference in reading csv files becomes more apparent
when the full dataset is read.

Performance comparison of reading a full CSV file with ~17 million rows

In our example, Pandas takes 1 min 27 s ± 2.43 s to read
the full file, while Polars takes 7.79 s ± 1.11 s. That is a
speedup of roughly 11 times, which becomes very significant
in data science use cases.

We also notice that the large standard deviation previously
observed in the small sample of the dataset becomes
negligible once we consider the entire dataset. This is an
important observation, which shows that, even in a testing
phase, we must consider a significant sample of the overall
dataset. As we tend to develop PoCs based on small samples
of the data, this observation on the stability of Polars is
noteworthy.

Another interesting observation is the following: reading a
dataset with Polars and then converting it to a Pandas
dataframe takes on average 29.4 s ± 2.75 s. Therefore, it is
about three times faster to read a dataset with Polars and
convert it to a Pandas dataframe than to read the same
dataset using only Pandas.

2. Selecting data columns

The syntax for selecting a subset of the columns using the
Polars package is again almost identical to the syntax of
pandas.

#pandas
pd_df_selected = pd_df[['Severity', 'Start_Time',
'End_Time', 'Station', 'Stop', 'Traffic_Signal']]

#polars
pl_df_selected = pl_df[['Severity', 'Start_Time',
'End_Time', 'Station', 'Stop', 'Traffic_Signal']]

Performance comparison of selecting 6 out of 46 columns


As for reading the data, Polars also performs a lot better
when it comes to selecting certain columns of the
dataframe. On average, it takes Pandas 256 ms ± 406 ms to
return a dataframe of the selected columns, while it takes
Polars only 573 µs ± 1.68 ms. This is a speedup of more
than 400 times on average, although both standard
deviations exceed their means, so individual runs vary
widely.

3. Filtering data

Filtering based on column values, on the other hand, has a
slightly different syntax in polars.

#pandas
filter_pd_df = pd_df[pd_df['Traffic_Signal']==True]

#polars
filter_pl_df = pl_df.filter(pl.col('Traffic_Signal')==True)

Performance comparison of filtering based on Traffic_Signal==True

When it comes to filtering data, Pandas and Polars are a lot
more similar in terms of performance, although Polars is
still faster than Pandas. On average, it takes pandas 3.8
s ± 2.45 s to filter the rows based on the condition
specified, while it takes polars 1.37 s ± 1.18 s to perform the
same transformation, which is roughly 2.8 times faster.

4. Sorting

The syntax to sort the data presents minor differences
between the two packages. The sorting function in Pandas
is called sort_values whereas in Polars it is called sort. Besides
the function name, if you want to sort values in descending
order, in pandas you need to use ascending=False while in
polars you have to use descending=True.


#pandas
sorted_pd_df = pd_df.sort_values(by='Humidity(%)',
ascending=False)

#polars
sorted_pl_df = pl_df.sort("Humidity(%)",
descending=True)

Performance comparison of sorting based on Humidity(%) column

As with filtering, Polars is only moderately faster than Pandas
for sorting. On average pandas takes 10.1 s ± 813 ms while
polars does the sorting in 6.61 s ± 786 ms, a speedup of
about 1.5 times.

5. Grouping

#pandas
grouped_pd_df = pd_df.groupby(['State'])['ID'].agg('count')

#polars (groupby was renamed group_by in later Polars releases)
grouped_pl_df = pl_df.groupby('State').agg(pl.col('ID').count())

Performance comparison of grouping based on State

To aggregate a dataframe, Polars once again outperforms
Pandas. It takes Pandas an average of 606 ms ± 19.4 ms to
run the grouping command provided above, while it takes
polars only 107 ms ± 12.1 ms to perform the same task.

6. Conclusion on benchmarking

Performance comparison on a Kaggle dataset with ~17 million rows and 46 columns
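
The speedup factors implied by the mean timings reported in sections 1 to 5 can be recomputed with a short sketch. The numbers are copied from the text above and converted to seconds; the dictionary layout and labels are our own choice, and the standard deviations are deliberately ignored here.

```python
# Mean timings in seconds, copied from sections 1 to 5: (pandas, polars).
reported = {
    "read 1000 rows": (10.7e-3, 4.36e-3),
    "read full CSV": (87.0, 7.79),  # 1 min 27 s = 87 s
    "select 6 columns": (256e-3, 573e-6),
    "filter": (3.8, 1.37),
    "sort": (10.1, 6.61),
    "group by": (606e-3, 107e-3),
}

# Speedup = pandas mean divided by polars mean; values above 1 favor Polars.
speedups = {
    operation: pandas_s / polars_s
    for operation, (pandas_s, polars_s) in reported.items()
}

for operation, factor in speedups.items():
    print(f"{operation:<18} {factor:7.1f}x")
```

Because several of the measurements have standard deviations close to or above their means, these ratios should be read as orders of magnitude rather than precise factors.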

Lazy API

Polars offers another opportunity to further enhance the
performance of the dataframe operations by using the Lazy
API. Through the Lazy API, polars does not run each query
line-by-line, but instead processes the full query end-to-end.
According to the Polars website, it is important to use
the Lazy API because:

1. The lazy API allows Polars to apply automatic query
optimization with the query optimizer.

2. The lazy API allows you to work with larger than
memory datasets using streaming.

3. The lazy API can catch schema errors before processing
the data.

Polars supports two modes of operation, eager and lazy.
As an example, let's say that we want to find the
number of accidents of high severity (Severity==4) per
county. Using the eager API, the code reads as follows:

pl_df = (
    pl.read_csv("US_Accidents.csv")
    .filter(pl.col('Severity')==4)
    .groupby(['State', 'County'])
    .agg(pl.col('ID').count().alias("Count Severity"))
    .sort("Count Severity", descending=True)
)


In order to use the lazy API, we must implement some
syntax changes on the code above. Most importantly, we
must replace the function “read_csv” with “scan_csv”. This
function returns a LazyFrame instead of a DataFrame, which
ensures that the lazy API is being used. The query is only
executed once “collect” is called at the end of the chain.

Going back to our example, the code that uses lazy API now
reads:

q1 = (
    pl.scan_csv("US_Accidents.csv")
    .filter(pl.col('Severity')==4)
    .groupby(['State', 'County'])
    .agg(pl.col('ID').count().alias("Count Severity"))
    .sort("Count Severity", descending=True)
    .collect()
)

Performance comparison between eager and lazy API.

As displayed in the figure above, the lazy API greatly
outperforms the eager API. The Lazy API performs the task
in 1.27 s ± 203 ms on average, compared to 8.42 s ± 1.19 s for
the eager API. That is a speedup of more than 6 times, which
is achieved with minimal changes to the source code.

Qualitative Comparison

So far, our comparative analysis of Pandas and Polars has
been in favor of Polars. The performance boost of Polars
depends on the function, but Polars was faster on average
in every one of our benchmarks on the large dataset.


This point is made even clearer when considering the
potential offered by lazy API, which can make Polars even
more efficient. This extensibility allows users to tailor Polars
to their specific requirements, opening up new avenues for
optimization and customization.

However, a data science project is not only about
performance. It is essential to consider the broader context
when selecting a tool for your data analysis needs. When
looking at the statistics of the most used programming
languages among developers worldwide as of 2023,
Python accounts for 49.28% while Rust only amounts to
13.05% (and this number was lower in previous years). In
short, Pandas boasts a significantly larger and more mature
community, which translates into a wealth of online
resources and a vast ecosystem of extensions and
integrations. For users who prioritize readily available code
examples, troubleshooting resources, and a well-documented
user base, Pandas offers a clear advantage.

Conclusion

Ultimately, the choice between Pandas and Polars hinges
on your unique project requirements and priorities. If raw
performance and efficiency are paramount, Polars seems to
be the superior choice. On the other hand, if you value the
extensive support and resources offered by a larger
community, Pandas remains a dependable and widely
adopted tool. Striking the right balance between these
factors is a decision that should be made with careful
consideration of your specific use case and objectives.


Can Polars truly outshine Pandas?

At Machine Learning Reply GmbH, we always want the most efficient and powerful
technologies to answer the needs of our clients 💪. Therefore, we commit ourselves
to learning and assessing any new tool that can speed up our projects 🔍.

As such, Polars is a “must have” in our benchmarking catalog. It claims to be nothing
less than a “blazingly fast Dataframe Library” 🏎 and positions itself as “one of
the best performing solutions available”.

In this article, Arlind Avdullahi challenges this statement, confronting Polars with the
widely popular Pandas. He reviews the most crucial aspects, answering questions
such as:
- Is Polars faster in analyzing a large dataset of several million rows ❓
- Is it easy to transfer prior knowledge from Pandas to Polars ❓
- How about the lazy API of Polars ❓
- What about the online support, integration and extensions ❓

By the end of your reading, you will know how to make more enlightened decisions
to choose the most suitable tool to tackle your data science project.


Andres Aranda • 3rd+ 6d


Software Engineering

Created a gist out of this for comparing approx. performance for 1-time
run on full dataset here
https://gist.github.com/ankandrew/47fa7bc73984981d54839dab57949f4c


Prashanth Yennampelli • 3rd+ 4w


Data & Analytics Consultant | Data Architect | AWS Solution Architect | Cloud
Migration Consultant

Results may vary based on the type of data source utilized for
benchmarking. Did you check the results for parquet data as input for both
polars and pandas?
