(25) Polars Vs Pandas_ Benchmarking performances and beyond _ LinkedIn

5/6/24, 12:31 PM (25) Polars Vs Pandas: Benchmarking performances and beyond | LinkedIn
Polars Vs Pandas: Benchmarking performances and beyond
Machine Learning Reply GmbH
January 18, 2024
by Arlind Avdullahi
Introduction
If you have ever done any kind of experimenting in data science, you must have heard of Pandas. To quote the corresponding GitHub documentation, Pandas is a “Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more”. The library is widely popular in the data science community because it offers a broad range of functionality, is intuitive to use, and is reasonably fast.
However, a new package was recently published and promises highly competitive performance. This package is called Polars and it calls itself a “Blazingly Fast DataFrame Library”. It claims to be nothing less than “one of the best performing solutions available”.
https://www.linkedin.com/pulse/polars-vs-pandas-benchmarking-performances-beyond-l6svf/
In this article, we intend to challenge this statement by comparing Pandas and Polars in a scenario close to our data science use cases. For this, we will use a large data set from Kaggle.
In the first section of this article, we will review the performance and efficiency of the most essential operations in data science: reading, selecting, filtering, sorting and grouping data. We will also assess how difficult it is to transfer prior knowledge of Pandas syntax to Polars. The syntax of Pandas is familiar and easy to use for most data scientists, so it is important to understand whether the Polars syntax entails additional complexity.
After covering this side-by-side comparison, we will dive deeper into the Polars world and investigate the promising Lazy API of Polars.
Finally, we will complete our study with a more qualitative comparison, considering the resources available, such as documentation, community support, available extensions and integrations. These aspects tend to be overlooked in regular benchmarking because they cannot be quantified. However, they are key for developers, who all know the immense value of communities such as Stack Overflow.
Before diving in, let us take a brief step back and start by
introducing Polars and the reasons why the tool is so
promising.
The strength of Polars
Polars owes its performance boost to a few key technical features.
1. Written in Rust: Polars offers a Python API, but its core is written in Rust and compiled to native code, so the heavy lifting is not executed by the Python interpreter.
2. Out of Core: Polars supports out-of-core data transformation with its streaming API. This allows you to process your results without requiring all your data to be in memory at the same time.
3. Parallelization: Polars fully utilizes the power of your machine by dividing the workload among the available CPU cores without any additional configuration.
4. Vectorized Query Engine: Polars uses Apache Arrow, a columnar data format, to process your queries in a vectorized manner.
5. Lazy Evaluation API: With the lazy API, Polars does not run each query line by line but instead processes the full query end to end.
Benchmarking on a large data set from Kaggle
We consider a very large dataset, consisting of millions of rows. For such a volume of data, we roughly expect that the data processing in pandas takes several hours. Each measurement is repeated 10 times, and we report both the mean and the standard deviation. All the experiments were conducted on an Apple M1 Pro chip with Python version 3.9.13.
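The measurement protocol described above (10 repetitions, mean and standard deviation) can be sketched with the standard library alone. The article does not state which timing tool was used, so the `benchmark` helper and the placeholder workload below are assumptions made purely for illustration.

```python
import statistics
import time


def benchmark(func, repetitions=10):
    """Run func `repetitions` times; return (mean, standard deviation) in seconds."""
    durations = []
    for _ in range(repetitions):
        start = time.perf_counter()
        func()
        durations.append(time.perf_counter() - start)
    return statistics.mean(durations), statistics.stdev(durations)


# Hypothetical workload standing in for a dataframe operation.
mean_s, std_s = benchmark(lambda: sum(range(100_000)))
print(f"{mean_s * 1e3:.2f} ms ± {std_s * 1e3:.2f} ms")
```

In a real run, the lambda would be replaced by the operation under test, for example a call that reads the CSV file.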
In the following, we will compare key functionalities of pandas and polars, focusing on two aspects. On the one hand, efficiency: is Polars faster, and by how much? On the other hand, ease of use: how easy is it to write the equivalent of Pandas commands in Polars?
Installing and importing the packages
We install the required packages using pip. At the time of writing, pandas version 2.0.0 and polars version 0.17.0 were used.
pip install pandas==2.0.0
pip install polars==0.17.0
After installation, we import the packages into our runtime environment:
import pandas as pd
import polars as pl
1. Comparing reading time of the dataset
The US Accidents dataset from Kaggle has been used for all of the experiments presented below. The dataset is stored in a single CSV file that contains ~17 million rows and 46 columns.
The code to read such a dataset looks very similar for both pandas and polars:
#pandas
pd_df = pd.read_csv("US_Accidents.csv", nrows=1000)
#polars
pl_df = pl.read_csv("US_Accidents.csv", n_rows=1000)
Performance comparison of reading 1000 rows from a CSV file
Using Polars, it takes 4.36 ms ± 3.55 ms to read the first 1000 rows of the CSV file. By contrast, it takes pandas 10.7 ms ± 549 µs. While this tends to indicate that Polars is faster, the result is less reproducible than with Pandas. In our runs, Polars was always faster, but we cannot entirely exclude that Polars might in some cases be slower than Pandas, nor can we control the lack of reproducibility on such a small sample of the dataset.
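To make the reproducibility remark concrete, the standard deviation can be expressed as a fraction of the mean. The snippet below is only an arithmetic sketch on the two measurements quoted above; the variable names are ours.

```python
# Mean and standard deviation reported above for reading 1000 rows, in milliseconds.
timings_ms = {
    "polars": (4.36, 3.55),
    "pandas": (10.7, 0.549),
}

# Relative spread: standard deviation divided by mean.
relative_spread = {name: std / mean for name, (mean, std) in timings_ms.items()}

for name, spread in relative_spread.items():
    print(f"{name}: standard deviation is {spread:.0%} of the mean")
```

The Polars spread is of the same order as its mean, while the pandas spread is only a few percent of its mean.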
The difference in reading CSV files becomes more apparent when the full dataset is read.
Performance comparison of reading a full CSV file with ~17 million rows
In our example, Pandas takes 1 min 27 s ± 2.43 s to read the full file, while Polars takes 7.79 s ± 1.11 s. That is a speedup of roughly 11 times, which becomes very significant in data science use cases.
We also notice that the large standard deviation previously observed on the small sample of the dataset becomes far less relevant once the entire dataset is considered. This is an important observation: even in a testing phase, we must consider a significant sample of the overall dataset. As we tend to develop proofs of concept (PoCs) based on small samples of the data, this observation on the stability of Polars is noteworthy.
Another interesting observation is the following: reading a dataset with Polars and then converting it to a Pandas dataframe takes on average 29.4 s ± 2.75 s. Therefore, it is faster to read a dataset with Polars and convert it to a Pandas dataframe than to read the same dataset using only Pandas.
2. Selecting data columns
The syntax for selecting a subset of the columns using the Polars package is again almost identical to the syntax of pandas:
#pandas
pd_df_selected = pd_df[['Severity', 'Start_Time',
'End_Time', 'Station', 'Stop', 'Traffic_Signal']]
#polars
pl_df_selected = pl_df[['Severity', 'Start_Time',
'End_Time', 'Station', 'Stop', 'Traffic_Signal']]
Performance comparison of selecting 6 out of 46 columns
As for reading the data, Polars also performs a lot better when it comes to selecting certain columns of the dataframe. On average, it takes Pandas 256 ms ± 406 ms to return a dataframe of the selected columns, while it takes Polars only 573 µs ± 1.68 ms. On the means, this is a speedup of roughly 450 times, although both standard deviations exceed their means, so the exact factor should be read with caution.
3. Filtering data
Filtering based on column values, on the other hand, has a slightly different syntax in polars:
#pandas
filter_pd_df = pd_df[pd_df['Traffic_Signal']==True]
#polars
filter_pl_df = pl_df.filter(pl.col('Traffic_Signal')==True)
Performance comparison of filtering based on Traffic_Signal==True
When it comes to filtering data, Pandas and Polars are much closer in terms of performance, but Polars is still faster. On average, it takes pandas 3.8 s ± 2.45 s to filter the rows based on the specified condition, while it takes polars 1.37 s ± 1.18 s to perform the same transformation, a speedup of roughly 2.8 times.
4. Sorting
The syntax to sort the data presents minor differences between the two packages. The sorting function in Pandas is called sort_values, whereas in Polars it is called sort. Besides the function name, if you want to sort values in descending order, in pandas you need to use ascending=False, while in polars you have to use descending=True.
#pandas
sorted_pd_df = pd_df.sort_values(by='Humidity(%)',
ascending=False)
#polars
sorted_pl_df = pl_df.sort("Humidity(%)",
descending=True)
Performance comparison of sorting based on Humidity(%) column
As for filtering, the gap between Polars and Pandas is moderate for sorting. On average, pandas takes 10.1 s ± 813 ms while polars does the sorting in 6.61 s ± 786 ms, a speedup of roughly 1.5 times.
5. Grouping
#pandas
grouped_pd_df = pd_df.groupby(['State'])['ID'].agg('count')
#polars
grouped_pl_df = pl_df.groupby('State').agg(pl.col('ID').count())
Performance comparison of grouping based on State
To aggregate a dataframe, Polars once again outperforms Pandas. It takes Pandas an average of 606 ms ± 19.4 ms to run the grouping command provided above, while it takes polars only 107 ms ± 12.1 ms to perform the same task, a speedup of roughly 5.7 times.
6. Conclusion on benchmarking
Performance comparison on a Kaggle dataset with ~17 million rows and 46 columns
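The mean timings reported in the sections above translate into the following speedup factors. This is a plain arithmetic sketch on the published means; it ignores the standard deviations, which are large for some operations.

```python
# (pandas mean, polars mean) in seconds, taken from the measurements above.
results_s = {
    "read 1000 rows": (10.7e-3, 4.36e-3),
    "read full file": (87.0, 7.79),
    "select 6 columns": (256e-3, 573e-6),
    "filter": (3.8, 1.37),
    "sort": (10.1, 6.61),
    "group by": (606e-3, 107e-3),
}

# Speedup = pandas time divided by polars time.
speedups = {
    task: pandas_s / polars_s for task, (pandas_s, polars_s) in results_s.items()
}

for task, factor in speedups.items():
    print(f"{task:<18} {factor:7.1f}x")
```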
Lazy API
Polars offers another opportunity to further enhance the performance of the dataframe operations by using the Lazy API. Through the Lazy API, Polars does not run each query line by line, but instead processes the full query end to end. According to the Polars website, it is important to use the Lazy API because:
1. The lazy API allows Polars to apply automatic query optimization with the query optimizer.
2. The lazy API allows you to work with larger than memory datasets using streaming.
3. The lazy API can catch schema errors before processing the data.
Polars supports two modes of operation: the eager API and the lazy API. As an example, let's say that we want to find the number of accidents of high severity (Severity==4) per county. Using the eager API, the code reads as follows:
pl_df = (
    pl.read_csv("US_Accidents.csv")
    .filter(pl.col('Severity')==4)
    .groupby(['State', 'County'])
    .agg(pl.col('ID').count().alias("Count Severity"))
    .sort("Count Severity", descending=True)
)
In order to use the lazy API, we must make some syntax changes to the code above. Most importantly, we must replace the function “read_csv” with “scan_csv”. This function returns a LazyFrame instead of a DataFrame, which ensures that the lazy API is being used. The query is then executed only when “collect” is called at the end of the chain.
Going back to our example, the code that uses the lazy API now reads:
q1 = (
    pl.scan_csv("US_Accidents.csv")
    .filter(pl.col('Severity')==4)
    .groupby(['State', 'County'])
    .agg(pl.col('ID').count().alias("Count Severity"))
    .sort("Count Severity", descending=True)
    .collect()
)
Performance comparison between eager and lazy API.
As displayed in the figure above, the lazy API greatly outperforms the eager API. The Lazy API performs the task in 1.27 s ± 203 ms on average, compared to 8.42 s ± 1.19 s for the eager API. That is a speedup of roughly 6.6 times, which is achieved with minimal changes to the source code.
Qualitative Comparison
So far, our comparative analysis of Pandas and Polars has been in favor of Polars. The size of the performance boost depends on the operation, but Polars was faster in every benchmark we ran on the full dataset.
This point is made even clearer when considering the potential offered by the lazy API, which can make Polars even more efficient. It gives users additional room to optimize their pipelines for their specific requirements.
However, a data science project is not only about performance. It is essential to consider the broader context when selecting a tool for your data analysis needs. According to statistics on the most used programming languages among developers worldwide as of 2023, Python accounts for 49.28% while Rust amounts to only 13.05% (and this number was lower in previous years). In short, Pandas boasts a significantly larger and more mature community, which translates into a wealth of online resources and a vast ecosystem of extensions and integrations. For users who prioritize readily available code examples, troubleshooting resources, and thorough documentation backed by a large user base, Pandas offers a clear advantage.
Conclusion
Ultimately, the choice between Pandas and Polars hinges on your unique project requirements and priorities. If raw performance and efficiency are paramount, Polars seems to be the superior choice. On the other hand, if you value the extensive support and resources offered by a larger community, Pandas remains a dependable and widely adopted tool. Striking the right balance between these factors is a decision that should be made with careful consideration of your specific use case and objectives.
𝐂𝐚𝐧 𝐏𝐨𝐥𝐚𝐫𝐬 𝐭𝐫𝐮𝐥𝐲 𝐨𝐮𝐭𝐬𝐡𝐢𝐧𝐞 𝐏𝐚𝐧𝐝𝐚𝐬?
At Machine Learning Reply GmbH, we always want the most efficient and powerful
technologies to answer the needs of our clients 💪 . Therefore, we commit ourselves
to learn and assess any new tool that can speed up our projects 🔍 .
As such, Polars is a “must have” in our benchmarking catalog. It claims to be nothing
less than a “𝐛𝐥𝐚𝐳𝐢𝐧𝐠𝐥𝐲 𝐟𝐚𝐬𝐭 𝐃𝐚𝐭𝐚𝐟𝐫𝐚𝐦𝐞 𝐋𝐢𝐛𝐫𝐚𝐫𝐲” 🏎 and positions itself as “𝐨𝐧𝐞 𝐨𝐟
𝐭𝐡𝐞 𝐛𝐞𝐬𝐭 𝐩𝐞𝐫𝐟𝐨𝐫𝐦𝐢𝐧𝐠 𝐬𝐨𝐥𝐮𝐭𝐢𝐨𝐧𝐬 𝐚𝐯𝐚𝐢𝐥𝐚𝐛𝐥𝐞”.
In this article, Arlind Avdullahi challenges this statement, confronting Polars with the
widely popular Pandas. He reviews the most crucial aspects, answering questions
such as:
- Is Polars faster in analyzing a large dataset of several millions rows ❓
- Is it easy to transfer prior knowledge from Pandas to Polars❓
- How about the lazy API of Polars❓
- What about the online support, integration and extensions❓
By the end of your reading, you will know how to make more 𝐞𝐧𝐥𝐢𝐠𝐡𝐭𝐞𝐧𝐞𝐝 𝐝𝐞𝐜𝐢𝐬𝐢𝐨𝐧𝐬
𝐭𝐨 𝐜𝐡𝐨𝐨𝐬𝐞 𝐭𝐡𝐞 𝐦𝐨𝐬𝐭 𝐬𝐮𝐢𝐭𝐚𝐛𝐥𝐞 𝐭𝐨𝐨𝐥 to tackle your data science project.
Andres Aranda • 3rd+ 6d

Software Engineering
Created a gist out of this for comparing approx. performance for 1-time
run on full dataset here
https://gist.github.com/ankandrew/47fa7bc73984981d54839dab57949f4c
Prashanth Yennampelli • 3rd+ 4w

Data & Analytics Consultant | Data Architect | AWS Solution Architect | Cloud
Migration Consultant
Results may vary based on the type of data source utilized for
benchmarking. Did you check the results for parquet data as input for both
polars and pandas?