Polars Vs Pandas: Benchmarking performances and beyond
Machine Learning Reply GmbH
by Arlind Avdullahi
Introduction
https://www.linkedin.com/pulse/polars-vs-pandas-benchmarking-performances-beyond-l6svf/ 1/12
5/6/24, 12:31 PM (25) Polars Vs Pandas: Benchmarking performances and beyond | LinkedIn
Before diving in, let us take a brief step back and start by
introducing Polars and the reasons why the tool is so
promising.
1. Lazy Evaluation API: With the lazy API, Polars does not
run each query line-by-line but instead processes the
full query end-to-end.
import pandas as pd
import polars as pl
The US Accidents dataset from Kaggle has been used for all
of the experiments presented below. The dataset is
The code to read such a dataset looks very similar in
pandas and Polars:
#pandas
pd_df = pd.read_csv("US_Accidents.csv", nrows=1000)
#polars
pl_df = pl.read_csv("US_Accidents.csv", n_rows=1000)
Performance comparison of reading a full CSV file with ~17 million rows
#pandas
pd_df_selected = pd_df[['Severity', 'Start_Time',
'End_Time', 'Station', 'Stop', 'Traffic_Signal']]
#polars
pl_df_selected = pl_df[['Severity', 'Start_Time',
'End_Time', 'Station', 'Stop', 'Traffic_Signal']]
3. Filtering data
#pandas
filter_pd_df = pd_df[pd_df['Traffic_Signal']==True]
#polars
filter_pl_df = pl_df.filter(pl.col('Traffic_Signal')==True)
4. Sorting
#pandas
sorted_pd_df = pd_df.sort_values(by='Humidity(%)',
ascending=False)
#polars
sorted_pl_df = pl_df.sort("Humidity(%)",
descending=True)
5. Grouping
#pandas
grouped_pd_df = pd_df.groupby(['State'])['ID'].agg('count')
#polars (`groupby` was renamed to `group_by` in Polars 0.19)
grouped_pl_df = pl_df.group_by('State').agg(pl.col('ID').count())
Performance comparison of grouping based on State
6. Conclusion on benchmarking
Lazy API
pl_df = (
    pl.read_csv("US_Accidents.csv")
    .filter(pl.col('Severity') == 4)
    # `groupby` was renamed to `group_by` in Polars 0.19
    .group_by(['State', 'County'])
    .agg(pl.col('ID').count().alias("Count Severity"))
    .sort("Count Severity", descending=True)
)
Going back to our example, the code that uses the lazy API now
reads:
q1 = (
    pl.scan_csv("US_Accidents.csv")
    .filter(pl.col('Severity') == 4)
    # `groupby` was renamed to `group_by` in Polars 0.19
    .group_by(['State', 'County'])
    .agg(pl.col('ID').count().alias("Count Severity"))
    .sort("Count Severity", descending=True)
    .collect()
)
Qualitative Comparison
Conclusion
At Machine Learning Reply GmbH, we always want the most efficient and powerful
technologies to answer the needs of our clients 💪. Therefore, we commit ourselves
to learning and assessing any new tool that can speed up our projects 🔍.
In this article, Arlind Avdullahi challenges this statement, confronting Polars with the
widely popular Pandas. He reviews the most crucial aspects, answering questions
such as:
- Is Polars faster at analyzing a large dataset of several million rows❓
- Is it easy to transfer prior knowledge from Pandas to Polars❓
- How about the lazy API of Polars❓
- What about the online support, integration and extensions❓
By the end of your reading, you will be able to make more enlightened decisions
when choosing the most suitable tool to tackle your data science project.
Created a gist out of this for comparing approx. performance for 1-time
run on full dataset here
https://gist.github.com/ankandrew/47fa7bc73984981d54839dab57949f4c
Results may vary based on the type of data source utilized for
benchmarking. Did you check the results for parquet data as input for both
polars and pandas?