Download as pdf or txt
Download as pdf or txt
You are on page 1of 10

11/29/2020 Scaling Pandas: Dask vs Ray vs Modin vs Vaex vs RAPIDS

Blog For Managers For Engineers Let's talk

Scaling Pandas:
Comparing
Dask, Ray,
Modin, Vaex,
and RAPIDS
How can you process more data quicker?

by Markus Schmitt

Python and its most popular data wrangling library, Pandas, are soaring in popularity.

Compared to competitors like Java, Python and Pandas make data exploration and

transformation simple.

But both Python and Pandas are known to have issues around scalability and efficiency.

Python loses some efficiency right off the bat because it’s an interpreted, dynamically

typed language. But more importantly, Python has always focused on simplicity and

readability over raw power. Similarly, Pandas focuses on offering a simple, high-level API,

largely ignoring performance. In fact, the creator of Pandas wrote “The 10 things I hate

about pandas,” which summarizes these issues:

https://www.datarevenue.com/en-blog/pandas-vs-dask-vs-vaex-vs-modin-vs-rapids-vs-ray 1/10
11/29/2020 Scaling Pandas: Dask vs Ray vs Modin vs Vaex vs RAPIDS

Blog For Managers For Engineers Let's talk

Performance issues and lack of flexibility are the main things

Pandas’ own creator doesn’t like about the library. (source)

So it’s no surprise that many developers are trying to add more power to Python and

Pandas in various ways. Some of the most notable projects are:

Dask: a low-level scheduler and a high-level partial Pandas replacement, geared

toward running code on compute clusters.

Ray: a low-level framework for parallelizing Python code across processors or

clusters.

Modin: a drop-in replacement for Pandas, powered by either Dask or Ray.

Vaex: a partial Pandas replacement that uses lazy evaluation and memory mapping

to allow developers to work with large datasets on standard machines.

RAPIDS: a collection of data-science libraries that run on GPUs and include cuDF, a

partial replacement for Pandas.

There are others, too. Below is an overview of the Python data wrangling landscape:

https://www.datarevenue.com/en-blog/pandas-vs-dask-vs-vaex-vs-modin-vs-rapids-vs-ray 2/10
11/29/2020 Scaling Pandas: Dask vs Ray vs Modin vs Vaex vs RAPIDS

Blog For Managers For Engineers Let's talk

Dask, Modin, Vaex, Ray, and CuDF are often considered potential alternatives to each other. Source:

Created with this tool

So if you’re working with a lot of data and need faster results, which should you use?

Just tell me which one to try


Before you can make a decision about which tool to use, it’s good to have some more

context about each of their approaches. We’ll compare each of them closely, but you’ll

probably want to try them out in the following order:

Modin, with Ray as a backend. By installing these, you might see significant benefit

by changing just a single line (`import pandas as pd` to `import modin.pandas as


pd`). Unlike the other tools, Modin aims to reach full compatibility with Pandas.

Dask, a larger and hence more complicated project. But Dask also provides

Dask.dataframe, a higher-level, Pandas-like library that can help you deal with out-of-
core datasets.

Vaex, which is designed to help you work with large data on a standard laptop. Its

Pandas replacement covers some of the Pandas API, but it’s more focused on
exploration and visualization.

RAPIDS, if you have access to NVIDIA graphics cards.

https://www.datarevenue.com/en-blog/pandas-vs-dask-vs-vaex-vs-modin-vs-rapids-vs-ray 3/10
11/29/2020 Scaling Pandas: Dask vs Ray vs Modin vs Vaex vs RAPIDS

Quick comparison
Blog For Managers For Engineers Let's talk

Each of the libraries we examine has different strengths, weaknesses, and scaling

strategies. The following table gives a broad overview of these. Of course, as with many

things, most of the scores below are heavily dependent on your exact situation. 

Dask and Ray are more mature, but Modin and Vaex are easier to get started with. Rapids is useful if you

have access to GPUs.

These are subjective grades, and they may vary widely given your specific circumstances.

When assigning these grades, we considered:

Maturity: The time since the first commit and the number of commits.

Popularity: The number of GitHub stars.

Ease of Adoption: The amount of knowledge expected from users, presumed

hardware resources, and ease of installation.

Scaling ability: The broad dataset size limits for each tool, depending on whether it

relies mainly on RAM, hard drive space on a single machine, or can scale up to
clusters of machines. 

Use case: Whether the libraries are designed to speed up Python software in general

(“General”), are focused on data science and machine learning (“Data science”), or

are limited to simply replacing Pandas’ ‘DataFrame’ functionality (“DataFrame”).

CPUs, GPUs, Clusters, or Algorithms?


If your dataset is too large to work with efficiently on a single machine, your main options

are to run your code across…

...multiple threads or processors: Modern CPUs have several independent cores,

and each core can run many threads. Ensuring that your program uses all the
potential processing power by parallelizing across cores is often the easiest place to
start.

https://www.datarevenue.com/en-blog/pandas-vs-dask-vs-vaex-vs-modin-vs-rapids-vs-ray 4/10
11/29/2020 Scaling Pandas: Dask vs Ray vs Modin vs Vaex vs RAPIDS

...GPU cores: Graphics cards were originally designed to efficiently carry out basic
Blog
operations For Managers
on millions of pixels in parallel.For Engineers
However, developers soon saw other uses Let's talk
for this power, and “GP-GPU” (general processing on a graphics processing unit) is
now a popular way to speed up code that relies heavily on matrix manipulations.

...compute clusters: Once you hit the limits of a single machine, you need a

networked cluster of machines, working cooperatively.

Apart from adding more hardware resources, clever algorithms can also improve

efficiency. Tools like Vaex rely heavily on lazy evaluation (not doing any computation until

it’s certain the results are needed) and memory mapping (treating files on hard drives as

if they were loaded into RAM).

None of these strategies is inherently better than the others, and you should choose the

one that suits your specific problem.

Parallel programming (no matter whether you’re using threads, CPU cores, GPUs, or

clusters) offers many benefits, but it’s also quite complex, and it makes tasks such as

debugging far more difficult.

Modern libraries can hide some – but not all – of this added complexity. No matter which

tools you use, you’ll run the risk of expecting everything to work out neatly (below left), but

getting chaos instead (below right).

Parallel processing doesn’t always work out as neatly as you expect. (Source)

Dask vs. Ray vs. Modin vs. Vaex vs. RAPIDS

https://www.datarevenue.com/en-blog/pandas-vs-dask-vs-vaex-vs-modin-vs-rapids-vs-ray 5/10
11/29/2020 Scaling Pandas: Dask vs Ray vs Modin vs Vaex vs RAPIDS

While not all of these libraries are direct alternatives to each other, it’s useful to compare
Blog For Managers For Engineers Let's talk
them each head-to-head when deciding which one(s) to use for a project.

Before getting into the details, note that:

RAPIDS is a collection of libraries. For this comparison, we consider only the cuDF

component, which is the RAPIDS equivalent of Pandas.

Dask is better thought of as two projects: a low-level Python scheduler (similar in


some ways to Ray) and a higher-level Dataframe module (similar in many ways to
Pandas).

Dask vs. Ray

Dask (as a lower-level scheduler) and Ray overlap quite a bit in their goal of making it

easier to execute Python code in parallel across clusters of machines. Dask focuses more

on the data science world, providing higher-level APIs that in turn provide partial

replacements for Pandas, NumPy, and scikit-learn, in addition to a low-level scheduling

and cluster management framework.

The creators of Dask and Ray discuss how the libraries compare in this GitHub thread,

and they conclude that the scheduling strategy is one of the key differentiators. Dask uses

a centralized scheduler to share work across multiple cores, while Ray uses distributed

bottom-up scheduling.

Dask vs. Modin

Dask (the higher-level Dataframe) acknowledges the limitations of the Pandas API, and

while it partially emulates this for familiarity, it doesn’t aim for full Pandas compatibility. If

you have complicated existing Pandas code, it’s unlikely that you can simply switch out

Pandas for Dask.Dataframe and have everything work as expected. By contrast, this is

exactly the goal Modin is working toward: 100% coverage of Pandas. Modin can run on

top of Dask but was originally built to work with Ray, and that integration remains more

mature.

Dask vs. Vaex

https://www.datarevenue.com/en-blog/pandas-vs-dask-vs-vaex-vs-modin-vs-rapids-vs-ray 6/10
11/29/2020 Scaling Pandas: Dask vs Ray vs Modin vs Vaex vs RAPIDS

Dask (Dataframe) is not fully compatible with Pandas, but it’s pretty close. These close ties
Blog For Managers For Engineers Let's talk
mean that Dask also carries some of the baggage inherent to Pandas. Vaex deviates more

from Pandas (although for basic operations, like reading data and computing summary

statistics, it’s very similar) and therefore is also less constrained by it.

Ultimately, Dask is more focused on letting you scale your code to compute clusters, while

Vaex makes it easier to work with large datasets on a single machine. Vaex also provides

features to help you easily visualize and plot large datasets, while Dask focuses more on

data processing and wrangling.

Dask vs. RAPIDS (cuDF)

Dask and RAPIDS play nicely together via an integration provided by RAPIDS. If you have

a compute cluster, you should use Dask. If you have an NVIDIA graphics card, you should

use RAPIDS. If you have a compute cluster of NVIDIA GPUs, you should use both.

Ray vs. Modin or Vaex or RAPIDS

It’s not that meaningful to compare Ray to Modin, Vaex, or RAPIDS. Unlike the other

libraries, Ray doesn’t offer high-level APIs or a Pandas equivalent. Instead, Ray powers

Modin and integrates with RAPIDS in a similar way to Dask.

Modin vs. Vaex

As with the Dask and Vaex comparison, Modin’s goal is to provide a full Pandas

replacement, while Vaex deviates more from Pandas. Modin should be your first port of

call if you’re looking for a quick way to speed up existing Pandas code, while Vaex is more

likely to be interesting for new projects or specific use cases (especially visualizing large

datasets on a single machine).

Modin vs. RAPIDS (cuDF)

Modin scales Pandas code by using many CPU cores, via Ray or Dask. RAPIDS scales

Pandas code by running it on GPUs. If you have GPUs available, give RAPIDS a try. But

https://www.datarevenue.com/en-blog/pandas-vs-dask-vs-vaex-vs-modin-vs-rapids-vs-ray 7/10
11/29/2020 Scaling Pandas: Dask vs Ray vs Modin vs Vaex vs RAPIDS

the easiest win is likely to come from Modin, and you should probably turn to RAPIDS only
Blog For Managers For Engineers Let's talk
after you’ve tried Modin first.

Vaex vs. RAPIDS (cuDF)

Vaex and RAPIDS are similar in that they can both provide performance boosts on a

single machine: Vaex by better utilizing your computer’s hard drive and processor cores,

and RAPIDS by using your computer’s GPU (if it’s available and compatible). The RAPIDS

project as a whole aims to be much broader than Vaex, letting you do machine learning

end-to-end without the data leaving your GPU. Vaex is better for prototyping and data

exploration, letting you explore large datasets on consumer-grade machines.

Final remarks: Premature optimization is


the root of all evil
It’s fun to play with new, specialized tools. That said, many projects suffer from over-

engineering and premature optimization. If you haven’t run into scaling or efficiency

problems yet, there’s nothing wrong with using Python and Pandas on their own. They are

widely used and offer maturity and stability, along with simplicity.

You should only start looking into the libraries discussed here once you’ve reached the

limitations of Python and Pandas on their own. Otherwise you risk spending too much

time choosing and configuring libraries instead of making progress on your project.

At DataRevenue, we’ve built many projects with these libraries and know when and how to

use them. If you need a second opinion, reach out to us. We’re happy to help.

Get Notified of New Articles


Leave your email to get our weekly newsletter.

Enter your Email Subscribe

https://www.datarevenue.com/en-blog/pandas-vs-dask-vs-vaex-vs-modin-vs-rapids-vs-ray 8/10
11/29/2020 Scaling Pandas: Dask vs Ray vs Modin vs Vaex vs RAPIDS

Blog For Managers For Engineers Let's talk

Keep reading

Scaling Machine Machine Hire ML


Learning Learning Engineers, not
Pipelines explained Data Scientists
How to parallelize and visually Machine Learning
distribute your Python Engineers finally deliver
9 curated images,
machine learning on the promise of AI.
interactive tools and
pipelines with Luigi,
flowcharts that explain
Docker, and Kubernetes Read More
machine learning

Read More
Read More

Talk to us about how


machine learning can

https://www.datarevenue.com/en-blog/pandas-vs-dask-vs-vaex-vs-modin-vs-rapids-vs-ray 9/10
11/29/2020 Scaling Pandas: Dask vs Ray vs Modin vs Vaex vs RAPIDS

transform your
Blog For Managers For Engineers Let's talk

business.
Setup a call with our CEO markus@datarevenue.com

Our Team Strategy Day

Our Process Biotech & Pharma

Success Stories Digital Media

Blog eCommerce

Contact

2020 © Data Revenue Privacy & Terms

https://www.datarevenue.com/en-blog/pandas-vs-dask-vs-vaex-vs-modin-vs-rapids-vs-ray 10/10

You might also like