12/04/2024, 09:37 DuckDB — What’s the Hype About?. This was a blog post that I already… | by Oliver Molander | Better Programming

DuckDB — What’s the Hype About?


This was a blog post that I already planned to write during the spring when I saw that
the hype around DuckDB started taking new heights. Since then the discussion
around DuckDB has only intensified in the developer and data engineering
community. I currently see two trends within the data community with high
engagement levels: DuckDB and Rust taking over data engineering. But what’s the
hype around DuckDB really about? Let’s scratch the surface a little bit.

Oliver Molander


Published in Better Programming
10 min read · Nov 16, 2022

DuckDB GitHub stars over time

A lot of today’s acceleration in the data space can be coupled with the explosive rise
of cloud data warehouses over the last few years. Cloud data warehouses have
become the cornerstone of data stacks: companies and organizations of all sizes use
a data warehouse to power analytics use cases. Snowflake’s meteoric rise, culminating
in its blockbuster IPO in September 2020 that became the largest software IPO in
history, has been the poster child of this trend.

https://betterprogramming.pub/duckdb-whats-the-hype-about-5d46aaa73196

When looking at the 3 Vs of Big Data (Velocity, Volume, Variety), many in the data
community that I’ve spoken with lately have said that the most required dimension
during the past years has been velocity.

As noted by Mehdi Ouazza (Staff Data Engineer at Trade Republic), the truth is that
not everyone has “Big” data, but low-latency consumption, from microservices, of data
assets processed out of your OLTP database is a common use case.

As Mehdi says, if one looks at some product trends (RocksDB, DuckDB, ClickHouse),
they all provide an easier interface for low-latency consumption. Even some cloud
data warehouse giants have invested in these applications, such as Snowflake
Unistore.

However, the current cloud data warehouse paradigm is still heavily skewed for a
client-server use case and ignores a growing segment of users. As noted by Tomasz
Tunguz (investor at Redpoint Ventures):

“Most workloads aren’t massive. Instead of requiring a scale-out database in the sky, most
analyses are faster with an optimized database on your computer that can leverage the
cloud when needed.”

DuckDB is changing this.

Google Trends search data for “DuckDB”

As visible through the Google Trends search data above — during the past few
months there has been a growing discussion and palpable hype around DuckDB in
the data community.

The growing momentum


The growing momentum behind DuckDB becomes evident just by looking at posts
on social media. For example, Robert Sahlin (Data Engineering Lead at MatHem) noted
the following on LinkedIn back in July:

“I’ve heard a lot of good things about DuckDB lately and have found podcasts with both
Jordan Tigani (founder of MotherDuck and BigQuery celebrity) and Hannes Mühleisen
(creator founder of DuckDB Labs) really good. Hence I had to give it a try. My first
program was to create a DuckDB table by reading directly from a BigQuery table using the
BigQuery Storage Read API since it supports arrow tables (and no compute). Turned out
to be really easy, sharing as a gist. Can’t wait to experiment some more with DuckDB and
with bigger data volumes, it sure has huge potential.”

And Robert is definitely not the only one excited about DuckDB, judging by social
media posts (I recommend just doing a search on the hashtag #duckdb on LinkedIn
or Twitter). For example, Abhishek Choudhary (Senior Lead Data Engineer at Bayer)
recently wrote the following on LinkedIn:

“Opinion: One of the most exciting new technologies for Data Engineering/ Data Science is
DuckDB. DuckDB is insanely fast and with Apache Arrow, the duo is capable of delivering
astonishing results. Another important point behind DuckDB is it’s simple. It doesn’t
claim any groundbreaking stuff but sticks to the core of simple and faster data access.”

However, my favorite social media comment about DuckDB is most likely by Josh
Wills, in a Twitter thread that discusses the “How Snowflake fails” blog post by the
always entertaining Benn Stancil (I recommend subscribing to his Substack):

Find below some additional screenshots of tweets showcasing the interest that
DuckDB currently catalyzes in the data community:

Image source Madrona

Building a managed solution on top of DuckDB


It’s a pretty classic playbook — take an open source tool showcasing momentum and
build a service on top of it. E.g. Databricks did this with Spark and Confluent with
Kafka.

Jordan Tigani, long-time product leader of BigQuery at Google (a BigQuery celebrity
as noted by Robert Sahlin earlier), announced in May that he’s co-founding a
serverless cloud version of DuckDB called MotherDuck. Joining him is his Google
colleague Tino Tereshko.

Besides MotherDuck, we have DuckDB Labs, which is a commercial company
formed by Hannes Mühleisen and the other creators of DuckDB in July 2021 to
provide support, custom extensions, and even custom versions of the product as a
way to monetize it.

As Lauren Balik noted in her “6 Reality-Based Predictions for Data in 2023” blog post
— venture capitalists and data professionals are right to be flocking to DuckDB.

This interest materialized yesterday with MotherDuck announcing their $47.5M
funding round led by e.g. a16z (early investors in e.g. Databricks) and Redpoint
Ventures (early investors in e.g. Snowflake). MotherDuck and DuckDB Labs also
announced a strategic partnership at the same time.

Jordan Tigani (co-founder at MotherDuck) commented the following to TechCrunch
when announcing the funding round:

“Users want easy and fast answers to their questions — they don’t want to wait for the
cloud… The fact is that a modern laptop is faster than a modern data warehouse. Cloud
data vendors are focused on the performance of 100TB queries, which is not only
irrelevant for the vast majority of users, but also distracts from vendors’ ability to deliver a
great user experience.”

But what’s this hype all about? Let’s scratch the surface a little bit.

DuckDB is an easy-to-use open source in-process OLAP database (that processes
data in memory and doesn’t require a dedicated server/service) — described by
many in simplified terms as the SQLite equivalent for analytical OLAP workloads.

On HackerNoon, it was once described as “mutant offspring of SQLite and Redshift”.

As noted by the MotherDuck team, as an in-process database, DuckDB is a storage
and compute engine that enables developers, data scientists, data engineers and
data analysts to power their code with extremely fast analyses using plain SQL.
Further, DuckDB has the capability to analyze data where it might live, e.g. on the
laptop or in the cloud. Additionally, DuckDB comes with a simple CLI for quick
prototyping, without the need for setup, permissions, creating and managing
tables, etc.

Based on reading threads on e.g. HackerNews, Reddit and Twitter, there seems to be
a lot to like about DuckDB, e.g.:

Its performance for analytical workloads on single-node machines seems to be
impressive and the setup is pain-free (you can technically start exploring
DuckDB within 5 minutes).

DuckDB is embeddable, like SQLite, and is optimized for analytics. The big
deal here is the embeddable part (like a library without bringing in the typical
PostgreSQL dependency), eliminating the network latency you usually get when
talking to a database.

DuckDB also has really low deployment effort: `pip install duckdb` and you are
off to the races.

Further, DuckDB is fast: compared to querying Postgres, DuckDB is 80X faster,
and when benchmarking other systems we can see similarly impressive results.

These are some of the reasons DuckDB has witnessed impressive growth over the
past 12 months.

In practice, any CPU can be mobilized to perform powerful analytics via DuckDB.
Further, DuckDB is portable and modular, with no external dependencies. In
concrete terms, this means that you can run DuckDB on a cloud virtual machine, in
a cloud function, in the browser, or on your laptop as mentioned prior.

Let’s take a step back

In the following section, I’m borrowing heavily from Kojo Osei’s (investor at Matrix
Partners) great blog post from June about DuckDB.

As Kojo mentions, an emerging category of data warehouses sits at the intersection
of analytical queries and embedded deployments. To illustrate why this is so
compelling, he categorizes databases along two axes:

Database workload types (image source Kojo Osei)

As visible above and noted by Kojo, current databases are optimized for analytical or
transactional workloads. Analytical workloads, also called Online Analytical
Processing (OLAP), are complex queries on historical data. For example, you may
want to analyze user signups broken down by demographics such as age and
location. On the other hand, transactional workloads, also referred to as Online
Transaction Processing (OLTP), consist of quick real-time reads and writes.

Let’s move ahead to deployment types.


Database deployment types (image source Kojo Osei)

As visible above and noted by Kojo, current database technologies are deployed as
stand-alone or embedded solutions. Stand-alone databases are typically deployed in
a client-server paradigm. The database sits on a centralized server and is queried by
a client application. Embedded databases run within the host process of whatever
application is accessing the database.

Now some magic. When we merge these two axes we can see an innovation gap! As
Kojo underlines, current innovation in OLAP databases has focused on stand-alone
OLAP databases such as Snowflake, ClickHouse, and Redshift (don’t know why he
left out BigQuery). This has led us to a situation where embedded analytics use cases
have been overlooked and underserved. DuckDB is changing this.

Image source Kojo Osei

Use cases for DuckDB


Airbyte has in their glossary a short summary of example use cases for DuckDB:

Ultra-fast analytical use cases locally. E.g., the Airbyte glossary includes a taxi
example with 10 years and 1.5 billion rows of taxi data that still works on a laptop.
See benchmarks here.

It can be used as an SQL wrapper with zero copies (on top of Parquet files in S3).

Bring your data to the users instead of having big roundtrips and latency by
doing REST calls. Instead, you can put data inside the client. You can do 60
frames per second as data is where the query is.

DuckDB on Kubernetes for a zero-copy layer to read S3 in the Data Lake!
Inspired by this Tweet. The cheapest and fastest option to get started.

Based on documentation, DuckDB should be used when:

Processing and storing tabular datasets, e.g. from CSV or Parquet files

Doing interactive data analysis, e.g. joining and aggregating multiple large tables

Having concurrent large changes to multiple large tables, e.g. appending rows,
adding/removing/updating columns

Having large result set transfer to client

Based on documentation, DuckDB should not be used when:

Having high-volume transactional use cases (e.g. tracking e-commerce orders)

Writing to a single database from multiple concurrent processes

Having large client/server installations for centralized enterprise data
warehousing

To learn more about use cases for DuckDB, listen to this Data Engineering
Podcast episode with Hannes Mühleisen, one of the creators of DuckDB (the use case
discussion starts at around the 14-minute mark).

Final thoughts

There are many database management systems out there. But as noted by the
DuckDB creators: there is no one-size-fits-all database system. All make different
trade-offs to better adjust to specific use cases. DuckDB is no different.

When you think about selecting a database engine for your project you typically
consider options focused on serving multiple concurrent users. Sometimes what
you really need is an embedded database that is blazing fast for single-user
workloads. Enter DuckDB.

Further, it seems like DuckDB also allows an entire community of SQL enthusiasts to
be instantly productive in Python without ever learning more than very basic
Pandas. There’s a growing number of data community members who never use
Pandas for anything complex anymore because they favor SQL.

Luis Velasco (Data Solution Lead at Google) summarized well on LinkedIn a few
months back why he thinks DuckDB is a big deal:

1. We are living in the great disaggregation of the central data platforms era. The
most extreme decentralized compute paradigm I can think of is a grid of laptops.
The combo of technology like Parquet + PyArrow with vectorized execution makes
it efficient to query large datasets on personal devices.

2. With increased data literacy and improved coding skills in the vast majority of
data workers, insight consumption is far from static — dashboards — but
exploratory and self-service. So I envision data analysts accessing data in cloud
storage, running embedded analysis locally with duckDB.

3. SQL is more alive than ever before — period.

4. Zero deployment effort — `pip install duckdb` and you are in

5. Open Source — There is a vibrant community forming, with support in key pieces
like pandas, dbt or Apache Superset, not to mention new startups like DuckDB Labs
and MotherDuck

What do you think about the future of DuckDB?

PS: I recommend watching this “What is DuckDB” video by The Seattle Data Guy,
where he and Joseph Machado (Senior Data Engineer at LinkedIn) discuss how
DuckDB has taken the world of data by storm.
