DuckDB - What's The Hype About - This Was A Blog Post That I Already - by Oliver Molander - Better Programming
A lot of today’s acceleration in the data space can be coupled with the explosive rise
of cloud data warehouses over the last few years. Cloud data warehouses have
become the cornerstone of data stacks: companies and organizations of all sizes use
https://betterprogramming.pub/duckdb-whats-the-hype-about-5d46aaa73196 1/19
12/04/2024, 09:37 DuckDB — What’s the Hype About?. This was a blog post that I already… | by Oliver Molander | Better Programming
When looking at the 3 Vs of Big Data (Velocity, Volume, Variety), many in the data
community that I’ve spoken with lately have said that the most required dimension
during the past years has been velocity.
As noted by Mehdi Ouazza (Staff Data Engineer at Trade Republic), the truth is that
not everyone has “Big” data, but a requirement for low-latency consumption
by microservices of data assets processed out of your OLTP database is a
common use case.
As Mehdi says, if one looks at some product trends (RocksDB, DuckDB, ClickHouse),
they all provide an easier interface for low-latency consumption. Even some cloud
data warehouse giants have invested in these applications, such as Snowflake
Unistore.
However, the current cloud data warehouse paradigm is still heavily skewed toward a
client-server use case and ignores a growing segment of users. As noted by Tomasz
Tunguz (investor at Redpoint Ventures):
“Most workloads aren’t massive. Instead of requiring a scale-out database in the sky, most
analyses are faster with an optimized database on your computer that can leverage the
cloud when needed.”
As visible through the Google Trends search data above — during the past few
months there has been a growing discussion and palpable hype around DuckDB in
the data community.
“I’ve heard a lot of good things about DuckDB lately and have found podcasts with both
Jordan Tigani (founder of MotherDuck and BigQuery celebrity) and Hannes Mühleisen
(creator founder of DuckDB Labs) really good. Hence I had to give it a try. My first
program was to create a DuckDB table by reading directly from a BigQuery table using the
BigQuery Storage Read API since it supports arrow tables (and no compute). Turned out
to be really easy, sharing as a gist. Can’t wait to experiment some more with DuckDB and
with bigger data volumes, it sure has huge potential.”
And Robert is definitely not the only one excited about DuckDB when reading social
media posts (I recommend just doing a search on the hashtag #duckdb on LinkedIn
or Twitter). E.g. Abhishek Choudhary (Senior Lead Data Engineer at Bayer) recently
wrote the following on LinkedIn:
“Opinion: One of the most exciting new technologies for Data Engineering/ Data Science is
DuckDB. DuckDB is insanely fast and with Apache Arrow, the duo is capable of delivering
astonishing results. Another important point behind DuckDB is it’s simple. It doesn’t
claim any groundbreaking stuff but sticks to the core of simple and faster data access.”
However, my favorite social media comment about DuckDB is most likely by Josh
Wills, in a Twitter thread that discusses the “How Snowflake fails” blog post by the
always entertaining Benn Stancil (I recommend subscribing to his Substack):
Find below some additional screenshots of tweets showcasing the interest that
DuckDB currently catalyzes in the data community:
As Lauren Balik noted in her “6 Reality-Based Predictions for Data in 2023” blog post
— venture capitalists and data professionals are right to be flocking to DuckDB.
“Users want easy and fast answers to their questions — they don’t want to wait for the
cloud… The fact is that a modern laptop is faster than a modern data warehouse. Cloud
data vendors are focused on the performance of 100TB queries, which is not only
irrelevant for the vast majority of users, but also distracts from vendors’ ability to deliver a
great user experience.”
But what’s this hype all about? Let’s scratch the surface a little bit.
DuckDB is an easy-to-use, open source, in-process OLAP database (it runs inside the
host application’s process and doesn’t require a dedicated server/service), described by
many in simplified terms as the SQLite equivalent for analytical (OLAP) workloads.
Based on reading threads on e.g. HackerNews, Reddit and Twitter, there seems to be
a lot to like about DuckDB, e.g.:
DuckDB is embeddable — like SQLite — and is optimized for analytics. The big
deal here is the embeddable part (like a library without bringing in the typical
PostgreSQL dependency), eliminating the network latency you usually get when
talking to a database.
DuckDB also has very low deployment effort: `pip install duckdb` and you are
off to the races.
These are some of the reasons DuckDB has witnessed impressive growth over the
past 12 months.
In practice, any CPU can be mobilized to perform powerful analytics via DuckDB.
Further, DuckDB is portable and modular, with no external dependencies. In
concrete terms, this means that you can run DuckDB on a cloud virtual machine, in
a cloud function, in the browser, or on your laptop as mentioned prior.
Let’s take a step back
In the following section, I’m borrowing heavily from Kojo Osei’s (investor at Matrix
Partners) great blog post from June about DuckDB.
As visible above and noted by Kojo, current databases are optimized for analytical or
transactional workloads. Analytical workloads — also called Online Analytical
Processing (OLAP) — are complex queries on historical data. For example, you may
want to analyze user signups broken down by demographics such as age and
location. On the other hand, transactional workloads, also referred to as Online
Transaction Processing (OLTP), consist of quick real-time reads and
writes.
As visible above and noted by Kojo, current database technologies are deployed as
stand-alone or embedded solutions. Stand-alone databases are typically deployed in
a client-server paradigm. The database sits on a centralized server and is queried by
a client application. Embedded databases run within the host process of whatever
application is accessing the database.
Now some magic. When we merge these two axes we can see an innovation gap! As
Kojo underlines, current innovation in OLAP databases has focused on stand-alone
OLAP databases such as Snowflake, ClickHouse, and Redshift (don’t know why he
left out BigQuery). This has led us to a situation where embedded analytics use cases
have been overlooked and underserved. DuckDB is changing this.
Ultra-fast local analytics. E.g., the Airbyte glossary includes a taxi example with
10 years of data (1.5 billion rows) that still works on a laptop.
Bring your data to the users instead of incurring big roundtrips and latency from
REST calls: you can put the data inside the client. You can do 60
frames per second because the data is where the query is.
Processing and storing tabular datasets, e.g. from CSV or Parquet files
Doing interactive data analysis, e.g. joining and aggregating multiple large tables
Making concurrent large changes to multiple large tables, e.g. appending rows,
adding/removing/updating columns
To learn more about use cases for DuckDB, listen to this Data Engineering
Podcast episode with Hannes Mühleisen, one of the creators of DuckDB (the use case
discussion starts at around 14 minutes).
Final thoughts
There are many database management systems out there. But as noted by the
DuckDB creators: there is no one-size-fits-all database system. Each makes different
trade-offs to better suit specific use cases. DuckDB is no different.
When you think about selecting a database engine for your project you typically
consider options focused on serving multiple concurrent users. Sometimes what
you really need is an embedded database that is blazing fast for single-user
workloads. Enter DuckDB.
Further, it seems like DuckDB also allows an entire community of SQL enthusiasts to
be instantly productive in Python without ever learning more than very basic
Pandas. There’s a growing number of data community members who never use
Pandas for anything complex anymore because they favor SQL.
Luis Velasco (Data Solution Lead at Google) summarized well on LinkedIn a few
months back why he thinks DuckDB is a big deal:
1. We are living in the great disaggregation of the central data platforms era. The
most extreme decentralized compute paradigm I can think of is a grid of laptops.
The combo of technologies like Parquet + PyArrow with vectorized execution makes
it efficient to query large datasets on personal devices.
2. With increased data literacy and improved coding skills in the vast majority of
data workers, insight consumption is far from static — dashboards — but
exploratory and self-service. So I envision data analysts accessing data in cloud
storage, running embedded analysis locally with DuckDB.
5. Open Source — There is a vibrant community forming, with support in key pieces
like pandas, dbt or Apache Superset, not to mention new startups like DuckDB Labs
and MotherDuck
PS: I recommend watching this “What is DuckDB” video by The Seattle Data Guy,
in which he and Joseph Machado (Senior Data Engineer at
LinkedIn) discuss how DuckDB has taken the data world by storm.