
Delta Lake vs. Parquet
Abhinav Prakash
10 min read · Jan 24, 2024


If Delta Lake tables also use Parquet files to store data, how are they different
(and better) than vanilla Parquet tables?

This was a confusion that clouded my understanding of Delta Lake. In this
article, we will start from the basics. I hope that by the end of it, you will
understand how Delta Lake is better than a data lake, even one that uses
Parquet files to store data.

What is a Delta Lake table?

Delta Lake is open source software that extends Parquet data files with a file-based
transaction log for ACID transactions and scalable metadata handling. A Delta
Lake table is a table created and managed using Delta Lake technology.

Now let's move on to the next logical question:

What advantage do we get when our tables use ‘Delta Lake technology’ as
opposed to vanilla Parquet files?

Delta Lake has all the benefits of Parquet tables plus many other critical
features. To understand this, let's compare the basic structure of a Parquet
table and a Delta table.

Characteristics of Parquet files:


Parquet is an immutable, binary, columnar file format with several
advantages compared to a row-based format like CSV. Here are the core
advantages of Parquet files compared to CSV:

The columnar nature of Parquet files allows query engines to cherry-pick
individual columns. For row-based file formats, query engines must read all
the columns, even those irrelevant to the query.

Parquet files contain schema information in the metadata, so the query
engine doesn't need to infer the schema and the user doesn't need to
manually specify it when reading the data.

Columnar file formats like Parquet files are more compressible than row-
based file formats.

Parquet files store data in row groups. Each row group has min/max
statistics for each column. These statistics allow query engines to skip
entire row groups for certain queries, which can be a huge performance
gain when reading data.

Parquet files are immutable, discouraging the antipattern of manually
updating source data.

A single Parquet file for a small dataset works much better than a CSV file.
But data engineers often deal with large datasets split across multiple
Parquet files, and managing multiple Parquet files isn't great.

Here’s what a bunch of Parquet files look like on disk.

some_folder/
  file1.parquet
  file2.parquet
  ...
  fileN.parquet

Let's look at some of the challenges of storing data in multiple Parquet files:

No ACID transactions for Parquet data lakes

It is not easy to delete rows from Parquet tables

No DML transactions

There is no change data feed

Slow file listing overhead

Expensive footer reads to gather statistics for file skipping

There is no way to rename, reorder, or drop columns without rewriting the whole table

Now that we have seen both the capabilities and the limitations of a table
that uses Parquet files, we can understand that a Delta Lake table builds on
the capabilities of Parquet files while addressing their limitations.

Delta Lake makes managing Parquet tables easier and faster. Delta Lake is
also optimized to prevent you from corrupting your table. Let's look at
how a Delta Lake is structured to understand how it provides these features.

The basic structure of a Delta Lake


Delta Lake stores metadata in a transaction log and table data in Parquet
files. Here are the contents of a Delta table.

some_folder/
  _delta_log/
    00.json
    01.json
    ...
    n.json
  file1.parquet
  file2.parquet
  ...
  fileN.parquet


Now that we understand how a vanilla Parquet table stores and manages
data, and how a Delta table does the same, we can look at how Delta tables
handle situations where a Parquet table falls short.

1. Delta Lake vs. Parquet: file listing


When you want to read a Parquet lake, you must perform a file listing
operation and then read all the data. You can't read the data until you've
listed all the files.

File listing operations are especially slow for data stored in cloud object stores.

Delta Lake stores the paths to Parquet files in the transaction log to avoid
performing an expensive file listing. Delta Lake doesn't need to list all the Parquet
files in the cloud object store to fetch their paths; it can simply look them up in
the transaction log. Hence, it's better to rely on the transaction log to get the paths
to the files in a table instead of performing a file listing operation.
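
To make this concrete, here is a minimal PySpark sketch. It assumes an active SparkSession named spark with the delta-spark package configured (as in the restore example later in this article) and a hypothetical table path. Both readers point at the same directory, but only the Parquet reader has to list the files:

# The Parquet reader must list every file under the directory before reading;
# the Delta reader resolves the file paths from the _delta_log transaction log.
df_parquet = spark.read.format("parquet").load("/data/some_folder")
df_delta = spark.read.format("delta").load("/data/some_folder")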

2. Delta Lake vs. Parquet: small file problem


Big data systems that are incrementally updated can create a lot of small
files.

Data processing engines don't perform well when reading datasets with
many small files. You typically want files that are between 64 MB and 1 GB.
You don't want tiny 1 KB files that require excessive I/O overhead.

Data Engineers will commonly want to compact the small files into larger
files with a process referred to as “small file compaction”. If you’re working
with a plain vanilla Parquet data lake, you need to write the small file
compaction code yourself.

With Delta Lake, you can simply run the OPTIMIZE command, and Delta Lake will
handle the small file compaction for you.
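
As a sketch (same assumptions as before: an active SparkSession named spark with delta-spark configured, and a hypothetical table path), compaction is a one-liner with the Python API:

from delta.tables import DeltaTable

delta_table = DeltaTable.forPath(spark, "/data/some_folder")  # hypothetical path
delta_table.optimize().executeCompaction()  # rewrites many small files into fewer large ones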

3. Delta Lake vs. Parquet: ACID transactions


Suppose you’re appending a large amount of data to an existing Parquet lake,
and your cluster dies in the middle of the write operation. Then, you’ll have
several partially written Parquet files in your table.

The partially written files will break any subsequent read operations. The
compute engine will try to read in the corrupt files and error out. You’ll need
to manually identify all the corrupted files and delete them to fix your lake.

Delta Lake supports transactions, so a write operation that errors out midway will
never corrupt your Delta Lake. If a cluster dies while writing to a Delta table,
Delta Lake will simply ignore the partially written files, and subsequent reads
won't break.
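
For illustration, this is what such a write looks like in PySpark (df and the path are hypothetical). The append either commits atomically to the transaction log or, if the cluster dies midway, leaves files that readers will never see:

# Atomic append: readers only see files once the commit lands in _delta_log.
df.write.format("delta").mode("append").save("/data/some_folder")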

4. Delta Lake vs. Parquet: file skipping


Let us explore a scenario where your table has 1,000 Parquet files and your
query only needs to access data present in 10 of them.

Parquet files store metadata for row groups in the footer, so fetching all
the footers and building the file-level metadata for the entire table is slow
and tedious. It also requires a file listing operation, which, as we discussed,
is slow. So Parquet doesn't support file-level skipping (although row-group
filtering within a file is possible).

Delta tables store metadata about the underlying Parquet files in the transaction
log. It's quick to read the transaction log of a Delta table and figure out which
files can be skipped, so Delta supports file skipping.

5. Delta Lake vs. Parquet: predicate pushdown filtering


Let us explore a scenario where you're running a filtering operation and
would like to find all values where col4 = 65. If a Parquet file has a max
col4 value of 34, you know that file doesn't have any relevant data for your
query, and you can skip it entirely. This works well when you have a single
Parquet file to read.
If you have 10,000 files, you don’t want to have to read in all the file footers,
gather the statistics for the overall lake, and then run the query. That’s way
too much overhead.

Delta Lake stores the metadata statistics in the transaction log, so the query engine
doesn't need to read all the individual files and gather the statistics before running
a query. It's way more efficient to fetch the statistics from the transaction log.
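
As a sketch (active SparkSession spark, hypothetical path and column), the query engine does this skipping automatically when you filter:

# The filter is compared against the per-file min/max statistics recorded in
# the transaction log, so files whose col4 range cannot contain 65 are never opened.
matches = (spark.read.format("delta")
    .load("/data/some_folder")
    .where("col4 = 65"))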

6. Delta Lake vs. Parquet: Z Order indexing


Skipping is much more efficient when the data is Z Ordered. More data can
be skipped when similar data is co-located.

It’s not easy to Z Order the data in a Parquet table while it’s easy to Z Order
the data in a Delta table.
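
A minimal sketch of Z Ordering a Delta table with the Python API (hypothetical path and column):

from delta.tables import DeltaTable

delta_table = DeltaTable.forPath(spark, "/data/some_folder")  # hypothetical path
delta_table.optimize().executeZOrderBy("col4")  # co-locates rows with similar col4 values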

7. Delta Lake vs. Parquet: renaming columns


Parquet files are immutable, so you can't modify a file to update a column
name. If you want to change a column name, you have to read the data into a
DataFrame, change the name, and then rewrite the entire file. Renaming a
column can be an expensive computation, and there isn't a quick way to
update the column name of a Parquet table.

Delta Lake abstracts physical column names from logical column names. The
physical column name is the actual column name in the Parquet file. The logical
column name is the column name humans use when referencing the column.

Delta Lake lets users quickly rename columns by changing the logical column
name, a pure-metadata operation. It’s just a simple entry in the Delta transaction
log.
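
Here is a rough sketch of a rename via Spark SQL (the table path and column names are hypothetical). Note that column mapping has to be enabled on the table once before columns can be renamed:

# One-time property change to enable column mapping on the table.
spark.sql("""
  ALTER TABLE delta.`/data/some_folder` SET TBLPROPERTIES (
    'delta.columnMapping.mode' = 'name',
    'delta.minReaderVersion' = '2',
    'delta.minWriterVersion' = '5')
""")

# The rename itself is a pure-metadata operation.
spark.sql("ALTER TABLE delta.`/data/some_folder` RENAME COLUMN old_name TO new_name")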

8. Delta Lake vs. Parquet: dropping columns


To drop a column from a Parquet table, you are required to read all the
data, drop the column with a query engine, and then rewrite all the data. It's
an extensive computation for a relatively small operation.

Delta Lake also allows you to drop a column quickly. You can add an entry to the
Delta transaction log and instruct Delta to ignore columns on future operations —
it’s a pure metadata operation.
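
A sketch of the same idea for dropping a column (hypothetical path and column name; it needs the same column mapping property shown in the rename example above):

# Pure-metadata drop: the data files are not rewritten.
spark.sql("ALTER TABLE delta.`/data/some_folder` DROP COLUMN obsolete_col")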

9. Delta Lake vs. Parquet: schema enforcement


You usually want to allow appending DataFrames with a schema that
matches the existing table and to reject appends of DataFrames with
schemas that don’t match.

With Parquet tables, you need to code this schema enforcement manually.
You can append DataFrames with any schema to a Parquet table by default
(unless they’re registered with a metastore and schema enforcement is
provided via the metastore).

Delta Lakes have built-in schema enforcement, which saves you from costly errors
that can corrupt your Delta Lake. You can also bypass schema enforcement in
Delta tables and change the schema of a table over time.
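
As an illustration (mismatched_df and the path are hypothetical), an append whose schema doesn't match the table fails instead of silently corrupting it:

# Raises an AnalysisException because the DataFrame's columns don't match the table's schema.
mismatched_df.write.format("delta").mode("append").save("/data/some_folder")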

10. Delta Lake vs. Parquet: schema evolution


Sometimes, you’d like to add additional columns to your Delta Lake. Perhaps
you rely on a data vendor, and they’ve added a new column to your data feed.
You’d prefer not to rewrite all the existing data with a blank column just so
that you can add a new column to your table. You’d just like to write the new
data with the additional column and keep all the existing data as is.

Delta Lake allows for schema evolution so you can seamlessly add new columns to
your dataset without running big computations.
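
A minimal sketch of opting in to schema evolution for a single append (new_df, which carries the vendor's additional column, and the path are hypothetical):

# mergeSchema tells Delta Lake to add the new column to the table schema.
(new_df.write
    .format("delta")
    .mode("append")
    .option("mergeSchema", "true")
    .save("/data/some_folder"))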

11. Delta Lake vs. Parquet: check constraints


You may want to apply custom SQL checks to columns to ensure that data
appended to a table is of a specified form. For example, you might want to
ensure that a string column matches a certain regular expression pattern, or
that a column does not contain NULL values.

Parquet tables don’t support check constraints like Delta Lake does.
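
For example, a check constraint can be added with Spark SQL roughly like this (the table path and email column are hypothetical):

# Future appends that violate the constraint are rejected.
spark.sql("""
  ALTER TABLE delta.`/data/some_folder`
  ADD CONSTRAINT email_not_null CHECK (email IS NOT NULL)
""")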

12. Delta Lake vs. Parquet: versioned data


Delta tables can have many versions, and users can easily “time travel”
between the different versions. This comes in handy for regulatory
requirements, audit purposes, experimentation, and rolling back mistakes.

Versioned data also impacts how engines execute certain transactions. For
example, when you “overwrite” a Delta table, you don’t physically remove files
from storage. You simply mark the existing files as deleted, but don’t actually
delete them. This is referred to as a “logical delete”.

Parquet tables don't support versioned data. When you remove data from a
Parquet table, you actually delete it from storage, which is referred to as a
“physical delete”. Overwriting a Parquet table is irreversible, while it's easy
to undo an overwrite transaction in a Delta table.

Versioned data also allows you to easily switch between different versions of your
Delta Lake, which is referred to as time travel.

Parquet tables don’t support time travel.
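
A sketch of time travel in PySpark (hypothetical path; the version number and timestamp are examples):

# Read the table as of a specific version or a specific point in time.
v0 = spark.read.format("delta").option("versionAsOf", 0).load("/data/some_folder")
jan = (spark.read.format("delta")
    .option("timestampAsOf", "2024-01-01")
    .load("/data/some_folder"))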

13. Delta Lake vacuum command


When using Delta tables, you can remove files that are no longer referenced by a
Delta table and are older than the retention threshold by running the vacuum
command on the table. vacuum is not triggered automatically, and the default
retention threshold for the files is 7 days.

It won't delete all of the files older than 7 days, of course. Files that are
still required by the current version of the Delta table won't get deleted,
however old they are.
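
A minimal sketch of running vacuum with the Python API (hypothetical path):

from delta.tables import DeltaTable

delta_table = DeltaTable.forPath(spark, "/data/some_folder")  # hypothetical path
delta_table.vacuum()     # uses the default 7-day retention threshold
delta_table.vacuum(168)  # or pass a retention period explicitly, in hours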

14. Delta Lake rollback


Delta Lake also makes it easy to reset your entire lake to an earlier version.
Let’s say you inserted some data on a Wednesday and realized it was
incorrect. You can easily roll back the entire Delta Lake to the state on
Tuesday, effectively undoing all the mistakes you made on Wednesday.

RESTORE TABLE your_table TO VERSION AS OF 1

# the PySpark equivalent, using restoreToVersion
from delta.tables import *

deltaTable = DeltaTable.forPath(spark, "/tmp/delta-table")
deltaTable.restoreToVersion(1)

15. Delta Lake vs. Parquet: deleting rows


Delta Lake makes it easy to perform a minimal delete operation, whereas it’s
not easy to delete rows from a Parquet lake.

Let us look at a scenario where you have a user who would like their account
deleted and all their data removed from your systems. You have some of
their data stored in your table. Your table has 50,000 files, and this particular
customer has data in 10 of those files.
If you have a Parquet table, the only convenient operation is to read all the
data, filter out the data for that particular user, and then rewrite the entire
table. That will take a long time!

Delta Lake makes it easy to run a delete command and will efficiently rewrite the
ten impacted files without the customer data. Delta Lake also makes it easy to
write a file that flags the rows that are deleted (deletion vectors), which makes this
operation run even faster.
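
A sketch of such a delete with the Python API (hypothetical path and user id):

from delta.tables import DeltaTable

delta_table = DeltaTable.forPath(spark, "/data/some_folder")  # hypothetical path
delta_table.delete("user_id = 'user-123'")  # rewrites only the files containing this user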

16. Delta Lake vs. Parquet: merge transactions


Delta Lake provides a powerful merge command that allows you to update
rows, perform upserts, build slowly changing dimension tables, and more.

If you work with a Parquet table, you don't have access to a merge
command. You need to implement all the low-level merge details yourself,
which is challenging and time consuming.
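
A minimal upsert sketch with the Python merge builder (the path, join key, and updates_df source DataFrame are hypothetical):

from delta.tables import DeltaTable

target = DeltaTable.forPath(spark, "/data/some_folder")  # hypothetical path
(target.alias("t")
    .merge(updates_df.alias("s"), "t.user_id = s.user_id")
    .whenMatchedUpdateAll()      # update rows that already exist in the target
    .whenNotMatchedInsertAll()   # insert rows that don't
    .execute())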

17. Delta Lake Change Data Feed


The Delta Lake Change Data Feed (CDF) allows you to automatically track
Delta table row-level changes. This is useful for auditing, quality control,
debugging, and intelligent downstream updates.
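
A rough sketch of enabling and reading the change data feed (hypothetical path; the starting version is an example):

# Enable CDF on an existing table, then read the row-level changes.
spark.sql("""
  ALTER TABLE delta.`/data/some_folder`
  SET TBLPROPERTIES (delta.enableChangeDataFeed = true)
""")

changes = (spark.read.format("delta")
    .option("readChangeFeed", "true")
    .option("startingVersion", 1)
    .load("/data/some_folder"))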

Conclusion

These were all the ways a Delta Lake table is better than a Parquet table, even
though both store data in the Parquet file format. I hope this helped you
understand the differences and appreciate how a Lakehouse architecture
(built on Delta Lake technology) helps us by combining the best of both
worlds: data warehouses and data lakes.
References:

https://docs.delta.io/latest/index.html

Delta Lake vs. Parquet Comparison (delta.io): This post compares the strengths and weaknesses of Delta Lake vs. Parquet.

Data Warehouse vs. Data Lake vs. Data Lakehouse: An Overview of Three Cloud Data Storage Patterns (www.striim.com)
