Delta Lake vs. Parquet
Abhinav Prakash
10 min read · Jan 24, 2024
If Delta Lake tables also use Parquet files to store data, how are they different
(and better) than vanilla Parquet tables?
Delta Lake is open source software that extends Parquet data files with a file-based
transaction log for ACID transactions and scalable metadata handling. A Delta
Lake table is a table created and managed using Delta Lake technology.
What advantage do we get when our tables use ‘Delta Lake technology’ as
opposed to vanilla Parquet files?
Delta Lake has all the benefits of Parquet tables plus many other critical
features. To understand this, let’s compare the basic structure of a Parquet
table and a Delta table.
Columnar file formats like Parquet compress better than row-based file
formats.
Parquet files store data in row groups, and each row group carries min/max
statistics for every column. These statistics let query engines skip entire
row groups for specific queries, which can be a huge performance gain when
reading data.
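For intuition, here is a minimal sketch using pyarrow to inspect those per-row-group statistics; the file name file1.parquet is just a placeholder:

import pyarrow.parquet as pq

# Read only the footer metadata; no row data is loaded.
metadata = pq.ParquetFile("file1.parquet").metadata
for rg in range(metadata.num_row_groups):
    row_group = metadata.row_group(rg)
    for col in range(row_group.num_columns):
        chunk = row_group.column(col)
        stats = chunk.statistics  # the min/max stats engines use for skipping
        if stats is not None:
            print(rg, chunk.path_in_schema, stats.min, stats.max)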
A single Parquet file for a small dataset works much better than a CSV file.
But data engineers often deal with large datasets split across multiple
Parquet files, and managing many separate Parquet files isn’t great.
some_folder/
  file1.parquet
  file2.parquet
  …
  fileN.parquet
Let’s look at some of the challenges of storing data in multiple Parquet files,
each of which is covered in detail below:
- No ACID or DML transactions (no DELETE, UPDATE, or MERGE)
- Expensive file-listing operations to discover the table’s files
- No schema enforcement or schema evolution
- No versioning, so no time travel and no way to undo a bad write
- Small file compaction has to be hand-rolled
Now that we have seen the capabilities as well as the limitations of a table
that uses Parquet files, we can see how a Delta Lake table builds on the
capabilities of Parquet files while addressing their limitations.
Delta Lake makes managing Parquet tables easier and faster. Delta Lake is
also optimized to prevent you from corrupting your data table. Let’s look at
how a Delta table is structured to understand how it provides these features.
some_folder/
  _delta_log/
    00.json
    01.json
    …
    n.json
  file1.parquet
  file2.parquet
  …
  fileN.parquet
Now that we can visualize how a vanilla Parquet table stores and manages
data, and how a Delta table does, we can look at how Delta tables handle
situations where a Parquet table falls short.
Delta Lake stores the paths to Parquet files in the transaction log to avoid
performing an expensive file listing. It doesn’t need to list all the Parquet files
in the cloud object store to fetch their paths; it can simply look them up in the
transaction log. It’s much better to rely on the transaction log for the paths to
a table’s files than to perform a file-listing operation.
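As a quick sketch with the deltalake (delta-rs) Python package, the file paths come straight out of the transaction log, with no listing of the object store; some_folder is the placeholder table path from above:

from deltalake import DeltaTable

dt = DeltaTable("some_folder")
print(dt.files())  # Parquet file paths in the current table version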
Data processing engines don’t perform well when reading datasets with
many small files. You typically want files that are between 64 MB and 1 GB.
You don’t want tiny 1 KB files that require excessive I/O overhead.
Data engineers commonly compact the small files into larger files with a
process referred to as “small file compaction”. If you’re working with a plain
vanilla Parquet data lake, you need to write the small file compaction code
yourself.
With Delta Lake, you can simply run the OPTIMIZE command, and Delta Lake will
handle the small file compaction for you.
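A minimal sketch, assuming an existing SparkSession named spark with delta-spark configured and a Delta table registered as events (both names are placeholders):

# Compact the table's small files into larger ones in one command.
spark.sql("OPTIMIZE events")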
Imagine a write to a plain Parquet table that errors out partway through,
leaving partially written files behind. Those partially written files will break
any subsequent read operations: the compute engine will try to read the
corrupt files and error out. You’ll need to manually identify all the corrupted
files and delete them to fix your lake.
Delta Lake supports transactions, so a write operation that errors out midway
through will never corrupt a Delta table. If a cluster dies while writing to a
Delta table, Delta Lake simply ignores the partially written files, and
subsequent reads won’t break.
Since Parquet files store row-group metadata in their footers, fetching all
the footers and building file-level metadata for an entire table is slow and
tedious. It also requires a file-listing operation, which, as discussed above, is
expensive. So Parquet doesn’t support file-level skipping (though row-group
filtering within a file is possible).
Delta tables store metadata about the underlying Parquet files in the
transaction log. It’s quick to read the transaction log of a Delta table and
figure out which files can be skipped, so Delta supports file skipping.
Delta Lake stores the metadata statistics in the transaction log, so the query
engine doesn’t need to read all the individual files and gather statistics before
running a query. It’s way more efficient to fetch the statistics from the
transaction log.
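A hedged sketch with the deltalake package, which exposes the per-file metadata recorded in the transaction log as a flat Arrow table (some_folder is again a placeholder path):

from deltalake import DeltaTable

dt = DeltaTable("some_folder")
# Each row describes one data file: path, size, and min/max statistics.
print(dt.get_add_actions(flatten=True).to_pandas())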
It’s not easy to Z-Order the data in a Parquet table, while it’s easy to Z-Order
the data in a Delta table.
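In Delta Lake, Z-Ordering is a single command layered on OPTIMIZE; with plain Parquet you would have to re-sort and rewrite the files yourself. Table and column names below are placeholders:

# Cluster the data files by user_id so related rows land together.
spark.sql("OPTIMIZE events ZORDER BY (user_id)")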
Delta Lake abstracts physical column names away from logical column names.
The physical column name is the actual column name in the Parquet file. The
logical column name is the name humans use when referencing the column.
Delta Lake lets users quickly rename columns by changing the logical column
name, a pure-metadata operation. It’s just a simple entry in the Delta transaction
log.
Delta Lake also allows you to drop a column quickly. You can add an entry to the
Delta transaction log and instruct Delta to ignore columns on future operations —
it’s a pure metadata operation.
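A sketch of both metadata-only operations, assuming a SparkSession and a Delta table named events; column mapping must be enabled first, and all table and column names are placeholders:

# Enable column mapping so renames/drops don't rewrite Parquet files.
spark.sql("""
    ALTER TABLE events SET TBLPROPERTIES (
        'delta.columnMapping.mode' = 'name',
        'delta.minReaderVersion' = '2',
        'delta.minWriterVersion' = '5')
""")
spark.sql("ALTER TABLE events RENAME COLUMN userid TO user_id")
spark.sql("ALTER TABLE events DROP COLUMN obsolete_field")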
Schema enforcement is another gap. By default you can append DataFrames
with any schema to a Parquet table (unless the table is registered with a
metastore and the metastore provides schema enforcement). With Parquet
tables, you need to code schema enforcement manually.
Delta Lakes have built-in schema enforcement, which saves you from costly errors
that can corrupt your Delta Lake. You can also bypass schema enforcement in
Delta tables and change the schema of a table over time.
Delta Lake allows for schema evolution so you can seamlessly add new columns to
your dataset without running big computations.
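A minimal sketch of schema evolution, assuming df is a DataFrame containing one or more new columns and some_folder is the placeholder table path:

(df.write.format("delta")
   .mode("append")
   .option("mergeSchema", "true")  # fold new columns into the table schema
   .save("some_folder"))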
Consider a scenario where you want to ensure that a string column matches a
certain regular expression pattern, or that a column never contains NULL
values. Parquet tables don’t support check constraints like Delta Lake does.
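A sketch of such a constraint on a Delta table; events and email are placeholder names, and Delta will reject any write that violates the expression:

spark.sql("""
    ALTER TABLE events ADD CONSTRAINT valid_email
    CHECK (email IS NOT NULL AND email RLIKE '^[^@]+@[^@]+$')
""")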
Versioned data impacts how engines execute certain transactions. For
example, when you “overwrite” a Delta table, you don’t physically remove
files from storage; you simply mark the existing files as deleted. This is
referred to as a “logical delete”.
Parquet tables don’t support versioned data. When you remove data from a
Parquet table, you actually delete it from storage, which is referred to as a
“physical delete”. Overwriting a Parquet table is irreversible, while it’s easy
to undo an overwrite transaction on a Delta table.
Versioned data also allows you to easily switch between different versions of your
Delta Lake, which is referred to as time travel.
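A minimal time-travel sketch: read an older snapshot by version number or timestamp (the path and values are placeholders):

old_df = (spark.read.format("delta")
          .option("versionAsOf", 0)  # or .option("timestampAsOf", "2024-01-01")
          .load("some_folder"))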
Logically deleted files are only physically removed when you run the vacuum
command on the table. vacuum is not triggered automatically, and the default
retention threshold for the files is 7 days.
It won’t delete all of the data older than 7 days, of course. If a file is still
required by the current version of the Delta table, it won’t get deleted.
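A sketch of the command, assuming a SparkSession and the placeholder path from above; the RETAIN clause makes the default 7-day threshold explicit:

# Physically delete files no longer referenced within the retention window.
spark.sql("VACUUM delta.`some_folder` RETAIN 168 HOURS")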
Let us look at a scenario where you have a user who would like their account
deleted and all their data removed from your systems. You have some of
their data stored in your table. Your table has 50,000 files, and this particular
customer has data in 10 of those files.
If you have a Parquet table, the only real option is to read all the data, filter
out that user’s records, and then rewrite the entire table. That will take a
long time!
Delta Lake makes it easy to run a delete command and will efficiently rewrite the
ten impacted files without the customer data. Delta Lake also makes it easy to
write a file that flags the rows that are deleted (deletion vectors), which makes this
operation run even faster.
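A sketch of the targeted delete, assuming delta-spark is available; user_id and the path are placeholders:

from delta.tables import DeltaTable

dt = DeltaTable.forPath(spark, "some_folder")
dt.delete("user_id = 'user-123'")  # rewrites only the impacted files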
If you work with a Parquet table, you don’t have access to a merge command.
You need to implement all the low-level merge details yourself, which is
challenging and time-consuming.
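For contrast, a hedged sketch of an upsert with Delta’s merge API, assuming updates is a DataFrame of incoming rows and id is the join key (all names are placeholders):

from delta.tables import DeltaTable

target = DeltaTable.forPath(spark, "some_folder")
(target.alias("t")
    .merge(updates.alias("s"), "t.id = s.id")
    .whenMatchedUpdateAll()     # update rows that already exist
    .whenNotMatchedInsertAll()  # insert brand-new rows
    .execute())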
Conclusion
These were all the ways a Delta Lake table is better than a Parquet table,
even though both store data in the Parquet file format. I hope this helped
you understand the differences and appreciate how a Lakehouse architecture
(built on Delta Lake technology) helps us by combining the best of both
worlds: data warehouses and data lakes.
References:
https://docs.delta.io/latest/index.html