
Why Delta Lake?

 Challenges in implementing a data lake:


 Missing ACID properties.
 Lack of Schema enforcement.
 Lack of Consistency.
 Lack of Data Quality.
 Too many small files.
 Corrupted data due to frequent job failures in prod.
What is Delta Lake?

 Delta Lake is an open-source storage layer that brings ACID transactions to Apache Spark
and big data workloads on Data Lakes.
 Delta Lake provides ACID transactions on Spark and scalable metadata handling.
 Delta Lake runs on top of your existing data lake and is fully compatible with Apache Spark
APIs.
 Delta Lake supports the Parquet format, schema enforcement, time travel, and upserts and
deletes.
Delta Lake Features
An open-source storage format that brings ACID transactions to Apache Spark and big data
workloads.

 Open Format: Stored as Parquet format in blob storage.


 ACID Transactions: Ensures data integrity and read consistency with complex, concurrent
data pipelines.
 Schema Enforcement and Evolution: Ensures data cleanliness by blocking writes with
unexpected schema changes.
 Audit History: History of all the operations that happened in the table.
 Time Travel: Query previous versions of the table by time or version number.
 Deletes and upserts: Supports deleting and upserting into tables with programmatic APIs.
 Scalable Metadata Management: Able to handle millions of files by scaling the metadata
operations with Spark.
 Unified Batch and Streaming Source and Sink: A table in Delta Lake is both a batch table
and a streaming source and sink. Streaming data ingest, batch historic backfill, and
interactive queries all just work out of the box (see the sketch after this list).
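A minimal sketch of using one Delta table as both a streaming source and sink (the checkpoint and copy paths here are assumed examples, not from these notes):

# Read an existing Delta table as a stream and continuously copy it to another Delta path.
stream_df = spark.readStream.format("delta").load("/FileStore/tables/target/delta/")
query = (stream_df.writeStream
    .format("delta")
    .option("checkpointLocation", "/FileStore/tables/target/_checkpoints/emp_copy")
    .start("/FileStore/tables/target/delta_copy/"))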
Data Reliability

 ACID Transactions
 Insert/Updates/Deletes
 Schema Enforcement
 Time Travel
 Unified batch and streaming
Performance

 Compaction (OPTIMIZE)
 Caching
 Data Skipping
 Indexing (Z-ordering)
_delta_log folder Explanation:
Understanding the Delta Lake Transaction Log - Databricks Blog
emp_csv_df = spark.read.csv("/FileStore/tables/emp.csv", header=True, inferSchema=True)
How to write parquet format?
emp_csv_df.write.format('parquet').mode('overwrite').save('/FileStore/tables/target/parquet')
How to write delta format?
emp_csv_df.write.format('delta').mode('overwrite').save('/FileStore/tables/target/delta/')
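After the Delta write, you can peek at the _delta_log folder mentioned above (dbutils and display are Databricks notebook utilities; this is just an illustrative check):

# Each numbered JSON file (00000000000000000000.json, 00000000000000000001.json, ...) is one commit/version.
display(dbutils.fs.ls("/FileStore/tables/target/delta/_delta_log/"))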
Difference between Parquet and Delta format?
When you overwrite a Parquet output, the existing files are removed and replaced with new ones, but in
Delta format an overwrite creates new version files, so earlier versions of the table remain available.
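A quick way to see those versions (a sketch using the path written above):

# Each overwrite shows up as a new version in the table history.
spark.sql("DESCRIBE HISTORY delta.`/FileStore/tables/target/delta/`") \
    .select("version", "timestamp", "operation").show(truncate=False)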
How to create a Delta Lake table in the metastore?
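A sketch of two common ways to do this, using the emp_delta_format table name queried later in these notes:

# Option 1: register a managed Delta table directly from the DataFrame.
emp_csv_df.write.format("delta").mode("overwrite").saveAsTable("emp_delta_format")

# Option 2: register an external table over the Delta files already written to storage.
spark.sql("CREATE TABLE IF NOT EXISTS emp_delta_format USING DELTA LOCATION '/FileStore/tables/target/delta/'")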

What is schema enforcement in Delta Lake?
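A minimal sketch of enforcement in action, assuming a hypothetical extra column that the existing table does not have:

from pyspark.sql.functions import lit

# Hypothetical DataFrame whose schema does not match the existing Delta table.
emp_extra_df = emp_csv_df.withColumn("bonus", lit(1000))

try:
    # Without mergeSchema, Delta Lake rejects the append because the schemas differ.
    emp_extra_df.write.format("delta").mode("append").save("/FileStore/tables/target/delta/")
except Exception as e:
    print("Write rejected by schema enforcement:", e)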

Delta Lake Schema Evolution | Delta Lake


We have the option("mergeSchema", "true") setting to enable schema evolution:
emp_csv_df.write.format('delta').mode('append').option("mergeSchema", "true").save('/FileStore/tables/target/delta')

Delta Lake Time Travel and Audit Log


Delta Lake Time Travel | Delta Lake
%sql
select * from emp_delta_format version as of 1

%sql
select * from emp_delta_format timestamp as of '2023-11-07T08:58:19Z'
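The same time travel reads via the DataFrame API (a sketch, reusing the path written earlier):

# Read a specific version, or the table state as of a timestamp.
df_v1 = spark.read.format("delta").option("versionAsOf", 1).load("/FileStore/tables/target/delta/")
df_ts = spark.read.format("delta").option("timestampAsOf", "2023-11-07T08:58:19Z").load("/FileStore/tables/target/delta/")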

What is Delta Lake Restore


How to Rollback a Delta Lake Table to a Previous Version with Restore | Delta Lake
RESTORE TABLE dept TO VERSION AS OF 1
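An equivalent rollback through the Python API (a sketch; restoreToVersion requires a Delta Lake release that includes it):

from delta.tables import DeltaTable

# Roll the dept table back to version 1, same effect as the SQL RESTORE above.
DeltaTable.forName(spark, "dept").restoreToVersion(1)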

What is Delta Lake Vacuum


Remove unused data files with vacuum
VACUUM deletes files older than 7 days (168 hours) by default. If you want to delete files newer than 7 days
(168 hours), you must first disable the retention duration check with the configuration below.
spark.conf.set("spark.databricks.delta.retentionDurationCheck.enabled", "false")
VACUUM dept RETAIN 0 HOURS
You can remove data files no longer referenced by a Delta table that are older than the retention
threshold by running the VACUUM command on the table. Running VACUUM regularly is
important for cost and compliance because of the following considerations:

Deleting unused data files reduces cloud storage costs.


Data files removed by VACUUM might contain records that have been modified or deleted.
Permanently removing these files from cloud storage ensures these records are no longer accessible.
The default retention threshold for data files after running VACUUM is 7 days. To change this
behavior, see Configure data retention for time travel queries.
Databricks recommends using predictive optimization to automatically run VACUUM for Delta
tables. See Predictive optimization for Delta Lake.
1. Run VACUUM with the DRY RUN option to determine the number of files eligible for deletion. Replace
<table-path> with the actual table path location.

%python
spark.sql("VACUUM delta.`<table-path>` DRY RUN")


The DRY RUN option tells VACUUM that it should not delete any files. Instead, DRY RUN prints the
number of files and directories that are safe to delete. The intention in this step is not to delete
the files, but to know the number of files eligible for deletion.
The example DRY RUN command returns an output which tells us that there are x files and
directories that are safe to be deleted.
Found x files and directories in a total of y directories that are safe to delete.
You should record the number of files identified as safe to delete.

2. Run VACUUM.

3. Cancel VACUUM after one hour.

4. Run VACUUM with DRY RUN again.

5. The second DRY RUN command identifies the number of outstanding files that can be safely
deleted.

6. Subtract the outstanding number of files (second DRY RUN) from the original number of files to
get the number of files that were deleted.

What is Delta Lake Optimize

OPTIMIZE is an option you can run when a table has a large number of small files: it logically removes
the small files (for example, 100 small files) and rewrites them into new files with an average file size
of about 1 GB.
Delta Lake provides an OPTIMIZE command that lets users compact the small files into larger files,
so their queries are not burdened by the small file overhead. Small files accumulate because every
shuffle task can write multiple files in multiple partitions, which can become a performance bottleneck.
You can reduce the number of files by enabling optimized writes. To improve query speed, Delta Lake
supports the ability to optimize the layout of data in storage.
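A minimal sketch of running it (the Z-ORDER column dept_id is just an assumed example, not from these notes):

# Compact small files into larger ones.
spark.sql("OPTIMIZE dept")

# Optionally co-locate related data by a column to improve data skipping.
spark.sql("OPTIMIZE dept ZORDER BY (dept_id)")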
