Delta Lake on Databricks
Delta Lake is an open-source storage layer that brings ACID transactions to Apache Spark
and big data workloads on data lakes.
Delta Lake provides ACID transactions on Spark and scalable metadata handling.
Delta Lake runs on top of your existing data lake and is fully compatible with the Apache Spark
APIs.
Delta Lake stores data in Parquet format and supports schema enforcement, time travel, and
upserts and deletes.
Delta Lake Features
An open-source storage format that brings ACID transactions to Apache Spark and big data
workloads.
ACID Transactions
Inserts/Updates/Deletes
Schema Enforcement
Time Travel
Unified batch and streaming
Performance
Compaction (OPTIMIZE)
Caching
Data Skipping
Indexing (Z-ordering)
_delta_log folder explanation:
Understanding the Delta Lake Transaction Log - Databricks Blog
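Each commit to a Delta table adds a zero-padded, numbered JSON file to the table's _delta_log folder. As a small sketch, the helper below (a hypothetical function, not part of any Delta API) parses the table version out of such a file name:

```python
# Commits appear in _delta_log/ as zero-padded JSON files, e.g.
# 00000000000000000000.json for version 0, 00000000000000000001.json for version 1.
def commit_version(filename: str) -> int:
    """Parse the table version out of a _delta_log commit file name."""
    return int(filename.split(".")[0])

# On Databricks you could list the folder itself with:
#   dbutils.fs.ls("/FileStore/tables/target/delta/_delta_log/")
print(commit_version("00000000000000000002.json"))  # → 2
```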
emp_csv_df = spark.read.csv("/FileStore/tables/emp.csv", header=True, inferSchema=True)
How to write in Parquet format?
emp_csv_df.write.format("parquet").mode("overwrite").save("/FileStore/tables/target/parquet")
How to write in Delta format?
emp_csv_df.write.format("delta").mode("overwrite").save("/FileStore/tables/target/delta/")
Difference between Parquet and Delta format?
When you overwrite a Parquet table, the existing files are removed and new files are created.
In Delta format, an overwrite instead creates new versioned files, and the transaction log keeps
track of each version.
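To see the versions that successive overwrites create, you can query the table's history. A minimal sketch, assuming the delta-spark package, a running SparkSession, and the example path used above:

```python
from delta.tables import DeltaTable

# Load the Delta table written above and show one row per commit (version).
dt = DeltaTable.forPath(spark, "/FileStore/tables/target/delta/")
dt.history().select("version", "timestamp", "operation").show()
```

Each overwrite shows up as a new `WRITE` row with an incremented version number.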
How to create a Delta Lake table in the metastore?
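A minimal sketch of registering the DataFrame as a managed Delta table in the metastore (requires a running SparkSession; the table name matches the one queried in the time-travel example):

```python
# saveAsTable registers a managed Delta table in the metastore,
# so it can be queried by name from SQL.
emp_csv_df.write.format("delta").mode("overwrite").saveAsTable("emp_delta_format")
```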
Once the table exists in the metastore, you can query an older snapshot with time travel:
%sql
select * from emp_delta_format timestamp as of '2023-11-07T08:58:19Z'
4. Run VACUUM.
7. The second DRY RUN command identifies the number of outstanding files that can be safely
deleted.
8. Subtract the outstanding number of files (second DRY RUN) from the original number of files to
get the number of files that were deleted.
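The last step above is simple arithmetic. A sketch with illustrative counts (the numbers are made up, not from a real VACUUM run):

```python
# Hypothetical counts from the VACUUM workflow described above.
original_files = 120     # file count before VACUUM (illustrative)
outstanding_files = 20   # files reported by the second DRY RUN (illustrative)

# Files actually deleted = original count minus outstanding count.
files_deleted = original_files - outstanding_files
print(files_deleted)  # → 100
```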
OPTIMIZE is optional: if a table has a large number of small files, you can run it on that table.
It logically removes the small files (say, 100 of them) and creates new compacted files with an
average size of about 1 GB.
Delta Lake provides an OPTIMIZE command that lets users compact small files into larger files,
so their queries are not burdened by small-file overhead. Small files accumulate because every
shuffle task can write multiple files across multiple partitions, which can become a performance
bottleneck. You can reduce the number of files by enabling optimized writes. To improve query
speed, Delta Lake also supports optimizing the layout of data in storage.
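A sketch of running compaction and Z-ordering on the example table (requires a Databricks or delta-spark environment; the Z-order column `deptno` is illustrative, assuming a typical emp schema):

```python
# Compact small files into larger ones (OPTIMIZE targets roughly 1 GB files by default).
spark.sql("OPTIMIZE emp_delta_format")

# Optionally co-locate related rows to improve data skipping.
# The column name here is an assumption for the example emp table.
spark.sql("OPTIMIZE emp_delta_format ZORDER BY (deptno)")
```

Z-ordering is most useful on high-cardinality columns that are frequently used in query filters.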