Introduction to Data Engineering with Delta
Your Hosts for Today
▪ If you have questions during the session, please post in the Chat Window
▪ We will have a number of Polls during the event - they will pop up, so please respond when they do
The Data Engineer’s Journey...
[Diagram: event streams are written continuously to tables while batch tables get compacted every hour; stream validation, updates & merge, and reprocessing steps feed a unified view for AI & reporting. Updates & merge get complex with a data lake. Overlaid question: "Can this be simplified?"]
A Data Engineer’s Dream...
[Diagram: a pipeline fed by streaming sources such as Kinesis.]
• Ability to handle late arriving data without having to delay downstream processing
So… What is the answer?
STRUCTURED STREAMING + Delta (with transactions) = Delta Architecture
Reliability
Performance
• Recommendation Engines
• Risk, Fraud Detection
• IoT & Predictive Maintenance
• Genomics & DNA Sequencing
Delta Lake ensures data reliability
[Diagram: batch jobs, streaming jobs, and updates/deletes all write to the same Delta table, stored as Parquet files plus a transactional log. The result: high quality & reliable data, always ready for analytics, and highly performant queries at scale.]
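A minimal sketch of this idea, assuming a SparkSession named spark (as predefined in a Databricks notebook), the Delta Lake library on the classpath, and hypothetical paths: batch and streaming jobs append to the same Delta table, and the transaction log keeps readers consistent while writes are in flight.

# Batch append to the table (hypothetical landing path)
spark.read.parquet("/data/landing/batch/") \
    .write.format("delta").mode("append").save("/delta/events")

# Streaming append to the same table; the checkpoint tracks stream progress
(spark.readStream.format("delta").load("/delta/landing_stream")
    .writeStream.format("delta")
    .option("checkpointLocation", "/delta/events/_chk")
    .start("/delta/events"))

# Concurrent readers always see a consistent snapshot of the table
spark.read.format("delta").load("/delta/events").count()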
Data Lifecycle
[Diagram: data from OLTP systems, Kinesis, and files (CSV, JSON, TXT…) lands in the Data Lake and flows through tables of increasing quality: Bronze (raw ingestion), Silver (filtered, cleaned, augmented), and Gold (business-level aggregates), feeding streaming analytics and AI & reporting. Bronze plays the role of the traditional staging area, and Gold the DW/OLAP layer.]
Data Lifecycle 🡪 Delta Lake Lifecycle
The same Bronze / Silver / Gold stages become data quality levels on Delta Lake tables.
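As a rough sketch of this lifecycle (not from the webinar itself): each hop is just a Spark job, batch or streaming, that reads one Delta table and writes the next. Paths, column names, and the choice of streaming for every hop are assumptions; spark is the SparkSession predefined in a Databricks notebook.

from pyspark.sql import functions as F

# Bronze: raw ingestion, kept as-is with long retention
raw = spark.readStream.format("delta").load("/delta/bronze/events")

# Silver: filtered, cleaned, augmented
silver = (raw.filter(F.col("event_id").isNotNull())
             .withColumn("event_date", F.to_date("event_time")))
(silver.writeStream.format("delta")
    .option("checkpointLocation", "/delta/silver/events/_chk")
    .start("/delta/silver/events"))

# Gold: business-level aggregates
gold = (spark.readStream.format("delta").load("/delta/silver/events")
        .groupBy("event_date").count())
(gold.writeStream.format("delta")
    .outputMode("complete")
    .option("checkpointLocation", "/delta/gold/daily_counts/_chk")
    .start("/delta/gold/daily_counts"))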
Bronze: Raw Ingestion
[Diagram: the Data Lake flow, highlighting the raw ingestion stage.]
• Dumping ground for raw data
• Often with long retention (years)
• Avoid error-prone parsing
Silver: Filtered, Cleaned, Augmented (the staging layer)
[Diagram: the Data Lake flow, highlighting the intermediate stage.]
• Intermediate data with some cleanup applied
• Query-able for easy debugging!
Delta Lake also supports batch jobs and standard DML
[Diagram: the same Bronze/Silver/Gold flow (with Gold as the DW/OLAP layer), annotated with INSERT, UPDATE, DELETE, and MERGE operations on the tables.]
Typical uses of DML on Delta tables:
• Retention
• GDPR
• Corrections
• UPSERTS
[Diagram: the same flow, annotated with DELETE operations.]
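As a hedged illustration of such DML (the table path, column names, and the updates DataFrame are hypothetical), the Delta Lake Python API exposes MERGE, DELETE, and UPDATE on a table object:

from delta.tables import DeltaTable

customers = DeltaTable.forPath(spark, "/delta/silver/customers")

# UPSERT (MERGE): update existing customers, insert new ones
(customers.alias("t")
    .merge(updates.alias("u"), "t.customer_id = u.customer_id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute())

# DELETE: e.g. GDPR erasure or retention policies
customers.delete("customer_id = 'c-123'")

# UPDATE: corrections in place
customers.update(condition="country = 'UK'", set={"country": "'United Kingdom'"})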
Connecting the dots...
[Diagram: Kinesis streams and files (CSV, JSON, TXT…) flow through the Data Lake to AI & reporting.]
What the pipeline needs → what Delta Lake provides:
• Ability to read consistent data while data is being written → Snapshot isolation between writers and readers
• Ability to read incrementally from a large table with good throughput → Optimized file source with scalable metadata handling
• Ability to replay historical data along with new data that arrived → Stream the backfilled historical data through the same pipeline
• Ability to handle late arriving data without having to delay downstream processing → Stream any late arriving data as it gets added to the table
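A small sketch of how the last two points play out in practice, with hypothetical paths: a downstream job streams from the table incrementally, and backfilled or late-arriving records only need to be appended to the source table for the existing stream to pick them up.

# Downstream job reads the table incrementally (optimized file source, scalable metadata)
(spark.readStream
    .format("delta")
    .load("/delta/silver/events")
    .writeStream
    .format("delta")
    .option("checkpointLocation", "/delta/gold/events_feed/_chk")
    .start("/delta/gold/events_feed"))

# Backfill historical data or land late records by simply appending to the source table;
# the running stream processes the new rows without any pipeline changes
late_df = spark.read.parquet("/data/backfill/2020-05-10/")
late_df.write.format("delta").mode("append").save("/delta/silver/events")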
Get Started with Delta using Spark APIs
Instead of parquet...

dataframe
  .write
  .format("parquet")
  .save("/data")

… simply say delta:

dataframe
  .write
  .format("delta")
  .save("/data")
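For completeness (hypothetical path; spark is the SparkSession predefined in a Databricks notebook), the same path can then be read back either as a batch DataFrame or as a streaming source:

df = spark.read.format("delta").load("/data")
stream_df = spark.readStream.format("delta").load("/data")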
Using Delta with your Existing Parquet Tables
Step 1: Convert Parquet to Delta Tables
Once converted, the table supports Delta features such as time travel, e.g. reading it as of an earlier timestamp:

spark.read.format("delta").option("timestampAsOf", timestamp_string).load("/events/")
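A hedged sketch of the conversion step itself, assuming an unpartitioned Parquet directory at the hypothetical /events/ path: CONVERT TO DELTA scans the existing files and writes a transaction log alongside them, after which the table can be read as Delta (including time travel as shown above).

# Convert an existing Parquet directory in place
spark.sql("CONVERT TO DELTA parquet.`/events/`")

# Partitioned tables need their partition schema spelled out, e.g.:
# spark.sql("CONVERT TO DELTA parquet.`/events/` PARTITIONED BY (date DATE)")

# Read back as Delta, optionally at an earlier point in time
df = (spark.read.format("delta")
      .option("timestampAsOf", timestamp_string)
      .load("/events/"))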
Databricks Ingest: Auto Loader
Load new data easily and efficiently as it arrives in cloud storage
[Diagram: Before — you wire a notification service and a message queue into your own stream and batch jobs; After — Auto Loader discovers new files for you. Also pictured: ingest sources such as Amazon Redshift and Amazon Athena.]
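A minimal Auto Loader sketch, assuming a Databricks runtime where the cloudFiles source is available; the bucket path, schema, and target table are hypothetical.

from pyspark.sql.types import StructType, StringType, TimestampType

# Schema for the incoming JSON files (streaming file sources need one up front)
schema = (StructType()
          .add("event_id", StringType())
          .add("event_time", TimestampType())
          .add("payload", StringType()))

# Auto Loader discovers new files in cloud storage and streams them into a Bronze Delta table
(spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .schema(schema)
    .load("s3://my-bucket/landing/events/")
    .writeStream
    .format("delta")
    .option("checkpointLocation", "/delta/bronze/events/_chk")
    .start("/delta/bronze/events"))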
Databricks Notebooks are a powerful developer environment to enable Dashboards, Workflows, and Jobs.
• Databricks Notebooks: Collaborative, multi-language, enterprise-ready developer environment.
• Notebook Dashboards: Create interactive dashboards from notebooks with just a few clicks.
• Notebook Workflows: Orchestrate multi-stage pipelines based on notebook dependencies.
• Notebook Jobs: Schedule, orchestrate, and monitor execution of notebooks.
Notebook Workflows enable orchestration of multi-stage pipelines based on notebook dependencies.
• Workflow Definition: APIs allow flexible definition of notebook dependencies, including parallel execution of notebooks.
• Workflow Execution: The Databricks Job scheduler manages the execution of Notebook workflows.
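A small sketch of such a workflow (only meaningful inside a Databricks notebook, where dbutils is predefined; the notebook paths and parameters are hypothetical):

from concurrent.futures import ThreadPoolExecutor

# Run a downstream notebook and capture its exit value (set via dbutils.notebook.exit in the callee)
status = dbutils.notebook.run("/Shared/etl/silver_cleanup", 3600, {"run_date": "2020-05-13"})

# Independent notebooks can be run in parallel from the driver
paths = ["/Shared/etl/load_orders", "/Shared/etl/load_customers"]
with ThreadPoolExecutor(max_workers=2) as pool:
    results = list(pool.map(lambda p: dbutils.notebook.run(p, 3600), paths))

# Final aggregation step once the parallel loads have finished
dbutils.notebook.run("/Shared/etl/gold_aggregates", 3600)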
Notebook Jobs enable scheduling and monitoring of Notebooks.
• Schedule Jobs: Jobs can be configured to execute on a schedule (e.g. daily, hourly).
• Orchestrate Jobs: In addition to Notebook Workflows, Jobs can also be orchestrated using third-party tools like Airflow.
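For example, a hedged Airflow sketch (DAG name, notebook path, and cluster settings are hypothetical; the import path assumes the Databricks provider package for Airflow 2 — older versions ship the operator under airflow.contrib):

from datetime import datetime
from airflow import DAG
from airflow.providers.databricks.operators.databricks import DatabricksSubmitRunOperator

with DAG("delta_pipeline", start_date=datetime(2020, 5, 1),
         schedule_interval="@daily", catchup=False) as dag:
    # Submit a one-time run that executes a Databricks notebook on a new cluster
    run_gold = DatabricksSubmitRunOperator(
        task_id="run_gold_aggregates",
        databricks_conn_id="databricks_default",
        json={
            "new_cluster": {
                "spark_version": "6.5.x-scala2.11",
                "node_type_id": "i3.xlarge",
                "num_workers": 2,
            },
            "notebook_task": {"notebook_path": "/Shared/gold_aggregates"},
        },
    )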
Users of Delta Lake
• 2 Regions • 10 Workspaces • 100+ Users • 50+ Scheduled Jobs • 1,000+ Notebooks • Scores of ML models
Use Cases: Finance • Assortment • Fresh Sales Tool • Fraud Engine • Personalization
• Improved reliability: Petabyte-scale jobs
• Faster iterations: Multiple wks to 5 min deploys!
• Improved performance: Queries run faster, >1 hr → < 6 sec
• Easier transactional updates: No downtime or consistency issues!
• Simple CDC: Easy with MERGE
• Data consistency and integrity: not available before
Databricks Cluster
Demo
Q&A
Your feedback is appreciated