
Welcome!

Introduction to
Data Engineering
With Delta
Your Hosts for Today

Joel Roland, SA Manager, ANZ
Noble Raveendran, Solutions Architect
Agenda

▪ Overview of Data Engineering powered by Delta Lake


▪ Demo
▪ Q&A
Housekeeping

▪ If you have questions during the session, please post them in the Chat
Window

▪ We will run a number of polls during the event - they will pop up, so
please respond when they do
The Data Engineer’s Journey...

[Diagram: streaming events flow through validation into a unified view feeding AI & reporting,
while batch tables (data written continuously, compacted every hour) are folded in via update &
merge. Pain points: updates & merges get complex with a data lake, and reprocessing is hard.]
The Data Engineer’s Journey...

[Same diagram as the previous slide, overlaid with the question: "Can this be simplified?"]
A Data Engineer’s Dream...

Process data continuously and incrementally as new data arrives, in a
cost-efficient way, without having to choose between batch and streaming

[Diagram: Kinesis streams and CSV, JSON, TXT… files land in a data lake feeding AI & reporting]
What’s missing?
[Diagram: Kinesis and CSV, JSON, TXT… sources land in a data lake; a "?" sits between the lake and AI & reporting]

• Ability to read consistent data while data is being written

• Ability to read incrementally from a large table with good throughput

• Ability to rollback in case of bad writes

• Ability to replay historical data alongside new data that arrives

• Ability to handle late arriving data without having to delay downstream processing
So… What is the answer?

Delta Lake + Structured Streaming = Delta Architecture

1. Unify batch & streaming with a continuous data flow model
2. Infinite retention to replay/reprocess historical events as needed
3. Independent, elastic compute and storage to scale while balancing costs
Let’s try it instead with Delta Lake
Well… what is Delta Lake?
Data reliability challenges with data lakes
✗ Failed production jobs leave data in a corrupt state, requiring tedious recovery

✗ Lack of schema enforcement creates inconsistent and low-quality data

✗ Lack of consistency makes it almost impossible to mix appends and reads, batch and streaming
Performance challenges with data lakes
• Too many small or very big files - more time is spent opening &
closing files than reading their contents (worse with streaming).

• Partitioning, aka "poor man's indexing", breaks down if you pick the
wrong fields or when data has many dimensions or high-cardinality columns.

• No caching - cloud object storage throughput is low
(20-50 MB/s/core vs 300 MB/s/core for local SSDs).
A New Standard for Building Data Lakes

Open Format Based on Parquet

With Transactions

Apache Spark APIs


Delta Lake: makes data ready for analytics

Reliability + Performance enable Data Science & ML use cases:

• Recommendation Engines
• Risk, Fraud Detection
• IoT & Predictive Maintenance
• Genomics & DNA Sequencing
Delta Lake ensures data reliability

[Diagram: batch, streaming, and update/delete workloads all write to Parquet files plus a
transactional log, producing high-quality, reliable data that is always ready for analytics]

Key Features:
● ACID Transactions
● Schema Enforcement
● Unified Batch & Streaming
● Time Travel/Data Snapshots
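These features are easiest to see in code. Below is a minimal PySpark sketch (not from the deck) in which a batch job and a streaming job append to the same hypothetical /delta/events table, with schema enforcement rejecting a bad write; all paths and column names are assumptions.

# Minimal sketch: one Delta table written by both batch and streaming, protected by the
# same transaction log. Paths and columns are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("delta-reliability-sketch").getOrCreate()

# Batch write: creates the table and fixes its schema (id: long, eventType: string).
batch_df = spark.createDataFrame([(1, "click"), (2, "view")], ["id", "eventType"])
batch_df.write.format("delta").mode("append").save("/delta/events")

# Schema enforcement: an append with a mismatched schema is rejected instead of
# silently corrupting the table.
bad_df = spark.createDataFrame([("oops",)], ["unexpected_column"])
try:
    bad_df.write.format("delta").mode("append").save("/delta/events")
except Exception as error:
    print("Rejected by schema enforcement:", error)

# Unified batch & streaming: a streaming job appends to the very same table, and each
# micro-batch is committed as an ACID transaction.
stream_df = (spark.readStream.format("rate").load()
             .selectExpr("value AS id", "'generated' AS eventType"))
(stream_df.writeStream
    .format("delta")
    .option("checkpointLocation", "/delta/_checkpoints/events")
    .start("/delta/events"))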
Delta Lake optimizes performance
[Diagram: the Databricks optimized engine runs highly performant queries at scale over the
same Parquet files and transactional log]

Key Features:
● Indexing
● Compaction
● Data skipping
● Caching
Now, let’s try it with Delta Lake

*Data Quality Levels*

[Diagram: Kinesis and CSV, JSON, TXT… sources in the data lake flow through Bronze (raw
ingestion) → Silver (filtered, cleaned, augmented) → Gold (business-level aggregates) tables,
feeding streaming analytics and AI & reporting. Quality increases from left to right.]

Delta Lake allows you to incrementally improve the quality of your data
until it is ready for consumption.
What does this remind you of?

Data Lifecycle

[Same Bronze → Silver → Gold diagram as above]
Data Lifecycle

[The diagram builds up the traditional analogy step by step: Bronze ↔ OLTP,
Silver ↔ Staging, Gold ↔ DW/OLAP]
Data Lifecycle 🡪 Delta Lake Lifecycle
*Data Quality Levels*

[Same Bronze → Silver → Gold diagram]

Delta Lake allows you to incrementally improve the quality of your data
until it is ready for consumption.
OLTP

[Same diagram, highlighting the Bronze (raw ingestion) tables]
• Dumping ground for raw data
• Often with long retention (years)
• Avoid error-prone parsing
Staging

[Same diagram, highlighting the Silver (filtered, cleaned, augmented) tables]
Intermediate data with some cleanup applied
Query-able for easy debugging!
DW/OLAP

[Same diagram, highlighting the Gold (business-level aggregates) tables]
Clean data, ready for consumption
Read with Spark and other SQL Query Engines
[Same diagram, highlighting the streams that connect Bronze → Silver → Gold]
Streams move data through the Delta Lake
• Low-latency or manually triggered
• Eliminates management of schedules and jobs
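As a hedged sketch of such a stream (not from the deck), the job below moves records from a hypothetical Bronze Delta table to a Silver one with some cleanup applied; the table paths, columns, and cleanup rules are assumptions.

# Assumes an active SparkSession named spark (as in the deck's own snippets) and
# hypothetical /delta/bronze/events and /delta/silver/events tables.
from pyspark.sql.functions import col, to_timestamp

# Read the Bronze Delta table as a stream: newly committed data becomes new micro-batches.
bronze = spark.readStream.format("delta").load("/delta/bronze/events")

# Light cleanup/augmentation on the way to Silver.
silver = (bronze
          .where(col("eventType").isNotNull())                  # drop obviously bad rows
          .withColumn("eventTime", to_timestamp("eventTime")))  # normalise types

# Write continuously into the Silver Delta table.
(silver.writeStream
    .format("delta")
    .option("checkpointLocation", "/delta/_checkpoints/silver_events")
    .outputMode("append")
    .start("/delta/silver/events"))

# For the "manually triggered" variant, add .trigger(once=True) before .start() and run the
# same code on a schedule; the checkpoint ensures only new data is processed each run.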
[Same diagram, annotated with the DML operations INSERT, UPDATE, DELETE, MERGE, and
OVERWRITE applied at different stages]
Delta Lake also supports batch jobs
and standard DML
• Retention
• Corrections
• GDPR
• UPSERTS
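As a hedged illustration (table names are assumptions, and spark is an active SparkSession), standard DML against Delta tables looks like ordinary SQL:

# Retention: age out old raw data from the Bronze table.
spark.sql("DELETE FROM bronze_events WHERE eventDate < add_months(current_date(), -24)")

# GDPR: remove a specific customer's records on request.
spark.sql("DELETE FROM silver_events WHERE customerId = 'C-123'")

# Corrections: fix previously loaded values in place.
spark.sql("UPDATE silver_events SET country = 'Australia' WHERE country = 'AUS'")

# UPSERTS are handled with MERGE INTO, shown later in the deck.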
[Same diagram, with DELETE applied to the Silver and Gold tables]

Easy to recompute when business logic changes:


• Clear tables
• Restart streams
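A minimal sketch of that recompute pattern (not from the deck; the table names, paths, and the new logic are hypothetical, and spark is an active SparkSession):

from pyspark.sql.functions import col

def apply_new_business_logic(df):
    # Hypothetical stand-in for the revised cleanup/augmentation rules.
    return df.where(col("eventType").isNotNull())

# 1. Clear the derived tables.
spark.sql("DELETE FROM silver_events")
spark.sql("DELETE FROM gold_daily_aggregates")

# 2. Restart the stream with a fresh checkpoint: reading the Bronze Delta table as a stream
#    replays its full history through the new logic.
bronze = spark.readStream.format("delta").load("/delta/bronze/events")
(apply_new_business_logic(bronze)
    .writeStream
    .format("delta")
    .option("checkpointLocation", "/delta/_checkpoints/silver_events_v2")  # new checkpoint
    .outputMode("append")
    .start("/delta/silver/events"))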
Delta Architecture
Connecting the dots...

[Diagram: Kinesis and CSV, JSON, TXT… sources land in the data lake; Delta Lake fills in the
"?" between the lake and AI & reporting]

• Ability to read consistent data while data is being written
  → Snapshot isolation between writers and readers

• Ability to read incrementally from a large table with good throughput
  → Optimized file source with scalable metadata handling

• Ability to rollback in case of bad writes
  → Time travel

• Ability to replay historical data alongside new data that arrives
  → Stream the backfilled historical data through the same pipeline

• Ability to handle late-arriving data without having to delay downstream processing
  → Stream late-arriving data as it is added to the table
Get Started with Delta using Spark APIs
Instead of parquet...

CREATE TABLE ...
USING parquet
...

dataframe
  .write
  .format("parquet")
  .save("/data")

…simply say delta

CREATE TABLE ...
USING delta
...

dataframe
  .write
  .format("delta")
  .save("/data")
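Reading the table back is just as symmetric; a small hedged sketch using the same hypothetical /data path and a made-up table name:

# Load the Delta table as a DataFrame.
df = spark.read.format("delta").load("/data")
df.show()

# Or register it once and query it with SQL.
spark.sql("CREATE TABLE IF NOT EXISTS events USING delta LOCATION '/data'")
spark.sql("SELECT count(*) FROM events").show()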
Using Delta with your Existing Parquet Tables
Step 1: Convert Parquet to Delta Tables

CONVERT TO DELTA parquet.`path/to/table` [NO STATISTICS]


[PARTITIONED BY (col_name1 col_type1, col_name2 col_type2, ...)]

Step 2: Optimize Layout for Fast Queries


OPTIMIZE events
WHERE date >= current_timestamp() - INTERVAL 1 day
ZORDER BY (eventType)
Upsert/Merge: Fine-grained Updates
MERGE INTO customers -- Delta table
USING updates
ON customers.customerId = updates.customerId
WHEN MATCHED THEN
UPDATE SET address = updates.address
WHEN NOT MATCHED THEN
INSERT (customerId, address) VALUES (updates.customerId, updates.address)
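For reference, a hedged sketch of the same upsert through the Delta Lake Python API (delta-spark); the table and source names mirror the SQL above and are assumptions:

from delta.tables import DeltaTable

customers = DeltaTable.forName(spark, "customers")   # the Delta table being upserted into
updates_df = spark.table("updates")                  # hypothetical source of changed rows

(customers.alias("customers")
    .merge(updates_df.alias("updates"),
           "customers.customerId = updates.customerId")
    .whenMatchedUpdate(set={"address": "updates.address"})
    .whenNotMatchedInsert(values={
        "customerId": "updates.customerId",
        "address": "updates.address"})
    .execute())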
Time Travel
Reproduce experiments & reports:

SELECT count(*) FROM events TIMESTAMP AS OF timestamp

SELECT count(*) FROM events VERSION AS OF version

spark.read.format("delta").option("timestampAsOf", timestamp_string).load("/events/")

Rollback accidental bad writes:

INSERT INTO my_table
SELECT * FROM my_table TIMESTAMP AS OF date_sub(current_date(), 1)
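The DataFrame reader accepts a version number as well; a small hedged sketch (the path and version are assumptions):

# Read the table exactly as it was at version 42 of its transaction log.
df_v42 = spark.read.format("delta").option("versionAsOf", 42).load("/events/")

# Compare against the current state, e.g. to audit an accidental bad write.
print(df_v42.count(), spark.read.format("delta").load("/events/").count())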
Databricks Ingest: Auto Loader
Load new data easily and efficiently as it arrives in cloud storage

Before:
● Notification service + message queue feeding a stream
● Batch jobs on a delayed schedule
● External triggers / Airflow file sensors
● Gets too complicated for multiple jobs

After (Auto Loader):
● Pipe data from cloud storage into Delta Lake as it arrives
● "Set and forget" model eliminates complex setup

Launch Blog Post
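A hedged sketch of an Auto Loader stream (the cloudFiles source, available on Databricks); the bucket path, file format, and schema are assumptions:

from pyspark.sql.types import StructType, StructField, StringType, TimestampType

event_schema = StructType([
    StructField("id", StringType()),
    StructField("eventType", StringType()),
    StructField("eventTime", TimestampType()),
])

# Incrementally pick up new files as they land in cloud storage...
raw_stream = (spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")   # format of the incoming files
    .schema(event_schema)
    .load("s3://my-bucket/raw/events/"))

# ...and append them to a Bronze Delta table.
(raw_stream.writeStream
    .format("delta")
    .option("checkpointLocation", "/delta/_checkpoints/bronze_events")
    .start("/delta/bronze/events"))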
Databricks Ingest: Data Ingestion Network
All Your Application, Database, and File Storage Data in your Delta Lake
Delta Lake Connectors
Standardize your big data storage with an open format accessible from various tools

[Connector logos, including Amazon Redshift and Amazon Athena]
Databricks Notebooks are a powerful developer environment
to enable Dashboards, Workflows, and Jobs.

Databricks Notebooks
Collaborative, multi-language, enterprise-ready developer environment.

Notebook Dashboards
Create interactive dashboards based on Notebooks with just a few clicks.

Notebook Workflows
Orchestrate multi-stage pipelines based on notebook dependencies.

Notebook Jobs
Schedule, orchestrate, and monitor execution of notebooks.
Notebook Workflows enable orchestration of
multi-stage pipelines based on notebook dependencies.

Workflow Definition
APIs allow flexible definition of notebook
dependencies, including parallel
execution of notebooks.


Workflow Execution
The Databricks Job scheduler
manages the execution of
Notebook workflows.
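As a hedged illustration, a driver notebook typically expresses these dependencies with dbutils.notebook.run (available inside Databricks notebooks); the notebook paths, timeout, and parameters below are assumptions:

# Run downstream notebooks in sequence: each call blocks until the child notebook finishes
# (or the 3600-second timeout expires) and returns its exit value.
bronze_result = dbutils.notebook.run("/pipelines/ingest_bronze", 3600, {"date": "2020-06-01"})
silver_result = dbutils.notebook.run("/pipelines/refine_silver", 3600, {"date": "2020-06-01"})

# Independent notebooks can run in parallel with ordinary Python concurrency.
from concurrent.futures import ThreadPoolExecutor

with ThreadPoolExecutor(max_workers=2) as pool:
    gold_jobs = [pool.submit(dbutils.notebook.run, path, 3600, {"date": "2020-06-01"})
                 for path in ["/pipelines/gold_sales", "/pipelines/gold_fraud"]]
    results = [job.result() for job in gold_jobs]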
Notebook Jobs enable scheduling and monitoring
of Notebooks.

Turn Notebooks into Jobs


Any Databricks Notebook or Notebook
Workflow can easily be turned into a job.

Schedule Jobs
Jobs can be configured to execute on a
schedule (e.g. daily, hourly).

Orchestrate Jobs
In addition to Notebook Workflows, Jobs
can also be orchestrated using third-party
tools like Airflow.
Users of Delta Lake

[Customer logos with reported outcomes:]

• Improved reliability: petabyte-scale jobs
• 10x lower compute: 640 instances down to 64!
• Faster iterations: multiple weeks to 5-minute deploys!
• Improved performance: queries that took >1 hr now run in <6 sec
• Easier transactional updates: no downtime or consistency issues!
• Simple CDC: easy with MERGE
• Data consistency and integrity: not available before
• Increased data quality: name-match accuracy up from 80% to 95%
• Faster data loads: 24 hours → 20 mins

Deployment highlights: ● 2 regions ● 10 workspaces ● 100+ users ● 50+ scheduled jobs
● 1,000+ notebooks ● scores of ML models
Use cases: Finance, Assortment, Fresh Sales Tool, Fraud Engine, Personalization
Demo
[Live demo on a Databricks cluster]
Q&A
Your feedback is appreciated

The first 20 responses will receive a $10 UberEats voucher.

TWO LIVE SESSIONS:
Australia & NZ: 9.00am - 11.30am AEST
Singapore & India: 1.00pm - 3.30pm SGT | 10.30am - 1.00pm IST

Hear the data innovation stories of Atlassian, Coles Group, Grab,
Tabcorp, and more.
Register now: https://databricks.com/p/event/data-ai-tour
June 22-26 | Organized by Databricks

THE FREE VIRTUAL EVENT FOR DATA TEAMS

● Extended to 5 days with over 200 sessions
● 4x the pre-conference training
● Keynotes by visionaries and thought leaders
https://databricks.com/sparkaisummit/north-america-2020
