
Welcome!

Introduction to
Data Engineering
With Delta
Your Hosts for Today

Joel Roland, SA Manager, ANZ
Noble Raveendran, Solutions Architect
Agenda

▪ Overview of Data Engineering powered by Delta Lake


▪ Demo
▪ Q&A
Housekeeping

▪ If you have questions during the session, please post them in the Chat
Window

▪ We will run a number of polls during the event - they will pop up, so
please respond when they do
The Data Engineer’s Journey...

[Diagram: streaming events flow through validation into a unified view feeding AI & reporting,
while batch tables (data written continuously, compacted every hour) are folded in via update &
merge. Pain points: updates & merges get complex with a data lake, and reprocessing is hard.]
The Data Engineer’s Journey...

[Same diagram as the previous slide, overlaid with the question: "Can this be simplified?"]
A Data Engineer’s Dream...

Process data continuously and incrementally as new data arrives, in a
cost-efficient way, without having to choose between batch and streaming

[Diagram: Kinesis streams and CSV, JSON, TXT… files land in a data lake feeding AI & reporting]
What’s missing?
[Diagram: Kinesis and CSV, JSON, TXT… sources land in a data lake; a "?" sits between the lake and AI & reporting]

• Ability to read consistent data while data is being written

• Ability to read incrementally from a large table with good throughput

• Ability to rollback in case of bad writes

• Ability to replay historical data alongside new data that arrives

• Ability to handle late arriving data without having to delay downstream processing
So… What is the answer?

Delta Lake + Structured Streaming = Delta Architecture

1. Unify batch & streaming with a continuous data flow model
2. Infinite retention to replay/reprocess historical events as needed
3. Independent, elastic compute and storage to scale while balancing costs
Let’s try it instead with Delta Lake
Well… what is Delta Lake?
Data reliability challenges with data lakes
✗ Failed production jobs leave data in a corrupt state, requiring tedious recovery

✗ Lack of schema enforcement creates inconsistent and low-quality data

✗ Lack of consistency makes it almost impossible to mix appends and reads, batch and streaming
Performance challenges with data lakes
• Too many small or very big files - more time is spent opening &
closing files than reading their contents (worse with streaming).

• Partitioning, aka "poor man's indexing", breaks down if you pick the
wrong fields or when data has many dimensions or high-cardinality columns.

• No caching - cloud object storage throughput is low
(20-50 MB/s/core vs 300 MB/s/core for local SSDs).
A New Standard for Building Data Lakes

Open Format Based on Parquet

With Transactions

Apache Spark APIs


Delta Lake: makes data ready for analytics

Reliability + Performance enable Data Science & ML use cases:

• Recommendation Engines
• Risk, Fraud Detection
• IoT & Predictive Maintenance
• Genomics & DNA Sequencing
Delta Lake ensures data reliability

[Diagram: batch, streaming, and update/delete workloads all write to Parquet files plus a
transactional log, producing high-quality, reliable data that is always ready for analytics]

Key Features:
● ACID Transactions
● Schema Enforcement
● Unified Batch & Streaming
● Time Travel/Data Snapshots
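These features are easiest to see in code. Below is a minimal PySpark sketch (not from the deck) in which a batch job and a streaming job append to the same hypothetical /delta/events table, with schema enforcement rejecting a bad write; all paths and column names are assumptions.

# Minimal sketch: one Delta table written by both batch and streaming, protected by the
# same transaction log. Paths and columns are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("delta-reliability-sketch").getOrCreate()

# Batch write: creates the table and fixes its schema (id: long, eventType: string).
batch_df = spark.createDataFrame([(1, "click"), (2, "view")], ["id", "eventType"])
batch_df.write.format("delta").mode("append").save("/delta/events")

# Schema enforcement: an append with a mismatched schema is rejected instead of
# silently corrupting the table.
bad_df = spark.createDataFrame([("oops",)], ["unexpected_column"])
try:
    bad_df.write.format("delta").mode("append").save("/delta/events")
except Exception as error:
    print("Rejected by schema enforcement:", error)

# Unified batch & streaming: a streaming job appends to the very same table, and each
# micro-batch is committed as an ACID transaction.
stream_df = (spark.readStream.format("rate").load()
             .selectExpr("value AS id", "'generated' AS eventType"))
(stream_df.writeStream
    .format("delta")
    .option("checkpointLocation", "/delta/_checkpoints/events")
    .start("/delta/events"))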
Delta Lake optimizes performance
[Diagram: the Databricks optimized engine runs highly performant queries at scale over the
same Parquet files and transactional log]

Key Features:
● Indexing
● Compaction
● Data skipping
● Caching
Now, let’s try it with Delta Lake

*Data Quality Levels*

[Diagram: Kinesis and CSV, JSON, TXT… sources in the data lake flow through Bronze (raw
ingestion) → Silver (filtered, cleaned, augmented) → Gold (business-level aggregates) tables,
feeding streaming analytics and AI & reporting. Quality increases from left to right.]

Delta Lake allows you to incrementally improve the quality of your data
until it is ready for consumption.
What does this remind you of?

Data Lifecycle

[Same Bronze → Silver → Gold diagram as above]
Data Lifecycle

[The diagram builds up the traditional analogy step by step: Bronze ↔ OLTP,
Silver ↔ Staging, Gold ↔ DW/OLAP]
Data Lifecycle 🡪 Delta Lake Lifecycle
*Data Quality Levels*

[Same Bronze → Silver → Gold diagram]

Delta Lake allows you to incrementally improve the quality of your data
until it is ready for consumption.
OLTP

[Same diagram, highlighting the Bronze (raw ingestion) tables]
• Dumping ground for raw data
• Often with long retention (years)
• Avoid error-prone parsing
Staging

[Same diagram, highlighting the Silver (filtered, cleaned, augmented) tables]
Intermediate data with some cleanup applied
Query-able for easy debugging!
DW/OLAP

[Same diagram, highlighting the Gold (business-level aggregates) tables]
Clean data, ready for consumption
Read with Spark and other SQL Query Engines
[Same diagram, highlighting the streams that connect Bronze → Silver → Gold]
Streams move data through the Delta Lake
• Low-latency or manually triggered
• Eliminates management of schedules and jobs
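As a hedged sketch of such a stream (not from the deck), the job below moves records from a hypothetical Bronze Delta table to a Silver one with some cleanup applied; the table paths, columns, and cleanup rules are assumptions.

# Assumes an active SparkSession named spark (as in the deck's own snippets) and
# hypothetical /delta/bronze/events and /delta/silver/events tables.
from pyspark.sql.functions import col, to_timestamp

# Read the Bronze Delta table as a stream: newly committed data becomes new micro-batches.
bronze = spark.readStream.format("delta").load("/delta/bronze/events")

# Light cleanup/augmentation on the way to Silver.
silver = (bronze
          .where(col("eventType").isNotNull())                  # drop obviously bad rows
          .withColumn("eventTime", to_timestamp("eventTime")))  # normalise types

# Write continuously into the Silver Delta table.
(silver.writeStream
    .format("delta")
    .option("checkpointLocation", "/delta/_checkpoints/silver_events")
    .outputMode("append")
    .start("/delta/silver/events"))

# For the "manually triggered" variant, add .trigger(once=True) before .start() and run the
# same code on a schedule; the checkpoint ensures only new data is processed each run.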
[Same diagram, annotated with the DML operations INSERT, UPDATE, DELETE, MERGE, and
OVERWRITE applied at different stages]
Delta Lake also supports batch jobs
and standard DML
• Retention
• Corrections
• GDPR
• UPSERTS
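As a hedged illustration (table names are assumptions, and spark is an active SparkSession), standard DML against Delta tables looks like ordinary SQL:

# Retention: age out old raw data from the Bronze table.
spark.sql("DELETE FROM bronze_events WHERE eventDate < add_months(current_date(), -24)")

# GDPR: remove a specific customer's records on request.
spark.sql("DELETE FROM silver_events WHERE customerId = 'C-123'")

# Corrections: fix previously loaded values in place.
spark.sql("UPDATE silver_events SET country = 'Australia' WHERE country = 'AUS'")

# UPSERTS are handled with MERGE INTO, shown later in the deck.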
[Same diagram, with DELETE applied to the Silver and Gold tables]

Easy to recompute when business logic changes:


• Clear tables
• Restart streams
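A minimal sketch of that recompute pattern (not from the deck; the table names, paths, and the new logic are hypothetical, and spark is an active SparkSession):

from pyspark.sql.functions import col

def apply_new_business_logic(df):
    # Hypothetical stand-in for the revised cleanup/augmentation rules.
    return df.where(col("eventType").isNotNull())

# 1. Clear the derived tables.
spark.sql("DELETE FROM silver_events")
spark.sql("DELETE FROM gold_daily_aggregates")

# 2. Restart the stream with a fresh checkpoint: reading the Bronze Delta table as a stream
#    replays its full history through the new logic.
bronze = spark.readStream.format("delta").load("/delta/bronze/events")
(apply_new_business_logic(bronze)
    .writeStream
    .format("delta")
    .option("checkpointLocation", "/delta/_checkpoints/silver_events_v2")  # new checkpoint
    .outputMode("append")
    .start("/delta/silver/events"))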
Delta Architecture
Connecting the dots...

[Diagram: Kinesis and CSV, JSON, TXT… sources land in the data lake; Delta Lake fills in the
"?" between the lake and AI & reporting]

• Ability to read consistent data while data is being written
  → Snapshot isolation between writers and readers

• Ability to read incrementally from a large table with good throughput
  → Optimized file source with scalable metadata handling

• Ability to rollback in case of bad writes
  → Time travel

• Ability to replay historical data alongside new data that arrives
  → Stream the backfilled historical data through the same pipeline

• Ability to handle late-arriving data without having to delay downstream processing
  → Stream late-arriving data as it is added to the table
Get Started with Delta using Spark APIs
Instead of parquet...

CREATE TABLE ...
USING parquet
...

dataframe
  .write
  .format("parquet")
  .save("/data")

…simply say delta

CREATE TABLE ...
USING delta
...

dataframe
  .write
  .format("delta")
  .save("/data")
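Reading the table back is just as symmetric; a small hedged sketch using the same hypothetical /data path and a made-up table name:

# Load the Delta table as a DataFrame.
df = spark.read.format("delta").load("/data")
df.show()

# Or register it once and query it with SQL.
spark.sql("CREATE TABLE IF NOT EXISTS events USING delta LOCATION '/data'")
spark.sql("SELECT count(*) FROM events").show()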
Using Delta with your Existing Parquet Tables
Step 1: Convert Parquet to Delta Tables

CONVERT TO DELTA parquet.`path/to/table` [NO STATISTICS]


[PARTITIONED BY (col_name1 col_type1, col_name2 col_type2, ...)]

Step 2: Optimize Layout for Fast Queries


OPTIMIZE events
WHERE date >= current_timestamp() - INTERVAL 1 day
ZORDER BY (eventType)
Upsert/Merge: Fine-grained Updates
MERGE INTO customers -- Delta table
USING updates
ON customers.customerId = updates.customerId
WHEN MATCHED THEN
UPDATE SET address = updates.address
WHEN NOT MATCHED THEN
INSERT (customerId, address) VALUES (updates.customerId, updates.address)
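For reference, a hedged sketch of the same upsert through the Delta Lake Python API (delta-spark); the table and source names mirror the SQL above and are assumptions:

from delta.tables import DeltaTable

customers = DeltaTable.forName(spark, "customers")   # the Delta table being upserted into
updates_df = spark.table("updates")                  # hypothetical source of changed rows

(customers.alias("customers")
    .merge(updates_df.alias("updates"),
           "customers.customerId = updates.customerId")
    .whenMatchedUpdate(set={"address": "updates.address"})
    .whenNotMatchedInsert(values={
        "customerId": "updates.customerId",
        "address": "updates.address"})
    .execute())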
Time Travel
Reproduce experiments & reports:

SELECT count(*) FROM events TIMESTAMP AS OF timestamp

SELECT count(*) FROM events VERSION AS OF version

spark.read.format("delta").option("timestampAsOf", timestamp_string).load("/events/")

Rollback accidental bad writes:

INSERT INTO my_table
SELECT * FROM my_table TIMESTAMP AS OF date_sub(current_date(), 1)
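The DataFrame reader accepts a version number as well; a small hedged sketch (the path and version are assumptions):

# Read the table exactly as it was at version 42 of its transaction log.
df_v42 = spark.read.format("delta").option("versionAsOf", 42).load("/events/")

# Compare against the current state, e.g. to audit an accidental bad write.
print(df_v42.count(), spark.read.format("delta").load("/events/").count())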
Databricks Ingest: Auto Loader
Load new data easily and efficiently as it arrives in cloud storage

Before:
● Notification service + message queue feeding a stream
● Batch jobs on a delayed schedule
● External triggers / Airflow file sensors
● Gets too complicated for multiple jobs

After (Auto Loader):
● Pipe data from cloud storage into Delta Lake as it arrives
● "Set and forget" model eliminates complex setup

Launch Blog Post
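A hedged sketch of an Auto Loader stream (the cloudFiles source, available on Databricks); the bucket path, file format, and schema are assumptions:

from pyspark.sql.types import StructType, StructField, StringType, TimestampType

event_schema = StructType([
    StructField("id", StringType()),
    StructField("eventType", StringType()),
    StructField("eventTime", TimestampType()),
])

# Incrementally pick up new files as they land in cloud storage...
raw_stream = (spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")   # format of the incoming files
    .schema(event_schema)
    .load("s3://my-bucket/raw/events/"))

# ...and append them to a Bronze Delta table.
(raw_stream.writeStream
    .format("delta")
    .option("checkpointLocation", "/delta/_checkpoints/bronze_events")
    .start("/delta/bronze/events"))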
Databricks Ingest: Data Ingestion Network
All Your Application, Database, and File Storage Data in your Delta Lake
Delta Lake Connectors
Standardize your big data storage with an open format accessible from various tools

[Connector logos, including Amazon Redshift and Amazon Athena]
Databricks Notebooks are a powerful developer environment
to enable Dashboards, Workflows, and Jobs.

Databricks Notebooks
Collaborative, multi-language, enterprise-ready developer environment.

Notebook Dashboards
Create interactive dashboards based on Notebooks with just a few clicks.

Notebook Workflows
Orchestrate multi-stage pipelines based on notebook dependencies.

Notebook Jobs
Schedule, orchestrate, and monitor execution of notebooks.
Notebook Workflows enable orchestration of
multi-stage pipelines based on notebook dependencies.

Workflow Definition
APIs allow flexible definition of notebook
dependencies, including parallel
execution of notebooks.


Workflow Execution
The Databricks Job scheduler
manages the execution of
Notebook workflows.
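As a hedged illustration, a driver notebook typically expresses these dependencies with dbutils.notebook.run (available inside Databricks notebooks); the notebook paths, timeout, and parameters below are assumptions:

# Run downstream notebooks in sequence: each call blocks until the child notebook finishes
# (or the 3600-second timeout expires) and returns its exit value.
bronze_result = dbutils.notebook.run("/pipelines/ingest_bronze", 3600, {"date": "2020-06-01"})
silver_result = dbutils.notebook.run("/pipelines/refine_silver", 3600, {"date": "2020-06-01"})

# Independent notebooks can run in parallel with ordinary Python concurrency.
from concurrent.futures import ThreadPoolExecutor

with ThreadPoolExecutor(max_workers=2) as pool:
    gold_jobs = [pool.submit(dbutils.notebook.run, path, 3600, {"date": "2020-06-01"})
                 for path in ["/pipelines/gold_sales", "/pipelines/gold_fraud"]]
    results = [job.result() for job in gold_jobs]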
Notebook Jobs enable scheduling and monitoring
of Notebooks.

Turn Notebooks into Jobs


Any Databricks Notebook or Notebook
Workflow can easily be turned into a job.

Schedule Jobs
Jobs can be configured to execute on a
schedule (e.g. daily, hourly).

Orchestrate Jobs
In addition to Notebook Workflows, Jobs
can also be orchestrated using third-party
tools like Airflow.
Users of Delta Lake

[Customer logos with reported outcomes:]

• Improved reliability: petabyte-scale jobs
• 10x lower compute: 640 instances down to 64!
• Faster iterations: multiple weeks to 5-minute deploys!
• Improved performance: queries that took >1 hr now run in <6 sec
• Easier transactional updates: no downtime or consistency issues!
• Simple CDC: easy with MERGE
• Data consistency and integrity: not available before
• Increased data quality: name-match accuracy up from 80% to 95%
• Faster data loads: 24 hours → 20 mins

Deployment highlights: ● 2 regions ● 10 workspaces ● 100+ users ● 50+ scheduled jobs
● 1,000+ notebooks ● scores of ML models
Use cases: Finance, Assortment, Fresh Sales Tool, Fraud Engine, Personalization
Demo
[Live demo on a Databricks cluster]
Q&A
Your feedback is appreciated

The first 20 responses will receive a $10 UberEats voucher.

TWO LIVE SESSIONS:
Australia & NZ: 9.00am - 11.30am AEST
Singapore & India: 1.00pm - 3.30pm SGT | 10.30am - 1.00pm IST

Hear the data innovation stories of Atlassian, Coles Group, Grab,
Tabcorp, and more.
Register now: https://databricks.com/p/event/data-ai-tour
June 22-26 | Organized by Databricks

THE FREE VIRTUAL EVENT FOR DATA TEAMS

● Extended to 5 days with over 200 sessions
● 4x the pre-conference training
● Keynotes by visionaries and thought leaders
https://databricks.com/sparkaisummit/north-america-2020
