
Databricks for the

SQL Developer

Gerhard Brueckl
Our Partners (Bronze / Silver / Gold / Platinum sponsors)
About me

@gbrueckl

blog.gbrueckl.at www.paiqo.com

gerhard@gbrueckl.at

https://github.com/gbrueckl
Agenda

What is Databricks / Spark?

Why use Databricks for SQL workloads?

SQL with Databricks / Spark

Delta Lake

Advanced SQL techniques


What is Databricks?

Company that provides a Big Data processing solution in the cloud using Apache Spark
Founded in 2013
Creators of Apache® Spark™
Offers: Databricks on AWS, Azure Databricks
NO on-prem solution!
What is Apache Spark?

Open-source cluster computing framework
Runs on YARN, Mesos, …
Built for: Speed, Ease-of-Use, Extensibility
Support for multiple languages: Java, Scala, Python, R, SQL
Project of the Apache Foundation
Largest open-source data project
Apache Spark APIs

Spark unifies:
• Batch Processing
• Interactive SQL
• Real-time processing
• Machine Learning
• Deep Learning
• Graph Processing
How does it work?

The ‘Driver’ runs the ‘main’ function and executes and coordinates the various parallel operations on the worker nodes
The worker nodes read and write data from/to data sources, including HDFS
Worker nodes also cache transformed data in memory as RDDs (Resilient Distributed Datasets)
The results of the operations are collected by the driver and returned to the client
Worker nodes and the driver node run as VMs in public clouds (AWS or Azure)
Azure Databricks
A fast, easy and collaborative Apache® Spark™ based analytics platform optimized for Azure

Best of Databricks + Best of Microsoft:
• Designed in collaboration with the founders of Apache Spark
• One-click set-up; streamlined workflows
• Interactive workspace that enables collaboration between data scientists, data engineers, and business analysts
• Native integration with Azure services (Power BI, SQL DW, Cosmos DB, Blob Storage)
• Enterprise-grade Azure security (Active Directory integration, compliance, enterprise-grade SLAs)
Azure Databricks – architecture (diagram)
• Collaborative workspace shared by data engineers, data scientists, and business analysts
• Deploy production jobs & workflows: job scheduler, multi-stage pipelines, notifications & logs
• Optimized Databricks Runtime engine: Databricks I/O, Apache Spark, Serverless
• Inputs: IoT / streaming data, cloud storage, Hadoop storage
• Outputs: machine learning models, BI tools, data warehouses, data exports, REST APIs
Enhance productivity – build on a secure & trusted cloud – scale without limits
Why use Databricks for SQL workloads?

• Cloud-only solution
• Works with structured and unstructured data
• Open standard: Apache Spark
• Single tool for all workloads
• Extensible
• Scalability
• Native SQL integration
• Native Azure integration

Batch processing only – no OLTP!!!
Spark SQL Fundamentals

• Tables are just references and metadata
  • Location
  • Column definitions
  • Partitions
  • (similar to external tables in PolyBase)
• Files must match the schema!
• No indexes
• (No Stored Procedures)
• (No Statistics)
Supported SQL Features

ANSI SQL
• Joins
• Groupings/Aggregations
• Rollup/Cube/Grouping Sets
• Subselects
• Window Functions
• …
• SELECT and INSERT only!
• Views
• Temporary tables
• Transformations/Functions

Constantly evolving! (example below)
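A minimal sketch of some of these features, assuming a hypothetical FactSales table with Customer, Product, and Sales columns:

-- Rollup aggregation
SELECT Customer, Product, SUM(Sales) AS TotalSales
FROM FactSales
GROUP BY ROLLUP (Customer, Product)

-- Window function
SELECT Customer, Product, Sales,
       SUM(Sales) OVER (PARTITION BY Customer) AS CustomerTotal
FROM FactSales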
SQL SELECT and INSERT

SELECT
• Tables or Views
• on files directly!

INSERT
• Creates a new file in the folder

CREATE TABLE AS SELECT
• Persists the result of a SQL query (examples below)
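A minimal sketch of the three patterns; the file path and table names are illustrative:

-- SELECT directly on Parquet files at a mount path
SELECT * FROM parquet.`/mnt/demo/sales/`

-- INSERT writes new files into the table's folder
INSERT INTO SalesArchive
SELECT * FROM Sales WHERE Year = 2019

-- CREATE TABLE AS SELECT persists the result of a query
CREATE TABLE SalesByCustomer
AS SELECT Customer, SUM(Sales) AS TotalSales
   FROM Sales
   GROUP BY Customer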
Processing of a SQL Query
1. Client submits a SQL query
2. Databricks queries the metadata catalog (Meta Data Store)
   • Checks syntax
   • Checks columns
   • Returns storage locations
3. Databricks queries the storage services (cloud storage) for the raw data
4. Data is loaded into the memory of the nodes
5. Data is processed on the nodes using Spark SQL

"Result":
Data is written directly to the storage services
OR
Data is collected on the driver and returned to the client
Databricks Meta Data Store

= Apache Hive Metastore
Metadata of all SQL objects: Databases, Tables, Columns, …
Managed by Databricks
OR
Hosted externally (MSSQL / Azure SQL, MySQL / Azure MySQL)
Can be shared!
Types of Tables

Managed
• Stored inside Databricks (Azure Blob Storage)
• Filesystem not accessible from outside
• DROP TABLE also deletes the files!

Unmanaged
• Usually stored externally (Azure Blob Storage / Azure Data Lake Store / …)
• Can be shared with other services (sketch below)
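A minimal sketch of the two table types; column definitions and the mount path are illustrative:

-- Managed table: Databricks controls the storage location; DROP TABLE removes the files
CREATE TABLE DimProductManaged (ProductKey INT, ProductName STRING)

-- Unmanaged (external) table: data stays at a location you control and survives DROP TABLE
CREATE TABLE DimProductExternal (ProductKey INT, ProductName STRING)
USING PARQUET
LOCATION '/mnt/adls/tables/DimProductExternal'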
DEMO
Delta Lake – delta.io

Delta Lake is an open-source storage layer that brings ACID transactions to Apache Spark™ and big data workloads.

• ACID-compliant transactions
• Optimistic Concurrency Control
• Support for UPDATE / MERGE
• Time travel (query example below)
• Schema enforcement and evolution
• Across multiple files/folders
• Batch & streaming
• 100% compatible with Apache Spark
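Time travel can be queried directly in SQL; the version number and timestamp below are illustrative:

-- Query an older version of a Delta table by version number
SELECT * FROM DimProductDelta VERSION AS OF 1

-- Or by timestamp
SELECT * FROM DimProductDelta TIMESTAMP AS OF '2020-01-01'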


Delta Lake – CREATE TABLE

CREATE TABLE IF NOT EXISTS DimProductDelta
USING DELTA
PARTITIONED BY (ProductSubcategoryKey)
-- CLUSTERED BY (ProductKey) INTO 4 BUCKETS
LOCATION '/mnt/adls/tables/DimProductDelta'
TBLPROPERTIES ('myKey' = 'myValue')

Avoid defining columns explicitly – the schema is handled by the transaction log!
Clustering is not supported!


Delta Lake – UPDATE/DELETE/MERGE

Always results in new files! Even a DELETE!
Old files are invalidated via _delta_log
Operations are logged in _delta_log
Conflicts have to be handled by the user!
Can create A LOT of files!


Delta Lake – UPDATE

UPDATE DimProduct
SET Price = 1300
WHERE Product = 'PC'

Before (part-01.parquet, 3 rows):      After (part-02.parquet, 3 rows):
Product    Price                       Product    Price
Notebook   900 €                       Notebook   900 €
PC         1,500 €                     PC         1,300 €
Tablet     500 €                       Tablet     500 €

_delta_log:
000000000.json   "add":    { "path": "part-01.parquet", ... }
000000001.json   "remove": { "path": "part-01.parquet", ... },
                 "add":    { "path": "part-02.parquet", ... }
Delta Lake – DELETE

DELETE FROM DimProduct
WHERE Product = 'PC'

Before (part-01.parquet, 3 rows):      After (part-02.parquet, 2 rows):
Product    Price                       Product    Price
Notebook   900 €                       Notebook   900 €
PC         1,500 €                     Tablet     500 €
Tablet     500 €

_delta_log:
000000001.json   "add":    { "path": "part-01.parquet", ... }
000000002.json   "remove": { "path": "part-01.parquet", ... },
                 "add":    { "path": "part-02.parquet", ... }
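MERGE follows the same pattern – new files are written and the old ones are invalidated in _delta_log. A minimal sketch, assuming a hypothetical staging table StagingProduct with matching columns:

-- Upsert staged rows into the Delta table
MERGE INTO DimProduct AS target
USING StagingProduct AS source
ON target.Product = source.Product
WHEN MATCHED THEN UPDATE SET target.Price = source.Price
WHEN NOT MATCHED THEN INSERT (Product, Price) VALUES (source.Product, source.Price)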
Delta Lake – _delta_log
Delta Lake – _delta_log – Schema / Stats
Delta Lake – _delta_log – UPDATE
Delta Lake – Optimization

Manually with the OPTIMIZE command
Collapses many small files into few big files
Optimizes the current/latest version only!
Bin-packing / Z-ordering

OPTIMIZE events
WHERE date = 20200101
ZORDER BY (eventType)
Delta Lake – Clean-Up

Manually with the VACUUM command
Automatically with every INSERT/UPDATE/MERGE
Default retention period is 7 days
Can be changed via TBLPROPERTIES

VACUUM events
[RETAIN num HOURS] [DRY RUN]
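A concrete invocation, with an illustrative retention window (DRY RUN only lists the files that would be removed):

-- Preview which files would be removed when keeping 240 hours of history
VACUUM events RETAIN 240 HOURS DRY RUN

-- Actually remove them
VACUUM events RETAIN 240 HOURS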
Delta Lake – Table Properties

Clean-up settings:
ALTER TABLE DimProductDelta SET TBLPROPERTIES ('delta.deletedFileRetentionDuration' = '240 HOURS');
ALTER TABLE DimProductDelta SET TBLPROPERTIES ('delta.logRetentionDuration' = '240 HOURS');

Blocks deletes and modifications of a table:
'delta.appendOnly' = 'true'

Configures the number of columns for which statistics are collected:
'delta.dataSkippingNumIndexedCols' = '5'

Properties can also be set when the table is created:
CREATE TABLE DimProductDelta
USING DELTA
TBLPROPERTIES ('key1' = 'val1', 'key2' = 'val2', ...)
Delta Lake – New Meta Tables / DMVs

Get information about schema, partitioning, table size, …
DESCRIBE DETAIL myDeltaTable;

Get provenance information (operation, user, and so on) for each write to a table
DESCRIBE HISTORY myDeltaTable;
DEMO
Advanced SQL – Extensions

User Defined Functions
• Python or Scala

User Defined Aggregates
• Scala only

Session-level only!
Can be used globally when packaged as a JAR


DEMO
Advanced SQL – Sampling
Sampling

SELECT * FROM myTable TABLESAMPLE (10 ROWS)
SELECT * FROM myTable TABLESAMPLE (5 PERCENT)

Returns a representative sample
Useful for large tables


Advanced SQL – Security
• Cluster setup
• SQL permissions / table-level security
  • Requires Databricks Premium SKU
  • Set on cluster level – you need to control access to the cluster
• Privileges: SELECT, CREATE, MODIFY, READ_METADATA, CREATE_NAMED_FUNCTION, ALL PRIVILEGES (example below)
• Objects: CATALOG, DATABASE, TABLE, VIEW, FUNCTION, ANONYMOUS FUNCTION, ANY FILE
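A minimal sketch of granting and revoking privileges, assuming table access control is enabled on the cluster; the group name is illustrative:

-- Allow a group to read the table and its metadata
GRANT SELECT ON TABLE DimProductDelta TO `data-analysts`
GRANT READ_METADATA ON TABLE DimProductDelta TO `data-analysts`

-- Take a privilege away again
REVOKE SELECT ON TABLE DimProductDelta FROM `data-analysts`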
Advanced SQL – External Tables

Connect to any JDBC source
Exposed as a regular SQL table

CREATE TABLE myJdbcTable
USING org.apache.spark.sql.jdbc
OPTIONS (
  url "jdbc:<databaseServerType>://<jdbcHostname>:<jdbcPort>",
  dbtable "<jdbcDatabase>.myTable",
  user "<jdbcUsername>",
  password "<jdbcPassword>"
)
Distributed Processing – Round Robin

Source rows:
Row  Customer  Product  Sales
1    John      PC       100 €
2    Peter     PC       200 €
3    Karl      Printer  50 €
4    Mark      Phone    70 €
5    Karl      Printer  60 €
6    John      Scanner  150 €
7    John      Printer  80 €

Rows are distributed round robin, so rows of the same customer can land on different workers:
Worker 1 (rows 1, 3, 5, 7)  →  partial result: John 180 €, Karl 110 €
Worker 2 (rows 2, 4, 6)     →  partial result: Peter 200 €, Mark 70 €, John 150 €

Head node has to re-aggregate the partial results:
John 230 €, Karl 110 €, Peter 200 €, Mark 70 €
Distributed Processing – By Column

The same rows distributed by the Customer column – all rows of a customer land on the same worker:
Worker 1 (rows 1, 3, 5, 6, 7)  →  John 230 €, Karl 110 €
Worker 2 (rows 2, 4)           →  Peter 200 €, Mark 70 €

The partial results are already final; the head node only concatenates them:
John 230 €, Karl 110 €, Peter 200 €, Mark 70 €
Distributed Processing – Non-Additive

For a non-additive measure such as a distinct count of products, partial results cannot simply be summed:
Worker 1 (rows 1, 3, 5, 6, 7)  →  distinct products: PC, Printer, Scanner (3)
Worker 2 (rows 2, 4)           →  distinct products: PC, Phone (2)

The head node must combine the distinct value lists, not the counts:
PC, Printer, Scanner, Phone  →  4 distinct products (not 3 + 2 = 5)