
Databricks for the

SQL Developer

Gerhard Brueckl
Our Partners (Bronze / Silver / Gold / Platinum sponsors)
About me

@gbrueckl

blog.gbrueckl.at www.paiqo.com

gerhard@gbrueckl.at

https://github.com/gbrueckl
Agenda

What is Databricks / Spark?

Why use Databricks for SQL workloads?

SQL with Databricks / Spark

Delta Lake

Advanced SQL techniques


What is Databricks?

Company that provides a Big Data processing solution in the cloud using Apache Spark
Founded in 2013
Creators of Apache® Spark™
Offers: Databricks on AWS, Azure Databricks
NO on-prem solution!
What is Apache Spark?

Open-source cluster computing framework
Runs on YARN, Mesos, …
Built for: Speed, Ease-of-Use, Extensibility
Support for multiple languages: Java, Scala, Python, R, SQL
Project of the Apache Foundation
Largest open-source data project
Apache Spark APIs

Spark unifies:
• Batch Processing
• Interactive SQL
• Real-time processing
• Machine Learning
• Deep Learning
• Graph Processing
How does it work?

The ‘Driver’ runs the ‘main’ function and executes and coordinates the various parallel operations on the worker nodes
The worker nodes read and write data from/to data sources, including HDFS
Worker nodes also cache transformed data in memory as RDDs (Resilient Distributed Datasets)
The results of the operations are collected by the driver and returned to the client
Worker nodes and the driver node run as VMs in public clouds (AWS or Azure)
Azure Databricks
A fast, easy and collaborative Apache® Spark™ based analytics platform optimized for Azure

Best of Databricks + Best of Microsoft:
• Designed in collaboration with the founders of Apache Spark
• One-click set-up; streamlined workflows
• Interactive workspace that enables collaboration between data scientists, data engineers, and business analysts
• Native integration with Azure services (Power BI, SQL DW, Cosmos DB, Blob Storage)
• Enterprise-grade Azure security (Active Directory integration, compliance, enterprise-grade SLAs)
Azure Databricks – architecture (diagram)
• Collaborative workspace shared by data engineers, data scientists, and business analysts
• Deploy production jobs & workflows: job scheduler, multi-stage pipelines, notifications & logs
• Optimized Databricks Runtime engine: Databricks I/O, Apache Spark, Serverless
• Inputs: IoT / streaming data, cloud storage, Hadoop storage
• Outputs: machine learning models, BI tools, data warehouses, data exports, REST APIs
Enhance productivity – build on a secure & trusted cloud – scale without limits
Why use Databricks for SQL workloads?

• Cloud-only solution
• Works with structured and unstructured data
• Open standard: Apache Spark
• Single tool for all workloads
• Extensible
• Scalability
• Native SQL integration
• Native Azure integration

Batch processing only – no OLTP!!!
Spark SQL Fundamentals

• Tables are just references and metadata
  • Location
  • Column definitions
  • Partitions
  • (similar to external tables in PolyBase)
• Files must match the schema!
• No indexes
• (No Stored Procedures)
• (No Statistics)
Supported SQL Features

ANSI SQL
• Joins
• Groupings/Aggregations
• Rollup/Cube/Grouping Sets
• Subselects
• Window Functions
• …
• SELECT and INSERT only!
• Views
• Temporary tables
• Transformations/Functions

Constantly evolving! (example below)
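A minimal sketch of some of these features, assuming a hypothetical FactSales table with Customer, Product, and Sales columns:

-- Rollup aggregation
SELECT Customer, Product, SUM(Sales) AS TotalSales
FROM FactSales
GROUP BY ROLLUP (Customer, Product)

-- Window function
SELECT Customer, Product, Sales,
       SUM(Sales) OVER (PARTITION BY Customer) AS CustomerTotal
FROM FactSales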
SQL SELECT and INSERT

SELECT
• Tables or Views
• on files directly!

INSERT
• Creates a new file in the folder

CREATE TABLE AS SELECT
• Persists the result of a SQL query (examples below)
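A minimal sketch of the three patterns; the file path and table names are illustrative:

-- SELECT directly on Parquet files at a mount path
SELECT * FROM parquet.`/mnt/demo/sales/`

-- INSERT writes new files into the table's folder
INSERT INTO SalesArchive
SELECT * FROM Sales WHERE Year = 2019

-- CREATE TABLE AS SELECT persists the result of a query
CREATE TABLE SalesByCustomer
AS SELECT Customer, SUM(Sales) AS TotalSales
   FROM Sales
   GROUP BY Customer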
Processing of a SQL Query
1. Client submits a SQL query
2. Databricks queries the metadata catalog (Meta Data Store)
   • Checks syntax
   • Checks columns
   • Returns storage locations
3. Databricks queries the storage services (cloud storage) for the raw data
4. Data is loaded into the memory of the nodes
5. Data is processed on the nodes using Spark SQL

"Result":
Data is written directly to the storage services
OR
Data is collected on the driver and returned to the client
Databricks Meta Data Store

= Apache Hive Metastore
Metadata of all SQL objects: Databases, Tables, Columns, …
Managed by Databricks
OR
Hosted externally (MSSQL / Azure SQL, MySQL / Azure MySQL)
Can be shared!
Types of Tables

Managed
• Stored inside Databricks (Azure Blob Storage)
• Filesystem not accessible from outside
• DROP TABLE also deletes the files!

Unmanaged
• Usually stored externally (Azure Blob Storage / Azure Data Lake Store / …)
• Can be shared with other services (sketch below)
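A minimal sketch of the two table types; column definitions and the mount path are illustrative:

-- Managed table: Databricks controls the storage location; DROP TABLE removes the files
CREATE TABLE DimProductManaged (ProductKey INT, ProductName STRING)

-- Unmanaged (external) table: data stays at a location you control and survives DROP TABLE
CREATE TABLE DimProductExternal (ProductKey INT, ProductName STRING)
USING PARQUET
LOCATION '/mnt/adls/tables/DimProductExternal'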
DEMO
Delta Lake – delta.io

Delta Lake is an open-source storage layer that brings ACID transactions to Apache Spark™ and big data workloads.

• ACID-compliant transactions
• Optimistic Concurrency Control
• Support for UPDATE / MERGE
• Time travel (query example below)
• Schema enforcement and evolution
• Across multiple files/folders
• Batch & streaming
• 100% compatible with Apache Spark
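Time travel can be queried directly in SQL; the version number and timestamp below are illustrative:

-- Query an older version of a Delta table by version number
SELECT * FROM DimProductDelta VERSION AS OF 1

-- Or by timestamp
SELECT * FROM DimProductDelta TIMESTAMP AS OF '2020-01-01'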


Delta Lake – CREATE TABLE

CREATE TABLE IF NOT EXISTS DimProductDelta
USING DELTA
PARTITIONED BY (ProductSubcategoryKey)
-- CLUSTERED BY (ProductKey) INTO 4 BUCKETS
LOCATION '/mnt/adls/tables/DimProductDelta'
TBLPROPERTIES ('myKey' = 'myValue')

Avoid defining columns explicitly – the schema is handled by the transaction log!
Clustering is not supported!


Delta Lake – UPDATE/DELETE/MERGE

Always results in new files! Even a DELETE!
Old files are invalidated via _delta_log
Operations are logged in _delta_log
Conflicts have to be handled by the user!
Can create A LOT of files!


Delta Lake – UPDATE

UPDATE DimProduct
SET Price = 1300
WHERE Product = 'PC'

Before (part-01.parquet, 3 rows):      After (part-02.parquet, 3 rows):
Product    Price                       Product    Price
Notebook   900 €                       Notebook   900 €
PC         1,500 €                     PC         1,300 €
Tablet     500 €                       Tablet     500 €

_delta_log:
000000000.json   "add":    { "path": "part-01.parquet", ... }
000000001.json   "remove": { "path": "part-01.parquet", ... },
                 "add":    { "path": "part-02.parquet", ... }
Delta Lake – DELETE

DELETE FROM DimProduct
WHERE Product = 'PC'

Before (part-01.parquet, 3 rows):      After (part-02.parquet, 2 rows):
Product    Price                       Product    Price
Notebook   900 €                       Notebook   900 €
PC         1,500 €                     Tablet     500 €
Tablet     500 €

_delta_log:
000000001.json   "add":    { "path": "part-01.parquet", ... }
000000002.json   "remove": { "path": "part-01.parquet", ... },
                 "add":    { "path": "part-02.parquet", ... }
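MERGE follows the same pattern – new files are written and the old ones are invalidated in _delta_log. A minimal sketch, assuming a hypothetical staging table StagingProduct with matching columns:

-- Upsert staged rows into the Delta table
MERGE INTO DimProduct AS target
USING StagingProduct AS source
ON target.Product = source.Product
WHEN MATCHED THEN UPDATE SET target.Price = source.Price
WHEN NOT MATCHED THEN INSERT (Product, Price) VALUES (source.Product, source.Price)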
Delta Lake – _delta_log
Delta Lake – _delta_log – Schema / Stats
Delta Lake – _delta_log – UPDATE
Delta Lake – Optimization

Manually with the OPTIMIZE command
Collapses many small files into few big files
Optimizes the current/latest version only!
Bin-packing / Z-ordering

OPTIMIZE events
WHERE date = 20200101
ZORDER BY (eventType)
Delta Lake – Clean-Up

Manually with the VACUUM command
Automatically with every INSERT/UPDATE/MERGE
Default retention period is 7 days
Can be changed via TBLPROPERTIES

VACUUM events
[RETAIN num HOURS] [DRY RUN]
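A concrete invocation, with an illustrative retention window (DRY RUN only lists the files that would be removed):

-- Preview which files would be removed when keeping 240 hours of history
VACUUM events RETAIN 240 HOURS DRY RUN

-- Actually remove them
VACUUM events RETAIN 240 HOURS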
Delta Lake – Table Properties

Clean-up settings:
ALTER TABLE DimProductDelta SET TBLPROPERTIES ('delta.deletedFileRetentionDuration' = '240 HOURS');
ALTER TABLE DimProductDelta SET TBLPROPERTIES ('delta.logRetentionDuration' = '240 HOURS');

Blocks deletes and modifications of a table:
'delta.appendOnly' = 'true'

Configures the number of columns for which statistics are collected:
'delta.dataSkippingNumIndexedCols' = '5'

Properties can also be set when the table is created:
CREATE TABLE DimProductDelta
USING DELTA
TBLPROPERTIES ('key1' = 'val1', 'key2' = 'val2', ...)
Delta Lake – New Meta Tables / DMVs

Get information about schema, partitioning, table size, …
DESCRIBE DETAIL myDeltaTable;

Get provenance information (operation, user, and so on) for each write to a table
DESCRIBE HISTORY myDeltaTable;
DEMO
Advanced SQL – Extensions

User Defined Functions
• Python or Scala

User Defined Aggregates
• Scala only

Session-level only!
Can be used globally when packaged as a JAR


DEMO
Advanced SQL – Sampling
Sampling

SELECT * FROM myTable TABLESAMPLE (10 ROWS)
SELECT * FROM myTable TABLESAMPLE (5 PERCENT)

Returns a representative sample
Useful for large tables


Advanced SQL – Security
• Cluster setup
• SQL permissions / table-level security
  • Requires Databricks Premium SKU
  • Set on cluster level – you need to control access to the cluster
• Privileges: SELECT, CREATE, MODIFY, READ_METADATA, CREATE_NAMED_FUNCTION, ALL PRIVILEGES (example below)
• Objects: CATALOG, DATABASE, TABLE, VIEW, FUNCTION, ANONYMOUS FUNCTION, ANY FILE
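A minimal sketch of granting and revoking privileges, assuming table access control is enabled on the cluster; the group name is illustrative:

-- Allow a group to read the table and its metadata
GRANT SELECT ON TABLE DimProductDelta TO `data-analysts`
GRANT READ_METADATA ON TABLE DimProductDelta TO `data-analysts`

-- Take a privilege away again
REVOKE SELECT ON TABLE DimProductDelta FROM `data-analysts`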
Advanced SQL – External Tables

Connect to any JDBC source
Exposed as a regular SQL table

CREATE TABLE myJdbcTable
USING org.apache.spark.sql.jdbc
OPTIONS (
  url "jdbc:<databaseServerType>://<jdbcHostname>:<jdbcPort>",
  dbtable "<jdbcDatabase>.myTable",
  user "<jdbcUsername>",
  password "<jdbcPassword>"
)
Distributed Processing – Round Robin

Source rows:
Row  Customer  Product  Sales
1    John      PC       100 €
2    Peter     PC       200 €
3    Karl      Printer  50 €
4    Mark      Phone    70 €
5    Karl      Printer  60 €
6    John      Scanner  150 €
7    John      Printer  80 €

Rows are distributed round robin, so rows of the same customer can land on different workers:
Worker 1 (rows 1, 3, 5, 7)  →  partial result: John 180 €, Karl 110 €
Worker 2 (rows 2, 4, 6)     →  partial result: Peter 200 €, Mark 70 €, John 150 €

Head node has to re-aggregate the partial results:
John 230 €, Karl 110 €, Peter 200 €, Mark 70 €
Distributed Processing – By Column

The same rows distributed by the Customer column – all rows of a customer land on the same worker:
Worker 1 (rows 1, 3, 5, 6, 7)  →  John 230 €, Karl 110 €
Worker 2 (rows 2, 4)           →  Peter 200 €, Mark 70 €

The partial results are already final; the head node only concatenates them:
John 230 €, Karl 110 €, Peter 200 €, Mark 70 €
Distributed Processing – Non-Additive

For a non-additive measure such as a distinct count of products, partial results cannot simply be summed:
Worker 1 (rows 1, 3, 5, 6, 7)  →  distinct products: PC, Printer, Scanner (3)
Worker 2 (rows 2, 4)           →  distinct products: PC, Phone (2)

The head node must combine the distinct value lists, not the counts:
PC, Printer, Scanner, Phone  →  4 distinct products (not 3 + 2 = 5)