
Migration Assessment & Strategy Workshop
Tech leaders are to the right of the Data Maturity Curve
From hindsight to foresight

(Diagram: the Data + AI Maturity curve, plotting competitive advantage against data maturity, moving from "supporting the business" to "powering the business".)

● Clean Data, Reports, Ad Hoc Queries, Data Exploration: What happened?
● Predictive Modeling: What will happen?
● Prescriptive Analytics: How should we respond?
● Automated Decision Making: Automatically make the best decision
Why not use a cloud data warehouse?
Data warehouses can't fully support all your data

● Limited support for unstructured data (audio/images/video), even though the data lake (ADLS / S3 / GCS) holds structured, semi-structured, and unstructured data
● Data is replicated from the data lake into the warehouse
● Lock-in to a proprietary format
Why not use a cloud data warehouse?
Data warehouses are inefficient at data transformation

● Up to 6x the cost for ELT workloads
● Not optimized for data engineering: pipelines become complex, with many stages
● Limited support for streaming
Why not use a cloud data warehouse?
Pay a premium for all workloads

● BI reports, dashboards & SQL: you pay warehouse compute cost for all data access, including ELT
● Incompatible security and governance models between the warehouse and the data lake
Why not use a cloud data warehouse?
A cloud data warehouse is NOT a modern data platform

● Not optimized for data science: model training, model serving, model scoring, and model deployment all sit outside the warehouse
● Data is duplicated yet again into a separate data lake for data science
● Disparate tooling decreases data team productivity
The Databricks Lakehouse Platform

✓ Single source of truth for all your data
✓ End-to-end ETL and streaming capabilities
✓ High-performance BI on your data lake
✓ First-class AI/ML capabilities and support
✓ Open, unified governance and security

(Diagram: integrated, collaborative role-based experiences with open APIs across BI & SQL, data engineering & streaming, and data science & ML; common security, governance, and administration; data processing and management built on open source and open standards; all on a cloud data lake holding structured, semi-structured, and unstructured data.)
Simple Migration Process

● Inefficient DE, ETL, and data sharing (Snowpipe / Snowpark, partner integrations)
  → Delta Lake / Delta Live Tables / Delta Sharing (industry-leading data engine, multiple-language support)
● Limited real-time event processing (batch processing in Snowpipe)
  → Databricks Jobs / Delta Lake / Structured Streaming (Spark Structured Streaming + Delta Lake: streaming + batch ingest); see the ingestion sketch below
● Slower, more expensive SQL / BI (SQL UI, BI connectors)
  → Databricks SQL (native SQL support, SQL endpoints, optimized BI integrations)
● Non-native DS / ML (Dataiku, DataRobot, external Snowpark scripts)
  → Databricks Machine Learning / MLflow (data-native, first-class DS/ML platform, multiple-language support)
● Required partner tools (filling in the gaps)
  → Open architecture (easy integration with all required tools, access to the open-source community)
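As one illustration of the streaming mapping above, here is a minimal sketch of Snowpipe-style file ingestion rebuilt on Auto Loader and Delta Lake. It assumes a Databricks notebook, where `spark` is predefined; the bucket paths and table name are hypothetical.

```python
# Incrementally ingest files from cloud storage into a Delta table.
# Auto Loader (format "cloudFiles") discovers new files as they arrive,
# giving streaming and batch ingest over the same Delta table.
stream = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "json")          # source files are JSON
    .option("cloudFiles.schemaLocation", "s3://example-bucket/_schemas/orders")
    .load("s3://example-bucket/landing/orders/")  # hypothetical landing path
)

(stream.writeStream
    .option("checkpointLocation", "s3://example-bucket/_checkpoints/orders")
    .trigger(availableNow=True)   # drain all available files, then stop
    .toTable("bronze.orders_raw"))
```

Dropping the trigger option turns the same job into a continuous stream, so no separate Snowpipe-style service is needed.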
Migration Methodology

Phase 1 (Discovery): migration discovery and consultation
Phase 2 (Assessment): assessment, design, tooling, accelerators, sizing, partners
Phase 3 (Strategy Proposal): technology mapping, migration workshop, overall migration planning
Phase 4 (TurnKey): reference implementation of a specific production use case; migration implementation plan
Phase 5 (Delivery Execution): migration execution and support

Delivery models: Databricks Migration Team, with or without a partner ● Databricks partner driven ● Databricks PS driven (Assurance Package to assist SIs)
Discovery & Assessment
Architectural Discovery - Snowflake
Begin with a review of the customer's current Snowflake architecture

❏ ETL environment
  ❏ Third-party tools? (Fivetran, Talend, dbt, etc.)
  ❏ Data velocity
  ❏ Data types
❏ BI processes
  ❏ BI tools
  ❏ Report/dashboard requirements
❏ ML use cases
  ❏ Scale of use cases
  ❏ Model requirements (languages, libraries, compute)
❏ Use our Snowflake Profiler to understand current workloads; partner analyzers for a deeper dive

Leave behind: the Snowflake Migration Scoping Questionnaire.
Pointers for discovery of the current Snowflake architecture

• Understand the landscape
  • Existing architecture
  • Pain points? (ETL, DS/ML, streaming, etc.)
  • Upstream and downstream teams / partners
• Consider the TCO of their architecture
  • Inefficiencies? Data movement or copying?
• Look for unserved use cases or teams
  • Current use cases not possible in Snowflake's Data Cloud
  • Future use cases you could serve in the Lakehouse
The Snowflake Profiler is an important step

The Snowflake Profiler is a notebook that runs in your environment to answer core questions:

● What is the breakdown of my Snowflake usage by category?
● Where are these workloads running? (Which warehouse? Which users?)
● How do we expect these costs to grow over time?

Reach out to your Partner Account Manager, Sales Leader, or email partners@databricks.com to engage your Databricks counterparts.
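The profiler itself is distributed by Databricks, but the kind of usage breakdown it produces can be sketched directly against Snowflake's standard ACCOUNT_USAGE views. The connection parameters below are hypothetical placeholders; the query groups the last 30 days of activity by warehouse, user, and query type.

```python
# A minimal sketch of a usage-breakdown query, not the profiler itself.
import snowflake.connector

conn = snowflake.connector.connect(
    account="your_account",    # hypothetical placeholders
    user="your_user",
    password="your_password",
)

USAGE_BY_CATEGORY = """
SELECT
    warehouse_name,
    user_name,
    query_type,                                -- e.g. SELECT, COPY, MERGE
    COUNT(*)                      AS query_count,
    SUM(total_elapsed_time) / 1e3 AS total_seconds
FROM snowflake.account_usage.query_history
WHERE start_time >= DATEADD(day, -30, CURRENT_TIMESTAMP())
GROUP BY 1, 2, 3
ORDER BY total_seconds DESC
"""

for row in conn.cursor().execute(USAGE_BY_CATEGORY):
    print(row)
```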
Align on Target Architecture
Designing a Well Architected Lakehouse
Guiding Principles for the Lakehouse

● Curate data and offer trusted data-as-products
● Remove data silos and minimize data movement
● Democratize value creation through self-service experiences
● Adopt an organization-wide data governance strategy
● Encourage the use of open interfaces and open formats
● Build to scale and optimize for performance & cost
Cloud Data Analytics Framework

Personas: Data Engineer, ML Engineer, Data Scientist, Business Analyst, Business Partners

● Consumption layer: ETL & DS tools, BI tools
● Engine layer: workflow management; ingest & transform; advanced analytics, ML & AI; data warehouse; data sharing
● Governance layer: data governance
● Storage layer: cloud storage
Cloud Data Analytics Framework on Databricks

Personas: Data Engineer, ML Engineer, Data Scientist, Business Analyst, Business Partners

● Consumption: IDE support, notebooks, SQL editor, BI tools
● Engine:
  ● Workflow management: Workflows (Jobs, DLT)
  ● Ingest & transform: Auto Loader and DLT, with data quality enforced in DLT
  ● Advanced analytics, ML & AI: ETL runtime, ML runtime, Model Serving
  ● Data warehouse: Databricks SQL warehouse
  ● Data sharing: Delta Sharing, SQL connectors
  ● Batch and streaming runtimes, accelerated by Photon
● Governance: Unity Catalog
● Storage: cloud storage (Amazon S3, ADLS, Google Cloud Storage); supported open formats include Delta, Parquet, JSON, CSV, Avro, images, and more, versus proprietary DWH file storage
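To make the ingest-and-transform layer concrete, here is a minimal Delta Live Tables sketch, assuming it is deployed as a DLT pipeline (where the `dlt` module and `spark` are available); the source path and column names are hypothetical.

```python
import dlt
from pyspark.sql.functions import col

@dlt.table(comment="Raw orders ingested with Auto Loader")
def orders_bronze():
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "json")
        .load("s3://example-bucket/landing/orders/")  # hypothetical path
    )

@dlt.table(comment="Cleansed orders with a data quality expectation")
@dlt.expect_or_drop("valid_order_id", "order_id IS NOT NULL")
def orders_silver():
    # Read the upstream table as a stream and conform the types.
    return dlt.read_stream("orders_bronze").select(
        col("order_id").cast("bigint"),
        col("amount").cast("double"),
        col("order_ts").cast("timestamp"),
    )
```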
Databricks Lakehouse Reference Architecture
The Databricks Lakehouse Platform can support your core data workloads

(Diagram: reference architecture; an existing data warehouse remains optional.)
Different Data Models on the Lakehouse
The Lakehouse supports any data model

Data models are organizational processes, and you cannot buy a process. However, the right platform capabilities ease the implementation of a particular data model:

● The Databricks Lakehouse has the technical enablers for teams to produce and consume data in a centralized, or decentralized but governed, way
● The Lakehouse is a polyglot technology that works with any data modeling concept
● The Lakehouse applies at all scales, from startups to large organizations
● The medallion architecture can fit into whatever strategy you want (a minimal flow is sketched below):

BRONZE: replica of the source, mainly for landing, archiving, re-processing, and lineage purposes
SILVER: integration layer; more normalized (2NF-like) / Data Vault-like / write-optimized
GOLD: star schema / Kimball / read-optimized
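As referenced above, a minimal sketch of the medallion flow on Delta Lake, assuming a Databricks notebook where `spark` is predefined and the bronze/silver/gold schemas already exist; all paths, tables, and columns are hypothetical.

```python
from pyspark.sql.functions import col, sum as sum_

# Bronze: land a replica of the source, kept for archiving and re-processing.
(spark.read.json("s3://example-bucket/raw/orders/")
    .write.format("delta").mode("append").saveAsTable("bronze.orders_raw"))

# Silver: cleansed, deduplicated, write-optimized integration layer.
(spark.table("bronze.orders_raw")
    .filter(col("order_id").isNotNull())
    .dropDuplicates(["order_id"])
    .write.format("delta").mode("overwrite").saveAsTable("silver.orders"))

# Gold: read-optimized aggregate for BI consumers.
(spark.table("silver.orders")
    .groupBy("customer_id")
    .agg(sum_("amount").alias("lifetime_value"))
    .write.format("delta").mode("overwrite").saveAsTable("gold.customer_ltv"))
```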
A Well Architected Lakehouse: Dimensional Modeling

1. Staging
   a. Raw data in its original format (temporarily)
2. Ingestion
   a. Raw data converted to Delta (from the Avro, CSV, Parquet, XML, or JSON format in landing)
3. Integration (physical data model)
   a. Detailed information covering multiple subject areas
   b. Integrates all data sources
   c. Does not necessarily use a dimensional model, but feeds dimensional models
4. Data Mart
   a. Subset of the integration layer, sometimes aggregated data
   b. Focus on dimensional modeling with a star schema
   c. Typically oriented to a specific business line or team

(Diagram: staging of temporary raw data from sources such as IoT and social media; landing via Auto Loader; ingestion of raw data into bronze; the physical data model in silver; product- and customer-domain data feeding dimensional models and a sales data mart in gold. The star schema example places an Order fact table among Customer, Product, and Time dimensions.)

See also: Implementing Data Modeling Techniques in the Databricks Lakehouse Platform; Data Modeling Best Practices in the Databricks Lakehouse Platform.
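Continuing the hypothetical names from the medallion sketch above, the data mart step might build gold-layer dimension and fact tables like this; the silver tables are assumed to exist.

```python
# Dimension: one row per customer, read-optimized for BI.
spark.sql("""
    CREATE OR REPLACE TABLE gold.dim_customer AS
    SELECT customer_id, customer_name, region
    FROM silver.customers
""")

# Fact: one row per order, keyed to the surrounding dimensions.
spark.sql("""
    CREATE OR REPLACE TABLE gold.fact_order AS
    SELECT order_id,
           customer_id,                           -- joins to gold.dim_customer
           product_id,                            -- joins to a product dimension
           CAST(order_ts AS DATE) AS order_date,  -- joins to a time dimension
           amount
    FROM silver.orders
""")
```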
A Well Architected Lakehouse: Data Vault 2.0

1. Staging
   a. Raw data in its original format (temporarily)
2. Ingestion
   a. Raw data converted to Delta (from the Avro, CSV, Parquet, XML, or JSON format in landing)
3. Integration (Raw Vault): data is modeled as
   a. Hubs (unique business keys)
   b. Links (relationships and associations)
   c. Satellites (descriptive data)
4. Integration (Business Vault): tables with applied business rules, data quality rules, cleansing and conforming rules
   a. Business views
   b. Point-in-Time (PIT) tables (optional)
   c. Bridge tables created on top of the Business Vault (optional)
5. Presentation (Information Marts)
   a. Similar to a classical data mart, with data that has been cleansed and harmonized
   b. Consumer-oriented models (typically views)

(Diagram: staging and landing feed ingestion into bronze; the Raw Vault and Business Vault, with hubs, links, satellites, PIT and bridge tables, form silver; Information Mart views serve SQL consumers from gold. The example Data Vault 2.0 model connects Customer, Product, and Order hubs through links, each hub carrying satellites, populated via ETL/ELT.)
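A minimal sketch of deriving a Raw Vault hub and satellite on Delta Lake, using the common Data Vault convention of hashed business keys; the table, column, and source-system names are hypothetical.

```python
from pyspark.sql.functions import sha2, col, current_timestamp, lit

staged = spark.table("bronze.customers_raw")

# Hub: one row per unique business key, with a deterministic hash key.
hub = (staged
    .select("customer_id").distinct()
    .withColumn("hub_customer_hk", sha2(col("customer_id").cast("string"), 256))
    .withColumn("load_ts", current_timestamp())
    .withColumn("record_source", lit("crm")))   # hypothetical source system
hub.write.format("delta").mode("append").saveAsTable("silver.hub_customer")

# Satellite: descriptive attributes keyed by the hub hash key.
sat = (staged
    .withColumn("hub_customer_hk", sha2(col("customer_id").cast("string"), 256))
    .select("hub_customer_hk", "customer_name", "region")
    .withColumn("load_ts", current_timestamp()))
sat.write.format("delta").mode("append").saveAsTable("silver.sat_customer_details")
```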
A Well Architected Lakehouse: The Data Mesh

1. Data domain ownership: the Lakehouse provides an open, flexible architecture that allows for distributed ownership of data assets by each domain
2. Data as a product: Delta provides an open and standard format for FAIR data, and Delta Live Tables allow for high-quality, reliable data pipelines
3. Self-service infrastructure platform: Databricks is a unified platform that can automate data processes through the use of Workflows, Terraform, and other tools
4. Federated governance: Unity Catalog ensures global data discovery, access, and lineage

* Batch process with CDF, DLT, …
** Currently lineage is restricted per …